Chaos Engineering In Action – Part 1

[Image: typical reaction to hearing about Chaos Engineering]

The image above shows how people typically react when they hear about “Chaos Engineering”. Let me first explain what Chaos Engineering is and how it is going to change the paradigm in IT engineering, especially in microservice architectures, where it is very hard to observe a system and build resilience into it without a service mesh in place.

Chaos Engineering originated at Netflix, where the Chaos Monkey tool was used to kill random instances in an AWS Auto Scaling group (ASG) to test system resiliency. Netflix open-sourced the tool and conveyed the benefits of Chaos Engineering to the industry. Many companies have since started development around this concept, and there are now many similar tools.
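
To make the idea concrete, here is a minimal in-memory sketch of "kill one random instance from a group". It is not Chaos Monkey's actual code and it does not call AWS; the instance IDs and function names are made up for illustration.

```python
import random


def pick_victim(instances, rng=None):
    """Return one instance ID chosen at random, or None if the group is empty."""
    rng = rng or random.Random()
    return rng.choice(instances) if instances else None


def terminate_random_instance(group, rng=None):
    """Remove one random instance from the in-memory group and return its ID."""
    victim = pick_victim(group, rng)
    if victim is not None:
        group.remove(victim)
    return victim


if __name__ == "__main__":
    # Hypothetical instance IDs standing in for an Auto Scaling group.
    asg = ["i-aaa111", "i-bbb222", "i-ccc333"]
    killed = terminate_random_instance(asg)
    print(f"terminated {killed}; remaining: {asg}")
```

In a real setup the Auto Scaling group would replace the terminated instance, and the interesting question is whether users notice anything while that happens.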

This new engineering domain is now taking shape, as the many open-source tools around it show. Of these, I think the Chaos Toolkit project takes the right approach: it is being developed as a standard with a modular design, so anyone can add new plugins/modules to handle chaos on a specific platform.

For example, to create chaos on Kubernetes there is a module named chaostoolkit-kubernetes, for Pivotal Cloud Foundry there is a module/plugin named chaostoolkit-cloud-foundry, and similar modules/plugins are available for major platforms such as AWS (chaostoolkit-aws), Google Cloud, etc.
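
As a rough sketch of how a platform module plugs in, an experiment action can point at a Python function shipped by the plugin. The module path `chaosk8s.pod.actions`, the function `terminate_pods`, and its arguments below are written from memory of the chaostoolkit-kubernetes plugin and should be checked against its documentation; the label selector is a placeholder.

```python
import json

# Assumed names: "chaosk8s.pod.actions" / "terminate_pods" and the argument
# names are recalled from the chaostoolkit-kubernetes plugin; verify them
# against the plugin's docs before relying on this.
terminate_pod_action = {
    "type": "action",
    "name": "terminate-one-app-pod",
    "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
            "label_selector": "app=my-service",  # hypothetical label
            "ns": "default",
            "qty": 1,
            "rand": True,
        },
    },
}

if __name__ == "__main__":
    # Print the action as it would appear inside an experiment's "method" list.
    print(json.dumps(terminate_pod_action, indent=2))
```

Swapping platforms then mostly means swapping the `module` and `func` values for the ones provided by another plugin.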

The following are the major features of Chaos Toolkit that any organization can make use of.

[Image: major features of Chaos Toolkit]

ChaosIQ contributed some good work in the form of ChaosHub (now ChaosPlatform). They are trying to build an enterprise solution around this concept and have contributed their code to the chaostoolkit project. ChaosPlatform is a solution that an organization can use to track every action taken by Chaos Toolkit, with execution logs, failure logs, etc. available in one central place.

Chaos Engineering Phases:

The following are the 5 phases of Chaos Engineering. You mostly work on the first 3: define the steady state, form a hypothesis about the failure you would like to introduce to your application/platform, and design an experiment around it. For example, there is an experiment on the chaostoolkit GitHub that expires an SSL certificate and shows the results, and you can easily play around with it for learning.

[Image: the 5 phases of Chaos Engineering]
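
The first three phases described above (steady state, hypothesis, experiment) can be sketched as a small loop. This is only an illustration of the idea, not how Chaos Toolkit is implemented; the `Experiment` class, the `run` function, and the toy "replicas" system are all invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    title: str
    steady_state: Callable[[], bool]    # True while the system looks healthy
    inject_failure: Callable[[], None]  # the chaos we introduce
    rollback: Callable[[], None]        # undo the failure afterwards


def run(experiment: Experiment) -> str:
    """Check steady state, inject the failure, re-check, and always roll back."""
    if not experiment.steady_state():
        return "aborted: system not in steady state before the experiment"
    try:
        experiment.inject_failure()
        held = experiment.steady_state()
    finally:
        experiment.rollback()
    return "hypothesis held" if held else "deviation found: weakness discovered"


if __name__ == "__main__":
    # Toy system: the service is healthy while at least one replica is up.
    replicas = {"a": True, "b": True}
    exp = Experiment(
        title="Service survives the loss of one replica",
        steady_state=lambda: any(replicas.values()),
        inject_failure=lambda: replicas.update(a=False),
        rollback=lambda: replicas.update(a=True),
    )
    print(run(exp))
```

The point of checking the steady state before injecting anything is that an experiment run against an already unhealthy system tells you nothing.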

What is Chaostoolkit?

Chaos Toolkit is a project whose mission is to provide a free, open and community-driven toolkit and API to all the various forms of chaos engineering tools that the community needs.

Why the Chaos Toolkit?

The Chaos Toolkit has two main purposes:

  • To provide a full chaos engineering implementation that simplifies the adoption of chaos engineering by providing an easy starting point for applying the discipline.
  • To define an open API with the community so that any chaos experiment can be executed consistently using integration with the many commercial, private and open source chaos implementations that are emerging.

How to use Chaostoolkit to execute chaos on any platform?

Write an experiment file in JSON that describes the 3 states of the system according to the principles of Chaos Engineering, then run the experiment like this:

$ chaos run experiment.json
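
For readers who have not seen an experiment file, here is a minimal sketch that builds one in Python and writes it out as experiment.json. The top-level sections (`steady-state-hypothesis`, `method`, `rollbacks`) follow the Chaos Toolkit experiment format; the URL, the service name, and the `systemctl` commands are placeholders I made up, so adjust them to your own system.

```python
import json

# Hypothetical experiment: the health URL and the "my-dependency" service
# are placeholders, not taken from any real setup.
experiment = {
    "version": "1.0.0",
    "title": "Service stays available when one dependency is stopped",
    "description": "Illustrative example only.",
    "steady-state-hypothesis": {
        "title": "Service responds with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "service-must-respond",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stop-dependency",
            "provider": {
                "type": "process",
                "path": "systemctl",
                "arguments": "stop my-dependency",
            },
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "start-dependency",
            "provider": {
                "type": "process",
                "path": "systemctl",
                "arguments": "start my-dependency",
            },
        }
    ],
}


def write_experiment(path):
    """Serialize the experiment dictionary to a JSON file."""
    with open(path, "w") as fh:
        json.dump(experiment, fh, indent=2)


if __name__ == "__main__":
    write_experiment("experiment.json")
```

The steady-state hypothesis is checked before and after the method runs, and the rollbacks put the system back the way it was.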

New tools are popping up in this domain almost daily, and there are already more than 50 of them. The ones listed below are some of the latest and best known, and are the most relevant to a current technology stack. Pick according to your requirements, but keep an eye on features: Chaos Monkey, for example, is limited to AWS ASG, whereas ChaosToolkit can be run against any platform. Some providers have started adding application knowledge to chaos failures, and Gremlin has now added application-level chaos to its library. The image below is a little outdated; it comes from a T-Mobile demo available on YouTube. I will give more of an introduction to Gremlin in another blog post, as it is one of the leading vendors in the chaos domain for application integration.

[Image: chaos engineering tools, from the T-Mobile demo]

• SimianArmy

• Chaos Monkey (AWS ASG)

• Kube Monkey (k8s)

• ChaosToolkit (for all platforms)

• ChaosPlatform (ChaosHub) (tracing chaos)

• Istio (for any microservice on k8s/OCP)

• PowerfulSeal (for Kubernetes pods) by Bloomberg

• Goldpinger (k8s graph) by Bloomberg

Future of Chaos Engineering:

Believe me, this is one of the upcoming technology tags (#chaos #chaosengineering) that you will find everywhere in the near future, and it is going to be part of every SRE job description. Early implementation will give a competitive edge over other organizations: those who adopt it early will mature early, will be able to state the resiliency and reliability of their systems, and will be more confident in their solutions and platforms, which will lead to more market share.

We have also started seeing new jobs in this area: Netflix, for example, hires Reliability or Resilience Engineers, and some leading companies have started asking for chaos training for their internal staff. Soon you will find new organizational structures built around Chaos Engineering.

References:

  1. chaostoolkit.slack.com (Slack channel)
  2. https://chaosconf.splashthat.com/ (upcoming conference)
  3. https://principlesofchaos.org/ (original principles)
  4. chaosengineering.slack.com (Slack channel)
  5. https://chaostoolkit.org/ (Chaos Toolkit / GitHub)

Some people say that their product is not stable, and that in its current state they don’t want to break their system or start with Chaos Engineering. To them I say: “keep the following quote in mind and rethink your statement”.

[Image: quote]

I recommend joining the Chaos Engineering and Chaos Toolkit Slack channels to see what's happening in this domain. There is a lot to write on this topic, so I will cover more in my upcoming blogs; keep an eye out for Chaos Engineering In Action – Part 2.