Chaos engineering is the discipline of experimenting on a system or application to reveal its weaknesses and the points at which it fails. These are often failure modes you did not anticipate while building it. You deliberately cause failures, observe how the system responds, and then fix what you find, so that the system and application become more resilient. Many well-known organizations, such as Netflix, LinkedIn, and Facebook, practice chaos engineering to better understand their microservices architectures and distributed systems. It helps them find new issues before real users complain and take corrective action early. That’s part of how these organizations can serve millions of users, increase their productivity, and save millions of dollars 🤑.
Benefits of Chaos Engineering:
- Control losses on revenue by finding critical issues
- Reduction in system or application failure
- Better user experience with less disruption and high service availability
- It helps you learn about the system and gain confidence
How confident are you in your production reliability? Is it really disaster-proof? Let’s find out with the help of the following popular chaos testing tools.
Chaos Mesh
Chaos Mesh is a chaos engineering management solution that injects faults into every layer of a Kubernetes system, including pods, the network, system I/O, and the kernel. It can automatically kill Kubernetes pods, simulate latency, disrupt pod-to-pod communication, and simulate read/write errors. You can schedule experiments and define their scope, and each experiment is specified in a YAML file. Chaos Mesh has a dashboard for viewing analytics on experiments. It runs on top of Kubernetes and supports most cloud platforms. It is open-source and was accepted as a CNCF sandbox project. Following chaos engineering principles, you can add Chaos Mesh to your DevOps workflow to build resilient applications.
Chaos Mesh features:
- Easily deployable on Kubernetes clusters with no modification in deployment logic
- No unique dependencies are required for deployment
- Defines chaos objects using CustomResourceDefinitions (CRD)
- Provides a dashboard to track all the experiments
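To make the idea of a declaratively specified experiment concrete, here is a minimal sketch that builds a pod-kill experiment as a plain data structure and prints it as JSON, which kubectl accepts alongside YAML. The field names follow the Chaos Mesh `v1alpha1` PodChaos resource as commonly documented, so check them against the version you install. The experiment name, the namespaces, and the `app: web` label are hypothetical.

```python
import json


def pod_kill_manifest(name, namespace, target_namespace, labels):
    """Build a PodChaos object that kills one randomly chosen matching pod.

    All argument values are illustrative; adapt them to your cluster.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",  # the fault to inject
            "mode": "one",         # affect a single pod from the selection
            "selector": {          # scope of the experiment
                "namespaces": [target_namespace],
                "labelSelectors": labels,
            },
        },
    }


# Hypothetical names: "chaos-testing" holds the experiment, "demo" holds the targets.
manifest = pod_kill_manifest(
    name="kill-one-web-pod",
    namespace="chaos-testing",
    target_namespace="demo",
    labels={"app": "web"},
)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and applying it with kubectl would create the experiment, assuming Chaos Mesh and its CRDs are already installed in the cluster.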
Chaos Toolkit
Chaos Toolkit is a simple, open-source tool for automating chaos engineering experiments. You integrate it with your system through a set of drivers or plugins; it supports AWS, Google Cloud, Slack, Prometheus, and more.
Chaos Toolkit features:
- Provides a declarative, open API to create chaos experiments independent of a vendor or technology
- Can be easily embedded in CI/CD pipelines for automation
- Commercial and enterprise support is also available through ChaosIQ
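A Chaos Toolkit experiment is a declarative JSON (or YAML) document. The sketch below assembles one in Python and prints it, so the overall shape is visible: a steady-state hypothesis, a method that injects the fault, and rollbacks. The service URL, the container name `web-1`, and the use of `docker stop` as the fault are all hypothetical choices for illustration.

```python
import json

# Hypothetical target: a container named "web-1" behind a local health endpoint.
experiment = {
    "title": "Service stays available when one instance stops",
    "description": "Stop one instance and check that the health endpoint still answers.",
    "steady-state-hypothesis": {
        "title": "Health endpoint answers with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stop-one-instance",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": "stop web-1",
            },
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "start-instance-again",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": "start web-1",
            },
        }
    ],
}

print(json.dumps(experiment, indent=2))
```

Saved as `experiment.json`, a document like this would typically be executed with `chaos run experiment.json`. The toolkit checks the hypothesis before and after the method, which is what makes the experiment repeatable in a pipeline.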
Chaoskube
As you can guess from the name, it is for Kubernetes. Chaoskube is an open-source chaos tool that periodically kills random pods in a Kubernetes cluster. It helps you understand how your system reacts when a pod fails. By default, it kills a pod in any namespace every 10 minutes. You can filter the target pods using namespaces, labels, annotations, etc. It can be easily installed using its Helm chart.
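The filter-then-pick-at-random behavior described above can be sketched in a few lines. This is not Chaoskube's code or API, only an illustration of the idea; the pod records and the `pick_victim` helper are hypothetical, and no cluster is contacted.

```python
import random


def matches(pod, namespaces=None, labels=None):
    """Return True if the pod passes the namespace and label filters."""
    if namespaces and pod["namespace"] not in namespaces:
        return False
    wanted = labels or {}
    return all(pod["labels"].get(key) == value for key, value in wanted.items())


def pick_victim(pods, namespaces=None, labels=None, rng=random):
    """Choose one random pod among those that match, or None if none match."""
    candidates = [pod for pod in pods if matches(pod, namespaces, labels)]
    return rng.choice(candidates) if candidates else None


# Hypothetical inventory; a real tool would list pods through the Kubernetes API.
pods = [
    {"name": "web-1", "namespace": "demo", "labels": {"app": "web"}},
    {"name": "web-2", "namespace": "demo", "labels": {"app": "web"}},
    {"name": "db-1", "namespace": "demo", "labels": {"app": "db"}},
    {"name": "dns-1", "namespace": "kube-system", "labels": {"app": "dns"}},
]

victim = pick_victim(pods, namespaces={"demo"}, labels={"app": "web"})
print("would terminate:", victim["name"] if victim else "nothing matched")
```

A scheduler would call `pick_victim` once per interval (every 10 minutes, by the default mentioned above) and delete the returned pod. Narrowing the filters is the simplest way to keep the blast radius small.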
Chaos Monkey
Chaos Monkey is a tool that checks the resilience of cloud systems by deliberately causing failures and observing how the systems react. Netflix created it to test the resiliency and recoverability of its AWS infrastructure. It is named Chaos Monkey because it causes destruction like a wild, armed monkey let loose in the system. Chaos Monkey also gave rise to the engineering practice now known as chaos engineering. It is built on the principle that it is better to fail often in small, controlled ways than to suffer one significant failure without warning.
Chaos Monkey features:
- It helps you prepare for random instance failures
- Encourages redundancy for unexpected failures
- Uses Spinnaker to enable cross-cloud compatibility
- Provides configurable schedule to simulate failures
- Integrated with govendor to add any new dependencies to Chaos Monkey
Simmy
Simmy is a fault-injection chaos tool that integrates with the Polly resilience project for .NET. It lets you create chaos-injection policies through Polly and execute your code through them. It offers several policy types, such as an exception policy that injects exceptions into the system and a behavior policy that injects arbitrary new behavior. These policies inject their faults at random.
Simmy features:
- Provides Monkey policies, or chaos policies, to inject chaos
- Makes it easy to test any dependency failure
- Helps you revert to the working model quickly and control the blast radius
- It is production-grade ready
- Can also define failures based on external factors (for example, failures due to global configuration)
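The random injection performed by an exception policy can be illustrated with a small Python analogue. Simmy itself is a .NET library, so this is not its API; the `inject_exception` decorator and its parameter names are hypothetical and only loosely mirror the idea of an injection rate and an enable switch.

```python
import functools
import random


def inject_exception(fault_factory, injection_rate, enabled=lambda: True, rng=random.random):
    """Wrap a function so that a fraction of calls raise an injected fault.

    fault_factory  -- zero-argument callable returning the exception to raise
    injection_rate -- probability between 0.0 and 1.0 that a call is sabotaged
    enabled        -- switch for turning chaos off quickly (limits the blast radius)
    rng            -- source of random numbers in [0.0, 1.0), replaceable in tests
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled() and rng() < injection_rate:
                raise fault_factory()
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_exception(lambda: TimeoutError("injected by chaos policy"), injection_rate=0.2)
def fetch_inventory():
    """Stand-in for a call to a real dependency."""
    return {"sku-1": 3}


failures = 0
for _ in range(1000):
    try:
        fetch_inventory()
    except TimeoutError:
        failures += 1
print(f"{failures} of 1000 calls failed by injection")
```

Roughly one call in five fails here, which exercises whatever retry or fallback logic wraps the dependency. Making `enabled` read a configuration flag is one way to switch the chaos off without redeploying.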
Pystol
Pystol is a tool for injecting faults into cloud-native environments. It watches events in etcd through Kubernetes operators. When a fault-injection action is executed, the operators create the pods and run a set of Ansible collections, so developers do not need to write their own actions. Pystol provides ready-made actions to test the system. If a developer still wants to create a new action, it can be written in Go or Python. It provides a continuous integration dashboard that gives a summary view of all job operations. You can run Pystol locally or deploy it in a container using its Docker image. Pystol provides two interfaces: a web UI and a CLI. The web UI is generally the easier of the two to use.
Muxy
Muxy is a proxy that helps you test your resilience and fault-tolerance patterns against real-world distributed system failures. It can tamper with traffic at the transport level (layer 4), the TCP session level (layer 5), and the HTTP protocol level (layer 7).
Muxy features:
- Modular architecture and easily extensible
- Has official Docker container
- Easy to install, no dependencies required
- Ideal for continuous testing of resilience
- Simulates network connectivity issues for distributed systems and mobile devices
Pumba
Pumba is a command-line tool that performs chaos testing for Docker containers. With Pumba, you deliberately crash the application’s Docker containers to see how the system reacts. You can also stress-test container resources such as CPU, memory, file system, and input/output. Pumba can run on a Kubernetes cluster as well, where you deploy it to the nodes using a DaemonSet. To run several Pumba commands, you can use multiple Pumba containers in the same DaemonSet.
ChaosBlade
ChaosBlade is an open-source fault-injection tool from Alibaba. It draws on the failures Alibaba has faced over the last ten years and applies the best practices developed to avoid them. It follows chaos engineering principles to check the fault tolerance of distributed systems.
ChaosBlade features:
- Provides experimental scenarios for multiple resources such as CPU, network, memory, disk, etc.
- Provides experimental scenarios for nodes, networks, and pods on the Kubernetes platform
- Provides easy-to-use CLI commands to execute experiments
Litmus
Litmus follows cloud-native chaos engineering principles. Its mission is to deliver a complete framework for finding weaknesses in your Kubernetes systems and in the applications running on them. It is built around a chaos operator and a set of CRDs (CustomResourceDefinitions), which give it plug-and-play capability. The idea is to package your chaos logic in a Docker image, hand it to the Litmus framework, and let the CRDs orchestrate it.
Litmus features:
- Helps site reliability engineers and developers to find weaknesses in the Kubernetes system
- Provides ready-to-use generic experiments
- Provides Chaos API for chaos workflow management
- Litmus SDK supports Go, Python, and Ansible to create your own experiments
Gremlin
Gremlin helps engineers build more resilient software. It provides a platform for running chaos engineering experiments safely, securely, and simply. With Gremlin, you can thoughtfully inject failure into hosts or containers wherever they run, whether that’s the public cloud or your own data center.
Gremlin features:
- Installs a lightweight agent on your hosts or containers to inject failures
- Provides 10+ different infrastructure attack modes
- State gremlins let you manipulate system time, shut down or restart hosts, and kill processes
- Network gremlins can inject latency, introduce packet loss, or drop traffic
- Gremlin’s ALFI library attacks can be configured, started, and stopped via the web app, API, or CLI
- Allows you to precisely target the blast radius you want to attack
- Allows you to halt all attacks and roll the system back to a steady state
Steadybit
Steadybit aims to reduce downtime proactively and provide visibility into system issues. You can run it on your own infrastructure or use it in the cloud as a service (SaaS). To use Steadybit, you define the situation, simulate the experiments, execute the simulated experiments on production, and then automate all the experiments. It runs intelligent agents on your system to discover potential issues and weaknesses, and it integrates with multiple systems with ease.
Conclusion
Go ahead and be brave enough to apply chaos engineering principles and test your production environment with the tools mentioned above. These tools will help you find previously unidentified weaknesses in your system and make it more resilient.