What is Chaos Engineering?
Before looking at what chaos engineering is and which tools support it, it helps to understand the impact of application failures. Organizations aim to increase revenue, grow the business, and achieve operational excellence by scaling their digital capabilities. As systems grow more complex, random glitches have become harder to predict and harder for companies to afford. These failures affect a company’s bottom line, which makes downtime a key performance indicator for engineers. A glitch can be a network fault in one of the data centers, a misconfigured server, an unexpected node failure, or any other failure that propagates across systems. Such outages can be catastrophic, damaging an organization both financially and in reputation.
A single outage can cost a company millions of dollars. According to Gartner, the average cost of IT downtime is $5,600 per minute, or roughly $336,000 per hour. Because each business operates differently, the cost of downtime can range from $140,000 to $540,000 per hour. Rather than waiting for an outage to happen, organizations should proactively identify system weaknesses and apply chaos engineering practices to mitigate the risks.
Chaos engineering studies how large-scale systems respond to random events. It is a disciplined approach to identifying failures before they become outages. By testing how a system responds under stress, engineers can quickly identify and fix faults. The frequency of production releases has increased drastically, yet applications must remain reliable and keep meeting their SLAs for measures such as availability and customer satisfaction. The ultimate purpose of chaos engineering is to limit the chaos of outages caused by random events by carefully investigating ways to make a system more robust. Traditional reliability engineering practices such as incident management and service recovery procedures may not be enough on their own to minimize the impact of failures. In chaos engineering, planned experiments are run against a system to observe how it responds when a failure occurs. According to Gartner, by 2024 more than 50% of large enterprises will use chaos engineering practices against their digital capabilities to approach 99.999% availability.
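To make an availability target such as 99.999% concrete, it can be converted into a yearly downtime budget. The short Python sketch below does that arithmetic; the function name and the choice of a 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year (assumed)


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")
```

At 99.999% the budget is about 5.26 minutes per year, which leaves very little room for an unplanned outage.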
Chaos engineering originated at Netflix, which needed to be resilient against random host failures while migrating to AWS (Amazon Web Services). This led Netflix to create Chaos Monkey in 2010. In 2011, the Simian Army added further failure injections on top of Chaos Monkey, allowing more failure states to be tested and more resilience to be built. In 2014, Netflix introduced a new role, the Chaos Engineer, and announced Failure Injection Testing (FIT), a tool built on the Simian Army concepts to make systems resilient against random events. Gremlin later brought similar failure injection ideas to market as a commercial product. With many organizations moving to cloud and microservice architectures, the need for chaos engineering has increased in recent years. Large technology companies such as Amazon, Netflix, LinkedIn, Facebook, Microsoft, and Google practice chaos engineering to improve the reliability of their systems.
Chaos Engineering Principles
Chaos engineering works on the principle of running thoughtful experiments within a system to gain insight into how it responds to failures. The process is similar to how a flu vaccine works: the vaccine stimulates the body’s immune system to generate antibodies that help fight the flu virus. In the same way, a chaos experiment injects a controlled failure so that the system can build resilience before a real failure occurs. There are three steps involved:
The first step is planning. An application team consisting of architects, developers, testers, and support engineers identifies a fault that can be injected and hypothesizes the expected outcome by mapping it to IT or business metrics. Questions such as “What could go wrong?”, “What if component X fails?”, and “What will happen if my server runs out of memory?” help the team arrive at possible scenarios. A degree of pessimism improves the overall scenario coverage. The team should create a hypothesis backlog that records how the application will fail, the impact, the measurement criteria, the restoration procedures, and so on. Techniques such as brainstorming and analysis of incident logs can be used to build the backlog. Because it is rarely practical to invest time and budget in avoiding every type of failure, the backlog items are then prioritized by likelihood of occurrence and impact of failure.
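A hypothesis backlog prioritized by likelihood and impact could be represented as follows. This is a minimal Python sketch: the `Hypothesis` class, the 1 to 5 scoring scales, and the likelihood-times-impact score are illustrative assumptions, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    fault: str             # the failure to inject
    expected_outcome: str  # what the team expects the system to do
    likelihood: int        # assumed scale: 1 (rare) to 5 (frequent)
    impact: int            # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk_score(self) -> int:
        # Simple assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritize(backlog):
    """Return backlog items ordered from highest to lowest risk score."""
    return sorted(backlog, key=lambda h: h.risk_score, reverse=True)


backlog = [
    Hypothesis("Single node failure", "Load balancer removes the node", 4, 2),
    Hypothesis("Server runs out of memory", "Requests fail over to a healthy node", 3, 4),
    Hypothesis("DNS outage", "Cached records keep the service reachable", 2, 5),
]

for item in prioritize(backlog):
    print(item.risk_score, item.fault)
```

A real backlog would also carry the measurement criteria and restoration procedures mentioned above; they are omitted here to keep the sketch short.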
The second step is to execute an experiment and measure parameters that describe the availability and resilience of the system, such as service level and mean time to repair. The experiments deliberately create a failure, for example by increasing CPU utilization or inducing a DNS outage.
During the initial stages of a chaos engineering implementation, the experiments are performed in a sandbox or pre-production environment. It is also important to restrict the blast radius, that is, the portion of the system an experiment can affect, to minimize the impact on the application. As confidence improves, the blast radius can be increased, and the experiments can move to a production environment.
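One way to restrict the blast radius is to inject the fault into only a fraction of the available hosts. The sketch below shows that idea in Python; the `select_targets` function and the host names are hypothetical.

```python
import math
import random


def select_targets(hosts, blast_radius, seed=None):
    """Pick a random subset of hosts to receive the fault.

    blast_radius is the fraction of hosts to affect, in the range (0, 1].
    At least one host is always selected when the host list is not empty.
    """
    if not 0 < blast_radius <= 1:
        raise ValueError("blast_radius must be in the range (0, 1]")
    if not hosts:
        return []
    count = max(1, math.floor(len(hosts) * blast_radius))
    rng = random.Random(seed)
    return rng.sample(list(hosts), count)


# Hypothetical sandbox hosts; start small and widen as confidence grows.
hosts = [f"sandbox-{i:02d}" for i in range(20)]
print(select_targets(hosts, 0.25, seed=1))
```

Starting with a small fraction in a sandbox and raising it gradually mirrors the progression described above.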
The plan for each experiment should be documented and should include:
a) Steady-state measurement
b) Activities that you will perform to trigger a failure
c) Activities that you will perform to monitor the application
d) Measurements to analyze the impact of the failure
e) Actions to roll back the system to a steady state
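The plan items (a) to (e) can be sketched as a small experiment runner. The Python below is an illustration under stated assumptions: `ExperimentPlan`, `run_experiment`, and the in-memory `ToyService` are hypothetical names, and the toy service stands in for a real application and its monitoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentPlan:
    name: str
    measure: Callable[[], float]      # (a), (c), (d): reads a metric such as error rate
    inject_fault: Callable[[], None]  # (b): triggers the failure
    rollback: Callable[[], None]      # (e): restores the steady state
    tolerance: float                  # allowed deviation from the steady state


def run_experiment(plan: ExperimentPlan) -> dict:
    steady_state = plan.measure()     # (a) record the steady state
    plan.inject_fault()               # (b) trigger the failure
    try:
        observed = plan.measure()     # (c), (d) monitor and measure the impact
    finally:
        plan.rollback()               # (e) always roll back, even on error
    recovered = plan.measure()
    return {
        "steady_state": steady_state,
        "observed": observed,
        "hypothesis_held": abs(observed - steady_state) <= plan.tolerance,
        "rolled_back": abs(recovered - steady_state) <= plan.tolerance,
    }


class ToyService:
    """Hypothetical service that stays healthy while at least two nodes run."""

    def __init__(self):
        self.healthy_nodes = 3

    def error_rate(self) -> float:
        return 0.0 if self.healthy_nodes >= 2 else 0.5

    def kill_node(self) -> None:
        self.healthy_nodes -= 1

    def restore(self) -> None:
        self.healthy_nodes = 3


service = ToyService()
plan = ExperimentPlan("kill one node", service.error_rate,
                      service.kill_node, service.restore, tolerance=0.01)
result = run_experiment(plan)
print(result)
```

Placing the rollback in a `finally` clause is a design choice that keeps step (e) from being skipped if monitoring itself fails.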
The third and last step is analysis, which determines the outcome of the experiments. If the metrics are affected, the experiment is halted and the failure is analyzed. An experiment that surfaces a failure is a useful result, because it exposes a weakness before it causes a real outage. Any changes required in the application are added to the product backlog. If the system proves resilient, the experiment is repeated with a larger blast radius.
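The halt-or-expand decision described above can be sketched as a loop over growing blast radii. The `escalate` function, the radius values, and the toy system that tolerates losing up to 30% of its nodes are all assumptions made for illustration.

```python
def escalate(run_at_radius, radii=(0.05, 0.25, 0.5, 1.0)):
    """Repeat an experiment at growing blast radii.

    run_at_radius(radius) must return True when the steady state held.
    The loop halts at the first radius where the steady state is violated.
    """
    for radius in radii:
        if not run_at_radius(radius):
            return {"halted_at": radius,
                    "next_action": "analyze the failure and add fixes to the product backlog"}
    return {"halted_at": None,
            "next_action": "system was resilient at every tested radius"}


# Toy system (assumed): stays within its steady state while at most 30% of nodes fail.
outcome = escalate(lambda radius: radius <= 0.3)
print(outcome)
```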
After the experiment is complete, the insights obtained show how the system really behaves during random failures. This helps engineering teams fix issues or define rollback plans. Introducing chaos engineering in an organization brings both business and technical benefits. For the business, chaos engineering helps prevent significant revenue losses, improves incident management response, improves on-call training for engineering teams, and improves the resiliency of the systems. From the technical point of view, data obtained from chaos experiments results in a better understanding of system failure modes, improved system design, and a reduction in repeated incidents and on-call burden.
Chaos Engineering Tools
Many tools are available for practicing chaos engineering. Chaos Monkey, Gremlin, the Simian Army, and Jepsen are among the best known, and Spinnaker is often mentioned alongside them because Chaos Monkey integrates with it. With Jepsen, you can test a distributed system by injecting faults such as killed components and network issues while generating random load. Chaos Monkey randomly terminates instances in production so that services are built to withstand sudden instance failures. The other tools each have their own way of running experiments and improving a product’s resilience. Depending on your requirements and budget, you can use any of them. Organizations can also build their own chaos engineering tools using code from open source tools. This may be time-consuming and expensive, but it gives complete control over the tool, options to customize it, and more security.
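The idea of randomly terminating instances can be illustrated with a small in-memory simulation. This Python sketch is not Chaos Monkey’s API or implementation; `InstanceGroup`, `terminate_random_instance`, and the `min_running` safety floor are hypothetical names introduced for this example.

```python
import random


class InstanceGroup:
    """Hypothetical in-memory stand-in for a group of service instances."""

    def __init__(self, name, instance_ids):
        self.name = name
        self.running = set(instance_ids)

    def terminate(self, instance_id):
        self.running.discard(instance_id)


def terminate_random_instance(group, rng, min_running=2):
    """Terminate one random instance unless that would drop below min_running.

    Returns the terminated instance id, or None when the termination was skipped.
    """
    if len(group.running) - 1 < min_running:
        return None
    victim = rng.choice(sorted(group.running))
    group.terminate(victim)
    return victim


rng = random.Random(42)
group = InstanceGroup("checkout", ["i-a", "i-b", "i-c"])
first = terminate_random_instance(group, rng)
second = terminate_random_instance(group, rng)  # skipped: only two instances remain
print(first, second, sorted(group.running))
```

The safety floor is one simple way for a home-grown tool to limit the damage a random termination can cause.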
Chaos engineering should not be treated as a one-time activity, because applications change frequently to meet demands from the business and end consumers. A vulnerability that was previously fixed is also likely to resurface, so it is important to validate the application with continuous chaos tests. The team can create a regression pack of prioritized chaos experiments to validate the resiliency of the system. If the experiments are fully automated, they can be integrated into the DevOps pipeline and executed as part of the weekly build to identify failures early in the life cycle.
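A regression pack of chaos experiments that a pipeline step can run might look like the following Python sketch. The function names and the three sample experiments are assumptions; in practice each callable would drive a real experiment and return whether the steady state held.

```python
def run_regression_pack(experiments):
    """Run each (name, experiment) pair and collect the ones that failed.

    An experiment is a callable that returns True when the system stayed resilient.
    An experiment that raises is also recorded as a failure.
    """
    failures = []
    for name, experiment in experiments:
        try:
            resilient = experiment()
        except Exception as exc:
            failures.append(f"{name}: raised {exc!r}")
            continue
        if not resilient:
            failures.append(f"{name}: steady state violated")
    return failures


def exit_code(failures):
    """Map results to a process exit code so a pipeline can fail the build."""
    return 1 if failures else 0


def broken_experiment():
    raise RuntimeError("monitoring unavailable")


# Hypothetical pack, ordered by priority.
pack = [
    ("kill one node", lambda: True),
    ("dns outage", lambda: False),
    ("cpu saturation", broken_experiment),
]

failures = run_regression_pack(pack)
for failure in failures:
    print(failure)
print("exit code:", exit_code(failures))
```

A pipeline job could pass the result of `exit_code` to `sys.exit` so that a violated steady state fails the build.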
Predicting system failures has become difficult because of complex application architectures. Because the cost of downtime is high, organizations should take a proactive approach to preventing crashes by applying chaos engineering practices. Organizations should implement chaos engineering as part of DevOps, invest in chaos engineering tools, and build the competency needed to improve application reliability.