IT IS BETTER TO PREPARE AND PREVENT THAN TO REPAIR AND REPENT
That quote fits our topic well: Chaos Engineering is a preventive mechanism.
Have you come across headlines such as "Customers are reporting difficulty in accessing mobile and web applications", "Website not working", or "Service Unavailable"? Reports of such unpredictable failures appear regularly.
Why so? What's missing?
The answer to handling this unpredictability is "Chaos Engineering".
What is Chaos Engineering?
Chaos engineering is the practice of testing a distributed computing system to build confidence that it can withstand unexpected disruptions. Its goal is to identify weaknesses in a system through controlled experiments that introduce random and unpredictable behavior.
While the goal of "Software Engineering" is to put capabilities into production, the goal of "Chaos Engineering" is to make sure those capabilities keep working when something fails. It is a disciplined approach to identifying failures before they become outages.
Need for Chaos Engineering
As businesses and their digital footprints grow, so does the cost of downtime. A single outage of a few minutes could cost a business millions of dollars.
With the rise of microservices architecture and distributed cloud systems, the web has grown increasingly complex. We all depend on these systems more than ever, yet failures have become much harder to predict.
These outages and failures often occur in complex, distributed systems, where several things fail at the same time and compound the problem. Finding and fixing the bugs can take anywhere from a few minutes to several hours depending on the system architecture, costing the company not only revenue but also customer trust.
A system is usually built to handle individual failures, but in large, chaotic systems the failure of one system or process can cascade into a severe outage. The term "Microservice Death Star" refers to a poorly designed architecture of complex, highly interdependent systems that are slow, inflexible, and prone to blowing up.
In the old world, our systems were simpler because of their monolithic architecture. It was easy to debug errors and fix them. Code changes were shipped once a quarter or once every six months. Today, architecture has changed a lot with the migration to the cloud, where innovation and speed of execution have become part of the system. Systems now change not over weeks and days but over hours and minutes.
Cloud-based and microservice architectures have given us many advantages, but they come with complexity and chaos that can cause failures. It is an engineer's responsibility to make the system as reliable as it can be.
A key benefit of chaos engineering is that organizations can use it to identify vulnerabilities before an attacker exploits them or a system failure exposes them. Changes made as a result of chaos engineering testing increase confidence in an organization's systems.
Process of Chaos Engineering
Chaos engineering experiments intentionally generate turbulent conditions in a distributed system to test it and find weaknesses. Chaos experiments typically look for the following:
- Blind spots. Areas where monitoring software cannot gather adequate data.
- Hidden bugs. Glitches that are not caught in testing and can cause software to malfunction.
- Performance bottlenecks. Situations where efficiency and performance could be improved.
Chaos engineering is similar to stress testing in that it aims to identify and correct system or network issues. Unlike stress testing, chaos engineering does not test and correct one component at a time; it covers the system as a whole.
The process is typically divided into several steps:
- Set the baseline. The testers must identify how the system should operate under optimal conditions and specify what constitutes a normal working state.
- Create a hypothesis. Consider one or more potential weaknesses and formulate a hypothesis about the effects of those weaknesses. For example, software testers might want to know what will happen if a large traffic spike occurs.
- Test. Conduct experiments to gauge the consequences of a large spike. The experiments might reveal an error in a critical process or an unexpected cause-and-effect relationship. For example, a traffic spike simulation might reveal a storage performance issue.
- Evaluate. Measure and evaluate how the hypothesis holds up and determine which problems to fix.
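The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service is replaced by a hypothetical in-memory latency model (`simulated_latency_ms`), and the capacity, traffic levels, and tolerance are made-up values rather than figures from a real system.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_ms: float
    spike_ms: float
    hypothesis_holds: bool


def simulated_latency_ms(rng, load_rps, capacity_rps=150):
    """Hypothetical service model: latency grows once load exceeds capacity."""
    overload = max(0, load_rps - capacity_rps)
    return rng.uniform(20.0, 40.0) + 0.5 * overload


def mean_latency_ms(rng, load_rps, samples):
    """Average latency over a number of simulated requests."""
    return statistics.mean(
        [simulated_latency_ms(rng, load_rps) for _ in range(samples)]
    )


def run_experiment(normal_rps=100, spike_rps=1000, tolerance_ms=50.0,
                   samples=200, seed=0):
    rng = random.Random(seed)
    # 1. Set the baseline: measure the steady state under normal traffic.
    baseline = mean_latency_ms(rng, normal_rps, samples)
    # 2. Create a hypothesis: mean latency stays within baseline + tolerance.
    limit = baseline + tolerance_ms
    # 3. Test: apply the traffic spike and measure again.
    spike = mean_latency_ms(rng, spike_rps, samples)
    # 4. Evaluate: compare the measurement with the hypothesis.
    return ExperimentResult(baseline, spike, spike <= limit)


if __name__ == "__main__":
    result = run_experiment()
    print(f"baseline={result.baseline_ms:.1f} ms, "
          f"spike={result.spike_ms:.1f} ms, "
          f"hypothesis holds: {result.hypothesis_holds}")
```

In a real experiment the latency model would be replaced by measurements from the running system, and a failed hypothesis would feed the list of problems to fix.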
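Chaos engineers use the fallacies of distributed computing as guiding principles. Each statement below is a common assumption that does not hold in practice, which makes it a good candidate for an experiment: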
- The network is reliable.
- There is zero latency.
- Bandwidth is infinite.
- The network is secure.
- Topology never changes.
- There is one admin.
- Transport cost is zero.
- The network is homogeneous.
Even extensive conventional testing cannot guarantee an error-free system, because it examines only single, pre-defined scenarios. The results reveal nothing new about the application, its behavior, its performance, or its properties. This uncertainty increases with microservice architectures, where the system keeps growing over time.
Chaos testing, in contrast, produces a wide range of unpredictable conditions in a distributed architecture in order to build confidence in the system's ability to withstand turbulent conditions in production. It is the deliberate introduction of failures and faulty scenarios into a system to understand how the system reacts and what the side effects are. This type of testing is an effective way to prevent or minimize outages before they impact the system and, ultimately, the business.
Examples of Chaos Engineering
There are many chaos experiments we can run against a system; the right choice depends mainly on our goals and the system architecture.
Below is a list of the most common chaos tests:
- Simulating the failure of a micro-component or a dependency.
- Simulating a high CPU load or a sudden increase in traffic.
- Simulating the failure of an entire AZ (Availability Zone) or region.
- Injecting latency and byzantine failures into services.
- Exhausting memory on instances (cloud services) through fault injection.
- Causing host failure.
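Two of the tests listed above, dependency failure and latency injection, can be approximated at the application level with a small wrapper. The sketch below is an assumption-laden illustration, not the way any particular chaos tool works: the `chaos` decorator, the `InjectedFault` exception, and `call_with_retry` are hypothetical names introduced here.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised when the chaos wrapper deliberately fails a call."""


def chaos(failure_rate=0.0, max_delay_s=0.0, rng=None):
    """Wrap a function so calls are randomly delayed or failed (hypothetical helper)."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Latency injection: sleep for a random time up to max_delay_s.
            if max_delay_s > 0:
                time.sleep(rng.uniform(0, max_delay_s))
            # Failure injection: fail a fraction of the calls.
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_retry(func, attempts=3):
    """Simple caller-side resilience: retry when a fault was injected."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    raise last_error


if __name__ == "__main__":
    # fetch_price stands in for a call to a real dependency.
    @chaos(failure_rate=0.3, max_delay_s=0.05, rng=random.Random(7))
    def fetch_price():
        return 42

    try:
        print("price:", call_with_retry(fetch_price))
    except InjectedFault as err:
        print("gave up:", err)
```

The wrapper only affects the decorated function, which keeps the blast radius of the experiment small. Failures at the host, zone, or region level need infrastructure-level tooling instead.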
Tools for Chaos Engineering
- Chaos Monkey: A tool for testing the resilience of a system. It works by disabling one system in production and observing how the remaining systems respond to the outage, so that stability is verified against enforced failures rather than assumed.
"Imagine a monkey entering a 'data centre', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices, and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy."
- Latency Monkey: Tests the fault tolerance of a service by introducing communication delays that simulate degradation or an outage in the network.
- Doctor Monkey: Checks the health of each instance, along with related signals such as CPU load, to detect unhealthy instances so that they can eventually be fixed.
- Conformity Monkey: Finds instances that do not adhere to best practices, as defined by a set of rules, and sends an email notification to the owner of the instance.
- Janitor Monkey: Keeps the cloud environment free of unused resources and clutter by disposing of any waste.
- Security Monkey: An extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances.
- Chaos Gorilla: Similar to Chaos Monkey, but it takes down a full Availability Zone during the test.
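The difference between Chaos Monkey and Chaos Gorilla can be illustrated with a small in-memory simulation. This sketch does not use the real tools or their APIs; the fleet layout, the zone names, and the `service_available` rule (at least two healthy instances) are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    healthy: bool = True


def build_fleet(zones=("az-a", "az-b", "az-c"), per_zone=2):
    """Create a hypothetical fleet spread evenly across zones."""
    return [Instance(f"web-{zone}-{i}", zone)
            for zone in zones for i in range(per_zone)]


def chaos_monkey(fleet, rng):
    """Monkey-style fault: take down one randomly chosen healthy instance."""
    candidates = [inst for inst in fleet if inst.healthy]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def chaos_gorilla(fleet, zone):
    """Gorilla-style fault: take down every healthy instance in one zone."""
    victims = [inst for inst in fleet if inst.zone == zone and inst.healthy]
    for inst in victims:
        inst.healthy = False
    return victims


def service_available(fleet, min_healthy=2):
    """Assumed availability rule: enough healthy instances remain."""
    return sum(inst.healthy for inst in fleet) >= min_healthy


if __name__ == "__main__":
    fleet = build_fleet()
    victim = chaos_monkey(fleet, random.Random())
    print("monkey took down:", victim.name,
          "| available:", service_available(fleet))
    lost = chaos_gorilla(fleet, "az-a")
    print("gorilla took down:", [inst.name for inst in lost],
          "| available:", service_available(fleet))
```

Spreading instances across zones is what lets this simulated service survive the zone-level fault. Placing the same six instances in one zone would make the second check fail.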
Advantages of Chaos Engineering
- Insights gained from chaos testing can reduce the number of production incidents in the future.
- Through Chaos Engineering, the team can verify how the system behaves on failure and take action accordingly.
- Chaos Engineering helps test the team's response to an incident. It also helps verify that a raised alert reaches the correct team.
- At a high level, Chaos Engineering improves overall system availability. Chaos experiments make the system more resilient to failures.
- Production outages can cause huge losses, depending on how heavily the system is used, so chaos engineering helps prevent large losses in revenue.
- It improves the confidence and engagement of team members in carrying out disaster recovery procedures and makes applications highly reliable.
Disadvantages of Chaos Engineering
- Implementing Chaos Monkey for a large-scale system and running experiments can increase costs.
- Carelessness or incorrect steps in designing and implementing an experiment can impact the application and, in turn, the customer.
- Chaos Monkey does not provide an interface for tracking and monitoring experiments. It runs through scripts and configuration files.
- Chaos Monkey does not support all kinds of deployment.
In today's software development landscape, chaos engineering has become a valuable tool that helps organizations improve the resiliency, flexibility, and velocity of their systems and operate distributed systems with more confidence. It also lets teams remediate an issue before it impacts the system. Adopting Chaos Engineering is an important step toward better outcomes.