IT IS BETTER TO PREPARE AND PREVENT THAN TO REPAIR AND REPENT
That quote fits our topic well: chaos engineering is a preventive mechanism.
Have you come across headlines such as "Customers are reporting difficulty in accessing mobile and web applications", "Website not working", or "Service Unavailable"? Such reports of unpredictable failures appear regularly.
Why so? What's missing?
The answer to handling this unpredictability is "Chaos Engineering".
Chaos engineering is the process of testing a distributed computing system to ensure that it can withstand unexpected disruptions. The goal of chaos engineering is to identify weaknesses in a system through controlled experiments that introduce random and unpredictable behavior.
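To make the idea of a controlled experiment concrete, here is a minimal Python sketch. It measures a success rate under normal conditions, measures it again while a fault is injected, and checks whether the drop stays within a tolerance. The names run_experiment, ExperimentResult, and with_random_replica_failure, as well as the toy "replica" system, are assumptions made up for illustration; they do not come from any particular chaos engineering tool.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline_success_rate: float
    chaos_success_rate: float
    hypothesis_held: bool


def success_rate(call: Callable[[], bool], attempts: int) -> float:
    """Fraction of attempts for which `call` reports success."""
    successes = sum(1 for _ in range(attempts) if call())
    return successes / attempts


def run_experiment(
    healthy_call: Callable[[], bool],
    faulty_call: Callable[[], bool],
    attempts: int = 1000,
    tolerance: float = 0.05,
) -> ExperimentResult:
    """Compare normal behaviour with behaviour under an injected fault."""
    baseline = success_rate(healthy_call, attempts)
    under_chaos = success_rate(faulty_call, attempts)
    # The hypothesis holds if the fault costs at most `tolerance` in success rate.
    held = (baseline - under_chaos) <= tolerance
    return ExperimentResult(baseline, under_chaos, held)


def with_random_replica_failure(
    replica_count: int, rng: random.Random
) -> Callable[[], bool]:
    """Hypothetical toy system: a request succeeds if any replica is still up.

    On every request, one randomly chosen replica is taken down (the fault).
    """
    def call() -> bool:
        replicas = [True] * replica_count
        replicas[rng.randrange(replica_count)] = False  # injected fault
        return any(replicas)
    return call


if __name__ == "__main__":
    rng = random.Random(42)
    for count in (1, 2):
        result = run_experiment(
            healthy_call=lambda: True,
            faulty_call=with_random_replica_failure(count, rng),
        )
        print(count, "replica(s):", result)
```

In this toy setup, a system with two replicas survives the loss of one, while a single-replica system does not. That is the kind of weakness such an experiment is meant to surface before it shows up as an outage.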
While the goal of "Software Engineering" is to put capabilities into production, the goal of "Chaos Engineering" is to ensure that those capabilities keep working when things go wrong. It is a disciplined approach to identifying failures before they become outages.
As businesses and their digital footprints grow, so does the cost of downtime. A single outage of a few minutes could cost a business millions of dollars.
With the rise of microservices architecture and distributed cloud systems, the web has grown increasingly complex. We all depend on these systems more than ever, yet failures have become much harder to predict.
These outages and failures often occur in complex, distributed systems, where several things fail at the same time and compound the problem. Finding and fixing the bugs can take anywhere from a few minutes to several hours, depending on the system architecture, and costs the company not only revenue but also customer trust.
A system is usually built to handle individual failures, but in large, chaotic systems, the failure of systems or processes may lead to severe outages. The term "Microservice Death Star" refers to a poorly designed architecture whose complex, highly interdependent systems are slow and inflexible, and can blow up and fail.
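A small calculation shows why heavy interdependence is risky. The sketch below assumes a simplified model that is not from the original text: every request must pass through all services in a chain, and the services fail independently of one another. Under those assumptions, the availability of the chain is the product of the individual availabilities. The function names are illustrative.

```python
import math
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every service must be up.

    Assumes independent failures, which real systems often violate.
    """
    return math.prod(availabilities)


def yearly_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per (non-leap) year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    one_service = 0.999
    chain = serial_availability([one_service] * 10)
    print(f"single service: {one_service:.4f}, "
          f"{yearly_downtime_minutes(one_service):.0f} min/year down")
    print(f"chain of 10:    {chain:.4f}, "
          f"{yearly_downtime_minutes(chain):.0f} min/year down")
```

Under this model, ten chained services at 99.9% availability each give roughly 99.0% for the chain, so the expected downtime is about ten times that of a single service.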
In the old world, our systems were simpler because of their monolithic architecture. It was easy to debug errors and then fix them. Code changes were shipped once a quarter or every six months. Today, architecture has changed a great deal with the migration to the cloud, where innovation and speed of execution have become part of our systems. Systems now change not over weeks and days but over hours and minutes.
Cloud-based and microservice architectures have given us many advantages, but they come with complexity and chaos that can cause failures. It is an engineer's responsibility to make the system as reliable as it can be.
A main benefit of chaos engineering is that organizations can use it to identify vulnerabilities before a hacker exploits them or before they cause a system failure. Changes made as a result of chaos engineering testing increase confidence in an organization's systems.
Chaos engineering experiments intentionally generate turbulent conditions in a distributed system to test the system and find weaknesses. Some of the key areas of a chaos experiment include:
Chaos engineering is similar to stress testing in that it aims to identify and correct system or network issues. Unlike stress testing, chaos engineering does not test and correct one component at a time; it covers the system as a whole.
The process is typically divided into several steps:
Chaos engineers use the fallacies of distributed computing, listed below, as core principles:
But even extensive testing does not guarantee an error-free system, because such testing examines only predefined, isolated scenarios. The results reveal nothing new about the application's or system's behavior, performance, and properties. This uncertainty increases with microservice architectures, where the system keeps growing over time.
Chaos testing, by contrast, generates a wide range of unpredictable outcomes by experimenting on a distributed architecture, in order to build confidence in the system's ability to withstand turbulent conditions in production. It is the deliberate introduction of failures and faulty scenarios into our system to understand how the system will react and what the side effects could be. This type of testing is an effective way to prevent or minimize outages before they impact the system and, ultimately, the business.
There are many chaos experiments that we can run against our system; which ones to choose depends mainly on our goals and system architecture.
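The deliberate introduction of failure can be sketched in a few lines of Python. The decorator below wraps any function so that it sometimes raises an error and optionally adds random latency, and a small retry helper shows one way the calling code might cope. The names chaos, InjectedFault, and call_with_retry are hypothetical and chosen for this sketch only. Real chaos tooling usually injects faults at the infrastructure level rather than inside application code.

```python
import random
import time
from functools import wraps
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised by the chaos wrapper (hypothetical name for this sketch)."""


def chaos(failure_rate: float, max_delay_s: float = 0.0,
          rng: Optional[random.Random] = None):
    """Decorator that injects random latency and failures into a function."""
    rng = rng or random.Random()

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            if max_delay_s > 0:
                time.sleep(rng.uniform(0, max_delay_s))  # injected latency
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_retry(func: Callable[[], T], retries: int = 3) -> T:
    """Call `func` up to `retries` times, re-raising the last injected fault."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error: Optional[InjectedFault] = None
    for _ in range(retries):
        try:
            return func()
        except InjectedFault as exc:
            last_error = exc
    assert last_error is not None
    raise last_error


if __name__ == "__main__":
    @chaos(failure_rate=0.3, rng=random.Random(7))
    def fetch_profile() -> str:
        return "profile data"  # stand-in for a real downstream call

    try:
        print(call_with_retry(fetch_profile, retries=3))
    except InjectedFault as exc:
        print("request failed even after retries:", exc)
```

Running the wrapped call many times, with and without the retry helper, gives a rough picture of how the caller reacts to the fault and what the side effects are, such as extra latency from repeated attempts.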
Below is a list of the most common chaos tests:
In today's software development landscape, chaos engineering has become a valuable tool that helps organizations not only improve the resiliency, flexibility, and velocity of their systems, but also operate distributed systems. It also makes it possible to remediate an issue before it impacts the system. Chaos engineering is worth adopting for these reasons.