When architecting distributed cloud applications, assume that failures will happen and design your applications for resiliency. A microservice ecosystem will fail at some point, so you need to learn to embrace failure. In short, design your microservices with failure in mind.
Chaos testing is the practice of intentionally introducing failures into your system to test the resiliency and recovery of your microservices architecture. Modern architectures need to minimize the Mean Time to Recovery (MTTR). It is therefore beneficial to validate different failure scenarios ahead of time and take the actions needed to stabilize the system and make it more resilient.
Chaos Monkey is a popular resiliency tool created by Netflix that helps applications tolerate random instance failures. It randomly terminates virtual machine instances and containers running in your production environment to surface errors and exception scenarios. Exposing the development team to failures more frequently helps them build resilient services.
Fault Injection with Istio
With Istio, failures such as HTTP errors or delays can be injected at the application layer to test the resiliency of the application. You can configure faults to be injected into requests that match specific conditions. You can inject either delays or aborts into the requests, which mimics latency between service calls and service failures, respectively.
Injecting planned errors and delays into your production system helps determine how resilient your microservice ecosystem is. It is a good way to check whether errors cascade, whether notifications reach development teams when there is an outage, whether there is enough observability to identify the root cause and, most importantly, whether the system recovers from the failure.
Istio enables you to inject two types of faults: HTTP error codes (aborts) and time delays.
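As one illustration of matching specific conditions, the sketch below aborts only requests that carry a particular header and lets all other traffic through untouched. The resource name, the `serviceb` host and the `x-chaos-test` header are hypothetical names chosen for this example, not part of any real deployment.

```yaml
# Sketch only: resource name, host and header name are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: serviceb-conditional-fault
spec:
  hosts:
  - serviceb
  http:
  # First rule: requests with the test header always receive a 503.
  - match:
    - headers:
        x-chaos-test:
          exact: "true"
    fault:
      abort:
        httpStatus: 503
        percentage:
          value: 100
    route:
    - destination:
        host: serviceb
  # Fallback rule: every other request is routed normally.
  - route:
    - destination:
        host: serviceb
```

Scoping the fault to a header like this keeps the experiment limited to requests you send deliberately, which can be a safer first step than failing a share of all traffic.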
Injecting HTTP Errors
A VirtualService manifest can define a fault injection rule that returns 503 errors for 50% of the traffic to ServiceB v2.
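A minimal sketch of such a rule follows. The resource name, the `serviceb` host and the `v2` subset are assumptions for illustration; the subset would have to be defined in a matching DestinationRule, which is not shown here.

```yaml
# Sketch only: resource name, host and subset are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: serviceb-abort
spec:
  hosts:
  - serviceb
  http:
  - fault:
      abort:
        httpStatus: 503      # status code returned instead of calling the service
        percentage:
          value: 50          # share of matching requests that are aborted
    route:
    - destination:
        host: serviceb
        subset: v2           # assumed to exist in a DestinationRule
```

With this in place, roughly half of the requests routed to the `v2` subset should receive a 503 from the proxy, while the rest reach the service as usual.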
Injecting Time Delays
Similarly, a VirtualService manifest can introduce an HTTP delay of 10 seconds for 50% of the incoming traffic to ServiceB v1.
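A minimal sketch of a delay rule follows, under the same assumptions as before: the resource name, the `serviceb` host and the `v1` subset are illustrative, and the subset would need a matching DestinationRule.

```yaml
# Sketch only: resource name, host and subset are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: serviceb-delay
spec:
  hosts:
  - serviceb
  http:
  - fault:
      delay:
        fixedDelay: 10s      # time added before the request is forwarded
        percentage:
          value: 50          # share of matching requests that are delayed
    route:
    - destination:
        host: serviceb
        subset: v1           # assumed to exist in a DestinationRule
```

A delay like this is useful for checking whether callers of the service have sensible timeouts and retries configured.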
Istio provides an easy way to test the resiliency of your services. The injection of errors and delays is transparent to the application and does not require any code-level changes. Since Envoy intercepts all incoming and outgoing network traffic, it handles the fault injection in the proxy itself, outside your application code.