Making your Microservices Resilient and Fault Tolerant

By Samir Behara on August 6, 2018

In a monolithic application, a single error has the potential to bring down the entire application. A Microservice architecture reduces this risk because it is made up of smaller, independently deployable units, so a failure in one unit need not affect the entire system. But does that mean a Microservice architecture is resilient to failures? No, not at all. Converting your monolith into Microservices does not resolve the issues automatically. In fact, working with a distributed system has its own challenges!

While architecting distributed cloud applications, you should assume that failures will happen and design your applications for resiliency. A Microservice ecosystem is going to fail at some point, so you need to learn to embrace failure. Don’t design systems on the assumption that it’s going to be sunny throughout the year. Be realistic and account for the chance of rain, snow, thunderstorms and other adverse conditions. In short, design your microservices with failure in mind. Things don’t always go as planned, and you need to be prepared for the worst-case scenario.

If Service A calls Service B which in turn calls Service C, what happens when Service B is down? What is your fallback plan in such a scenario?

Can you return a pre-decided error message to the user?
Can you call another service to fetch the information?
Can you return values from cache instead?
Can you return a default value?
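
The questions above can be read as an ordered fallback chain. The following is a minimal Python sketch of that idea, not a definitive implementation; the names get_with_fallbacks and ServiceUnavailable, and the use of a plain dict as the cache, are assumptions made for illustration.

```python
from typing import Callable, Optional


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) client when a downstream service is down."""


def get_with_fallbacks(
    key: str,
    primary: Callable[[str], str],
    secondary: Optional[Callable[[str], str]] = None,
    cache: Optional[dict] = None,
    default: Optional[str] = None,
) -> str:
    """Try each source in order and degrade step by step."""
    sources = [primary]
    if secondary is not None:
        sources.append(secondary)

    # 1. Call the primary service, then another service if one is given.
    for source in sources:
        try:
            value = source(key)
        except ServiceUnavailable:
            continue
        if cache is not None:
            cache[key] = value  # remember the last good answer
        return value

    # 2. Return a value from cache instead.
    if cache is not None and key in cache:
        return cache[key]

    # 3. Return a default value.
    if default is not None:
        return default

    # 4. Return a pre-decided error message.
    return "Service temporarily unavailable. Please try again later."
```

Which fallback is acceptable depends on the data: a stale cached value may be fine for a product description but not for an account balance.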

There are a number of things you can do to make sure that the entire chain of microservices does not fail with the failure of a single component.

What can go wrong in a Microservice architecture?

A Microservice architecture has many moving components, and hence more points of failure. Failures can have a variety of causes: errors and exceptions in code, releases of new code, bad deployments, hardware failures, datacenter failures, poor architecture, lack of unit tests, communication over an unreliable network, dependent services, etc.

Why do you need to make your services resilient?

A problem with distributed applications is that they communicate over a network, which is unreliable. Hence you need to design your microservices so that they are fault tolerant and handle failures gracefully. In your microservice architecture, there might be a dozen services talking to each other. You need to ensure that one failed service does not bring down the entire architecture.

How to make your services resilient?

Identify failure scenarios

Before releasing your new microservice to Production, make sure you have tested it well enough. Strange things might still happen, though, and you should be ready for the worst-case scenario. This means you should be prepared to recover from all sorts of failures gracefully and quickly. Doing so builds confidence in the system’s ability to withstand failures and recover quickly with minimal impact. Hence it is important to identify failure scenarios in your architecture.

One way to achieve this is to make your microservices fail on purpose and then try to recover from the failure. This process is commonly called Chaos Testing.
Think about scenarios like the ones below and find out how the system behaves:

Service A is not able to communicate with Service B.
Database is not accessible.
Your application is not able to connect to the file system.
Server is down or not responding.
Inject faults/delays into the services.
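
As one way to picture the last scenario, here is a minimal Python sketch that wraps a service call so that faults and delays can be injected in a test environment. The names with_chaos and InjectedFault and the parameters failure_rate and delay_seconds are hypothetical; dedicated chaos testing tools work at the infrastructure level and do far more than this.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(Exception):
    """Artificial failure raised on purpose during a chaos test."""


def with_chaos(
    func: Callable[..., Any],
    failure_rate: float = 0.0,
    delay_seconds: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap func so that calls are delayed and fail at random."""
    rng = rng or random.Random()

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if delay_seconds > 0:
            sleep(delay_seconds)  # simulate a slow service or network
        if rng.random() < failure_rate:
            raise InjectedFault("injected fault")  # simulate an outage
        return func(*args, **kwargs)

    return wrapper
```

Wrapping the client that Service A uses to reach Service B with failure_rate=1.0 simulates "Service B is down", which lets you observe whether Service A degrades gracefully.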

Avoid cascading failures

When you have service dependencies built into your architecture, you need to ensure that one failed service does not cause a ripple effect across the entire chain. By avoiding cascading failures, you will be able to save network resources, make optimal use of threads and also allow the failed service to recover!

In a nutshell: do not hammer a service that is already down with additional requests. Give the service time to recover.

Avoid single points of failure

While designing your microservice, it’s always a good idea to think about how the microservice will behave if a particular component is down. That will lead to a healthier discussion and help in building a fault-tolerant service.

Make sure you do not design your services to be heavily dependent on one single component. And if such a dependency is unavoidable, make sure you have a strategy to recover fast from its failure.

Design your applications such that you do not have single points of failure. You should be able to handle client requests at all times. Hence ensuring availability across the microservice ecosystem is critical.

Handle failures gracefully – Allow fast degradation

You should design your microservices to be fault tolerant: if there are errors or exceptions, the service should handle them gracefully by returning an error message or a default value.

Design for Failures

By following some commonly used design patterns, you can make your service self-healing. Let us now discuss these design patterns in detail.

What are the Design Patterns to ensure Service Resiliency?

Circuit Breaker Pattern

If there are failures in your Microservice ecosystem, then you need to fail fast by opening the circuit. Once the circuit breaker is open, no additional calls are made to the failing service; an exception is returned to the caller immediately instead. This pattern also monitors the system for failures, and once things are back to normal, the circuit is closed to allow normal functionality.
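
To make the open and closed states concrete, here is a minimal single-threaded Python sketch of a circuit breaker. It is an illustration under stated assumptions, not the implementation used by any particular library: the class name, the failure_threshold and reset_timeout parameters, and the rule of allowing one trial call after the timeout are all choices made for this example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised immediately while the circuit is open (fail fast)."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: do not touch the failing service at all.
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success: things are back to normal, so close the circuit.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically catch CircuitOpenError and serve a cached or default value. A production breaker also needs thread safety and metrics, which this sketch leaves out.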

Circuit Breaker Design Pattern

This is a very common pattern for avoiding cascading failures in your microservice ecosystem.

You can use popular third-party libraries such as Polly (for .NET) and Hystrix (for Java) to implement circuit breaking in your application.

Hystrix

Retry Design Pattern

This pattern states that you can automatically retry a call that failed earlier due to an exception. This is very handy in case of temporary issues with one of your services.
Often a simple retry fixes the issue. The load balancer might point you to a different, healthy server on the retry, and your call might succeed.
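
A minimal Python sketch of the retry idea follows. The function name retry, the TransientError exception type and the doubling delay between attempts are assumptions added for illustration; the text above only calls for retrying, and waiting a little longer before each new attempt is one common way to avoid hammering a struggling service.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """A (hypothetical) failure that is worth retrying, e.g. a dropped connection."""


def retry(
    func: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call func until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Only retry operations that are safe to repeat. Retrying a non-idempotent call, such as charging a credit card, can cause duplicates.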

Timeout Design Pattern

This pattern states that you should not wait for a service response indefinitely; throw an exception instead of waiting too long.
This ensures that you are not stuck in limbo, continuing to consume application resources. Once the timeout period elapses, the thread is freed up.
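
The following Python sketch shows one way to stop waiting after a deadline, using the standard library's concurrent.futures module. The names call_with_timeout and ServiceTimeout are hypothetical. Note one limitation of this particular approach: only the calling thread is released when the deadline passes, while the worker thread keeps running until the slow call returns on its own.

```python
import concurrent.futures
from typing import Any, Callable


class ServiceTimeout(Exception):
    """Raised when the service did not respond within the allowed time."""


def call_with_timeout(
    func: Callable[..., Any], timeout_seconds: float, *args: Any, **kwargs: Any
) -> Any:
    """Run func in a worker thread and give up waiting after timeout_seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        raise ServiceTimeout(f"no response within {timeout_seconds}s") from None
    finally:
        # Do not block the caller while the slow call finishes in the background.
        executor.shutdown(wait=False)
```

Where the client library offers its own timeout setting, prefer that, since it can abandon the underlying connection rather than just the wait.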

Learn more about the Circuit Breaker Pattern here.