
As Technical Program Managers, we are often tasked with leading teams that build resilient microservices. This post covers the things a TPM needs to keep in mind while building microservices.

We are going to discuss best practices for:

  • Building resilient, stable, and predictable services
  • Preventing cascading failures
  • Timeout patterns
  • Retry patterns
  • Circuit breakers
  • Challenges at scale

Microservices Architecture

The top-level architecture generally consists of clients, either web or mobile, that interact with the server-side microservices. These microservices include an API gateway and services that provide user accounts, recommendations, and so on.

Because they are microservices, these services are isolated and work independently, and each has its own database dependencies and cache layer. All of these services need to be available at all times so that users can consume them whenever they want, regardless of how they connect, whether from mobile devices or computers.

Advantages of Microservices

Moving towards a microservices architecture has many advantages, such as simplicity, because each service is responsible for its own functionality. Other advantages include isolation of problems, easy scale-up and scale-down, easy deployments, clear separation of concerns, heterogeneity, and polyglotism.

Teams have full command and authority over their microservices to troubleshoot problems and make decisions. They are responsible for deployments and for managing the services.

Disadvantages of Microservices

Besides the advantages, there are a few disadvantages of microservices as well. These are distributed systems, which can fail and which bring problems with eventual consistency. You need teams to monitor when things go wrong. More effort is required for deployments and release management. There are also challenges in testing services that evolve independently and in running regression tests.

Building a Resilient System

A resilient system is one that keeps processing transactions through minor transient impulses and persistent stresses, and that keeps working even when failures occur.

Accept that failures will happen; this mindset helps you design and develop crumple zones that can absorb and contain failures. Once the code is built and released, it must be monitored in the many places where things can go wrong.

Many kinds of failures can occur while building a resilient system, such as:

  • Challenges at scale (things that worked a few years back may not work now due to rapid growth).
  • Failures at integration points, such as network errors, semantic errors, slow responses, outright hangs, and GC issues.

Resiliency Planning

There are two main stages of resiliency planning. Before either of them, while you are developing code, take care of circuit breakers, timeouts, retries, bulkheads, and cache optimization to avoid cascading failures; rate limiting can be used to protect against misbehaving or malicious clients.

All of these areas should be taken care of while you are still writing the code. The two planning stages are:

Stage One (Before Deployment): This stage is very important as it deals with failures before you deploy code. You can do load tests, A/B tests, and longevity tests.

Stage Two (After Deployment): This stage is about watching for failures after the code is deployed. You can do this by monitoring health checks and metrics and by setting appropriate alerts and alarms.

Cascading Failures

A cascading failure is caused by a chain reaction. For example, one of many nodes that takes part in serving the workload fails, and the other nodes have to pick up its share to keep things going. This eventually leads to performance degradation.

Another example of a cascading failure involves aggregation. Suppose a service depends on many other microservices, and a user tries to load a profile picture; a certain amount of aggregated information must be assembled and returned to the client. If the aggregation API is not working, this too can cause cascading failures.

Timeouts

When a client makes a request, it expects a response within a certain amount of time. Whether the outcome is a failure, a success, or a queued job, the system should have timeout patterns set so that a step can be taken to resolve the client's request, especially when the system is facing cascading failures.

Types of Timeouts

  • Connection timeout: the maximum time allowed for a connection to be established before an error is returned.
  • Socket timeout: the maximum time a connection may remain inactive between two packets once the connection is established (a sketch of setting both follows this list).
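
As an illustration, here is a minimal sketch of setting both timeouts in plain Java using HttpURLConnection; the method name and the two-second and five-second values are purely illustrative.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutExample {
    // Returns the HTTP status, or throws SocketTimeoutException if either limit is exceeded.
    public static int ping(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(2_000); // connection timeout: max time to establish the connection
        conn.setReadTimeout(5_000);    // socket (read) timeout: max idle time between two packets
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```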

Timeout Patterns:

Timeouts and retries are interconnected: if a client is facing a transient failure, a fast retry can get past it. However, if the problem in the network lasts longer, several retries in a row will all fail.

Retry Patterns:

Retries can be attempted in case of network failures, timeouts, or server errors. They only help with transient errors such as dropped connections or brief server failures; retries will not help with permanent errors. And if a service is slow or malfunctioning and retries keep hammering it, they make the problem even worse.

The solution to this problem is to combine exponential back-off with the circuit breaker pattern. Exponential back-off means waiting progressively longer between retries after each error response, with a delay of one minute, then two, and so on, up to a maximum number of retries.
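
As a rough sketch (the helper name, attempt count, and delays are all illustrative), an exponential back-off retry loop can look like this:

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    // Retries a task up to maxAttempts times, doubling the wait between attempts.
    public static <T> T call(Callable<T> task, int maxAttempts, long initialDelayMs) throws Exception {
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {      // in real code, only retry errors known to be transient
                last = e;
                if (attempt == maxAttempts) break;
                Thread.sleep(delay);     // back off before the next attempt
                delay *= 2;              // exponential growth of the delay
            }
        }
        throw last;
    }
}
```

A caller might use it as `RetryWithBackoff.call(() -> client.fetchProfile(id), 5, 1_000)`, giving waits of 1, 2, 4, and 8 seconds between the five attempts.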

Circuit Breaker Pattern:

A circuit breaker is an electrical device in an electrical panel that monitors and controls the amount of current flowing through it; it trips when a power surge occurs, cutting off power to that specific circuit. The software pattern works the same way: if a node is behaving badly, no more requests are sent to that particular node and it is moved to a penalty box.
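
Libraries such as Hystrix or Resilience4j provide this pattern for production use; purely as a sketch of the idea, a hand-rolled breaker (the class name, threshold, and cool-down are illustrative) might look like this:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// After too many consecutive failures, calls are rejected ("penalty box")
// until a cool-down period passes, then the downstream node gets another chance.
public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final Duration coolDown;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public SimpleCircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    public synchronized <T> T call(Supplier<T> action) {
        if (openedAt != null && Instant.now().isBefore(openedAt.plus(coolDown))) {
            throw new IllegalStateException("Circuit open: node is in the penalty box");
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;        // a success closes the circuit again
            openedAt = null;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();   // trip the breaker
            }
            throw e;
        }
    }
}
```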

Bulkhead:

A bulkhead isolates a failure to a particular zone and does not allow it to spread to other zones and create a chain reaction, which helps prevent cascading failures. For example, isolate the database dependencies per service so that the other services keep functioning normally. Other infrastructure components, such as the cache infrastructure, can be isolated in the same way.
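
One common way to build a bulkhead in code is to give each dependency its own bounded thread pool, so a slow dependency can only exhaust its own threads. A minimal sketch (pool sizes and service names are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BulkheadExample {
    // Separate, bounded pools: a hung recommendations backend cannot starve account lookups.
    private final ExecutorService accountsPool = Executors.newFixedThreadPool(10);
    private final ExecutorService recommendationsPool = Executors.newFixedThreadPool(5);

    public Future<String> loadAccount(String userId) {
        return accountsPool.submit(() -> "account data for " + userId);           // placeholder work
    }

    public Future<String> loadRecommendations(String userId) {
        return recommendationsPool.submit(() -> "recommendations for " + userId); // placeholder work
    }
}
```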

Rate Limiting:

Rate limiting can also help to restrict further retry requests from a client. A specific client can be identified through the token it used or through its IP address. Rate limiting can be implemented as a filter in JAX-RS, which sends a 429 error to the client and allows it to retry after a specific time.
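
A very simplified JAX-RS filter along those lines might look like the sketch below; the header name, the fixed limit, and the absence of a sliding window are all simplifications for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;
import javax.ws.rs.core.Response;
import javax.ws.rs.ext.Provider;

@Provider
public class RateLimitFilter implements ContainerRequestFilter {
    private static final int LIMIT = 100;  // requests allowed per client (illustrative)
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    @Override
    public void filter(ContainerRequestContext ctx) {
        String client = ctx.getHeaderString("X-Api-Token");   // could also key on the client IP
        if (client == null) client = "anonymous";
        int n = counts.computeIfAbsent(client, k -> new AtomicInteger()).incrementAndGet();
        if (n > LIMIT) {
            ctx.abortWith(Response.status(429)                 // Too Many Requests
                    .header("Retry-After", "60")               // tell the client when to retry
                    .entity("Too many requests")
                    .build());
        }
    }
}
```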

Cache Optimization:

Cache optimization means storing the response to a request in temporary storage for a certain amount of time. This ensures the server is not overloaded with the same requests in the future, because the cache can fulfill them on its own. It is like picking something up from your table rather than fetching it from a drawer or going out to a shop to get it.
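
The idea can be sketched as a tiny read-through cache with a time-to-live; the class and parameter names below are illustrative, and real systems would typically use something like Caffeine, Redis, or Memcached instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;
        Entry(T value, long expiresAtMillis) { this.value = value; this.expiresAtMillis = expiresAtMillis; }
    }

    private final Map<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public V get(K key, Function<K, V> loader) {
        Entry<V> e = store.get(key);
        if (e != null && e.expiresAtMillis > System.currentTimeMillis()) {
            return e.value;                                  // hit: grab it from the "table"
        }
        V fresh = loader.apply(key);                         // miss: go to the backend "shop"
        store.put(key, new Entry<>(fresh, System.currentTimeMillis() + ttlMillis));
        return fresh;
    }
}
```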

Dealing With Latencies in Response:

A service can be slow at times. To deal with this, you can set timeouts on aggregation services, or dispatch requests in parallel and collect the responses. You can also attach a priority to each response you collect; for example, if the scheduling service fails, you can decide what kind of response to send back to your clients.
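
A scatter-gather sketch with CompletableFuture illustrates the idea: dispatch calls in parallel, wait a bounded time, and fall back for whatever is still missing. The service names, the 500 ms budget, and the fallback strings are assumptions for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ProfileAggregator {
    public String loadProfile(String userId) throws Exception {
        CompletableFuture<String> account = CompletableFuture
                .supplyAsync(() -> fetchAccount(userId))
                .exceptionally(ex -> "account unavailable");          // failure fallback
        CompletableFuture<String> recs = CompletableFuture
                .supplyAsync(() -> fetchRecommendations(userId))
                .exceptionally(ex -> "no recommendations right now"); // low priority, OK to drop

        try {
            CompletableFuture.allOf(account, recs).get(500, TimeUnit.MILLISECONDS);
        } catch (TimeoutException slowDependency) {
            // latency budget exceeded: use whatever has completed so far
        }
        return account.getNow("account still loading") + " | " + recs.getNow("recommendations pending");
    }

    private String fetchAccount(String userId) { return "account for " + userId; }      // placeholder
    private String fetchRecommendations(String userId) { return "recs for " + userId; } // placeholder
}
```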

Handling Partial Failures –Best Practices:

If a service is slow or unavailable, you can get the information from the cache and return it to the client; this way you can return partial results. Never block a service; instead, provide a caching layer and return cached data to the clients.
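
A minimal fallback sketch (the map-based "cache" and the service call are placeholders): on failure, serve the last known value rather than blocking or failing the whole response.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RecommendationFallback {
    private final Map<String, String> lastKnown = new ConcurrentHashMap<>();

    public String recommendationsFor(String userId) {
        try {
            String fresh = callRecommendationService(userId);   // live call (placeholder)
            lastKnown.put(userId, fresh);                        // remember it for future fallbacks
            return fresh;
        } catch (RuntimeException serviceDown) {
            // partial result: stale data from the cache layer instead of an error
            return lastKnown.getOrDefault(userId, "no recommendations available");
        }
    }

    private String callRecommendationService(String userId) {
        return "recommendations for " + userId;                  // placeholder implementation
    }
}
```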

Asynchronous Patterns:

If there are long-running requests, or some resources take longer to produce results, do not make clients wait for them. Use asynchronous patterns instead.

Reactive Programming Model:

Use reactive programming models such as CompletableFuture in Java 8, ListenableFuture, and RxJava.
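
For example, a CompletableFuture pipeline keeps the caller's thread free while each stage runs as soon as the previous one completes; the method name and fallback below are illustrative.

```java
import java.util.concurrent.CompletableFuture;

public class AsyncPipeline {
    public CompletableFuture<String> renderProfile(String userId) {
        return CompletableFuture
                .supplyAsync(() -> loadUser(userId))       // runs on a worker thread, not the caller's
                .thenApply(String::toUpperCase)            // transform once the result is ready
                .exceptionally(ex -> "guest profile");     // fallback if any stage fails
    }

    private String loadUser(String userId) { return "user-" + userId; }  // placeholder
}
```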

Asynchronous API:

Asynchronous APIs can be built using reactive patterns, message passing via message queues, and WebSockets.
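
With JAX-RS, for instance, an endpoint can release the request thread immediately and resume the response later; the resource path and the report-building step below are illustrative.

```java
import java.util.concurrent.CompletableFuture;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.container.AsyncResponse;
import javax.ws.rs.container.Suspended;

@Path("/report")
public class ReportResource {
    @GET
    public void generate(@Suspended AsyncResponse asyncResponse) {
        CompletableFuture
                .supplyAsync(ReportResource::buildReport)           // long-running work off the request thread
                .whenComplete((report, error) -> {
                    if (error != null) asyncResponse.resume(error); // propagate the failure
                    else asyncResponse.resume(report);              // send the result when it is ready
                });
    }

    private static String buildReport() { return "report contents"; }  // placeholder
}
```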

Logging:

Logging is crucial in web-based architectures because complex distributed systems fail in many ways. Logging helps to link events and transactions across multiple components; tools such as Loggly, Splunk, the ELK stack, and Logentries can help here.

Logging Best Practices:

The same pattern should be followed across all logs. Sensitive data should never be logged. The caller should be identified as part of the log entry so you can tell who took the action. Payloads should also not be logged under any circumstances.
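
A sketch of these practices with SLF4J and its MDC (the field names and message are illustrative): a request id and caller id are attached to every line, and only identifiers, never payloads or secrets, go into the message.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void placeOrder(String requestId, String callerId, String orderId) {
        MDC.put("requestId", requestId);   // lets you link events across components
        MDC.put("caller", callerId);       // identifies who took the action
        try {
            log.info("order accepted orderId={}", orderId);  // log ids, never payloads or sensitive data
        } finally {
            MDC.clear();                   // avoid leaking context onto reused threads
        }
    }
}
```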

Best Practices When Designing APIs for Mobile Clients:

One of the best practices when designing APIs for mobile clients is to avoid chattiness; there is no point in sending many small requests back and forth. Use an aggregation pattern to send all of the data in a single response.

Resiliency Planning – Stage One:

This stage comes before you deploy. A few things should be kept in mind, such as load testing, longevity testing, and capacity planning. For load testing, make sure you test the load on your APIs; JMeter is a tool that can help with load tests. For longevity testing, run the service for a few days and check whether it can keep running for a longer period of time. Anticipate your growth requirements and do capacity planning accordingly.

Resilience Planning – Stage Two:

This stage comes after the code is deployed. Set up health checks, metrics, and phased rollouts of features. For health checks, check memory, CPU, threads, and error rates, and send an alert if any of these exceeds 75% of its threshold so that remedial action can be taken before a failure occurs. Check response times and monitor for any unusual activity. Third-party library metrics can also be used, such as Couchbase hits and atop. Load average, uptime, and log sizes are other metrics that can be useful.
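
As a toy example of the 75% rule (a real service would expose this through an HTTP health endpoint and feed it into the monitoring stack), a heap-based health check might look like this:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HealthCheck {
    private static final double THRESHOLD = 0.75;  // alert once usage crosses 75% of the limit

    public static boolean isHealthy() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        double usedRatio = (double) heap.getUsed() / heap.getMax();  // getMax() is the configured ceiling
        return usedRatio < THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isHealthy() ? "UP" : "DEGRADED");  // a monitor would turn this into an alert
    }
}
```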

The monitoring server checks the production environment and sends alerts; PagerDuty and an email server are used to deliver them. The monitoring stack is based on an aggregation framework (application), New Relic for Java and Python (OS and application code), collectd, and Graphite (network and servers).

The Rollout of New Features:

Rollouts should be phased to target a specific number of customers at a time. Turn the feature off if it is not working as expected, and watch the alerts.

Real-Time Examples:

Netflix induces failures of services during the working day to test both the application's resilience and its monitoring. Latency Monkey simulates slow-running requests.

Failures happen all the time. What we can do to keep a service error-free is to look out for system failures continuously, expect them to happen at any time, and plan accordingly to prevent them. Errors are opportunities to make your services better: run an RCA (root cause analysis), learn, and grow.

That’s all Folks!

Mario Gerard

Also, check out the wonderful podcast on microservices with Vidhya Varat Agarwal here.