Implementing resilient microservices for (not only) Java developers (1)

🎯 What you’ll learn: the good software engineering practices to follow when working on microservices-based systems, practices that can help you deal with failures and errors. You should also gain some understanding of why they are important and in which real-life situations they come in handy.
✅ What you need to know before reading this post: you should be able to read samples of Java code (or any other popular OOP language) and have basic familiarity with the concept of microservices.
 

Introduction

It’s been over 10 years: microservices are here to stay. After all that time, it’s also clear that they can be a handful: the one thing that is abundant in microservices-based solutions is failure.
It’s just probability: many smaller moving pieces mean more network connections, more APIs, more varied technologies that fail in different ways, more deployments, things moving from machine to machine, changing network addresses, and so on.
Things that fail are often the things you can’t control: the service owned by another team or even another company. When working with microservices, things will fail and it won’t be your fault. The only reasonable thing to do is to prepare.
Below you will find a list of good practices and things to pay attention to if you want to improve the odds of survival (and of meeting the SLAs…) for your services. The list has two parts. The first group of practices is about being a little defensive and preparing for things going south outside of your service. The second is about being kind and reacting to a failure in a way that doesn’t make things worse for everybody else.

Resilience libraries

💡
What is a resilience library? It’s a library providing abstractions that come in handy when reasoning about failure, along with implementations to handle it. Two Java-world examples are resilience4j (https://resilience4j.readme.io/) and Failsafe (https://failsafe.dev/). They both aim to help programmers handle failure in their code, though they differ in style and user experience. A detailed comparison can be found here. When googling resilience-related topics, you are also likely to encounter references to Netflix Hystrix, which is no longer actively developed. Using a resilience library helps make deliberate decisions about failure handling in your code.
Throughout the post, I’ll be showing you excerpts of code that use Failsafe.
If you haven’t used it, the most important concept in Failsafe is a resilience Policy. Policies can be combined and composed from other policies. Once a Policy is composed, it can be used to wrap any executable logic. Executing business logic encapsulated in a user-provided Supplier with Failsafe may look like this:
Failsafe.with(fallback)
  .compose(retryPolicy)
  .compose(circuitBreaker)
  .compose(timeout)
  .get(() -> {
    // the supplier executes the business logic: an API call, a db call, ...
  });
fallback, retryPolicy, circuitBreaker and timeout are policies too: they control different aspects of handling a failure. We will discuss the details further on.
Let’s proceed to the list.

Be reliable and fault-tolerant: the defences

Defence 1: If they’re going to fail, better fail fast ⌛

 

The use case

Imagine you work on a BillingDataService. BillingDataService delivers some data to users, but it first needs to confirm a user’s permissions with an external authorization service, AuthZService.
 
One day, in production, a subset of calls to the 3rd-party AuthZService starts to take a very long time and eventually fails.
As you investigate, you can see that the requests that are stuck hanging are made on behalf of users who have a special character in their configured email. It’s a bug that the 3rd-party service has apparently rolled out. When you discover this, you’re not worried too much: how many users with special characters in their emails can there be?
Unfortunately, you soon start getting reports from other users too. They are not able to access their data in your BillingDataService.
You are a bit confused at this point, but you find the root cause quickly: BillingDataService uses a pool of HTTP connections to talk to the external world. The pool is configured to allow only one connection per host, and all requests from BillingDataService to the 3rd-party AuthZService go through that single connection. Because your service funnels every request to AuthZService through that one connection, all subsequent requests coming into your BillingDataService are stuck waiting for an affected user’s request to be handled, long enough for them to time out.
 

Good practice to implement

 
🏅
Fail fast. The first line of defence is configuring reasonable timeouts. You should configure timeouts when talking to anyone on the network.

Means to an end

 
When dealing with raw HTTP requests, you can usually configure timeouts both at the client and at the request level. The details will depend on the client - it may be an exposed client setting, or it may require implementing a RequestInterceptor.
Sample Java HTTP implementations:
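For instance, with the JDK’s built-in java.net.http.HttpClient (other clients expose similar settings), you can set a connect timeout on the client and an overall timeout on each request. This is only a minimal sketch; the URL is made up:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Client-level timeout: bounds how long establishing a connection may take
HttpClient client = HttpClient.newBuilder()
  .connectTimeout(Duration.ofSeconds(2))
  .build();

// Request-level timeout: bounds how long we wait for the response as a whole
HttpRequest request = HttpRequest.newBuilder(URI.create("https://authz.example.com/check"))
  .timeout(Duration.ofSeconds(10))
  .GET()
  .build();

HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());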
 
The other choice - applicable to any executable code - is to configure timeouts with a policy. In Failsafe a timeout policy is pretty simple to create:
Timeout<Object> timeout = Timeout.of(Duration.ofSeconds(10));
Let’s keep filling in the initial example:
Failsafe.with(fallback)
  .compose(retryPolicy)
  .compose(circuitBreaker)
  .compose(Timeout.of(Duration.ofSeconds(1)))
  .get(() -> {
    // the supplier executes the business logic: an API call, a db call, ...
  });
It may seem like overkill to use a Timeout policy for a single HTTP call: we could achieve the same by configuring the HTTP client. However, using a policy allows for much greater flexibility (it doesn’t have to be an HTTP call!) and, even in this simple case, it decouples failure handling from the mechanics of the call. Consider what happens if you try to switch to a different HTTP client implementation. A policy can also help implement more subtle behaviours, for example when we care about the time budget for a series of calls instead of a single call.
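To illustrate the time-budget case: a single Timeout policy can wrap a supplier that performs several calls, so that they share one deadline. A rough sketch - the two fetch methods are hypothetical placeholders:
// One shared 2-second budget for both calls together, not 2 seconds each
Timeout<Object> budget = Timeout.builder(Duration.ofSeconds(2))
  .withInterrupt() // interrupt the supplier if the budget is exceeded
  .build();

Failsafe.with(budget).get(() -> {
  String token = fetchAuthToken();  // hypothetical call 1
  return fetchBillingData(token);   // hypothetical call 2
});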

Defence 2: Success has many fathers, while failure… 🚢

The use case

The BillingDataService story above hinted at something else: the issue was that the same resource (the connection pool) was used to perform good (quick) calls and bad (slow-to-fail) calls. Because of the shared queue, authorization calls that could have completed successfully had to wait behind the bad calls and ended up being dropped.
While the cause for the slowness in the previous example was unusual, there are some scenarios in which it is reasonable to expect a great variance in response times:
  • calls to a set of different services
  • calls to services hosted in different cloud regions or availability zones. This will likely be an internal implementation detail: you may be talking to an API gateway and only seeing the result - a subset of calls getting slower or starting to fail
  • the issue may also happen with any other pooled and shared resource, such as threads: if a pool of threads executes tasks of different kinds and one kind of task gets slow, it will affect how long the other kinds of tasks wait for execution.
 
Let’s consider a CloudMonitoringService: it checks the health of resources that an organisation has in different clouds: Azure, AWS, GCP. The check itself can be a little lengthy, depending on the type of the resource. You cannot use aggressive timeouts: it is pretty usual for those checks to take a while from time to time, especially if a resource is changing its state. Nevertheless, the checks are handled by a thread pool, so a bit of slowness usually does not pose a problem.
One day one of the providers has an outage, and the checks become even slower for all resources hosted in their cloud! Suddenly all your threads are tied up, spending a lot of time checking resources in a single cloud. The users of your service get nervous. They use it to monitor the other two cloud providers too, and now that data is stale. Only one of the providers has an outage, so they don’t understand why they can’t have up-to-date data about the other two.

Good practice to implement

🏅
Isolate failure. If a part of your system fails, make sure it does not affect other parts. This is sometimes referred to as the bulkhead design pattern, after the sectioned partitions of a cargo ship: if there is a hole in one section and it fills with water, the cargo in that section is destroyed, but the ship does not sink thanks to the bulkhead. In a similar manner, your service may not be able to complete all of its workload, but it should stay up and continue to serve the workloads that can still succeed.
 
You should consider bulkheads especially if:
  • you’re talking to services in different geographies, with different response times
  • you’re talking to services with varying availability
  • some of the consumers are critical but others less important (if that distinction makes sense in your business domain)
  • your service serves different use cases and some users may still be satisfied if only a part of them works. The use cases may even be the same from a technical point of view - just two similar HTTP calls - but different from a business point of view: in our example, one use case serves Azure users and the other serves AWS users.

Means to an end

A bulkhead can be implemented at different layers: in the case of CloudMonitoringService, perhaps the checks for different cloud providers should be handled by different instances of the service, or even by different services. For example, this Azure article discusses bulkheads implemented at the system level.
Implementing bulkheads at the service level means having separate pools of resources for kinds of workloads that differ from a business perspective. Depending on your use case, you may be able to maintain separation by manually configuring the sizes of thread pools, connection pools and queues, and by being careful not to share them.
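For instance, in the CloudMonitoringService the per-provider isolation could be as simple as dedicated, fixed-size thread pools. A rough sketch - the pool sizes and task bodies are illustrative:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Separate pools: slow AWS checks can only exhaust awsChecks, while the threads
// serving Azure and GCP checks keep working
ExecutorService awsChecks = Executors.newFixedThreadPool(10);
ExecutorService azureChecks = Executors.newFixedThreadPool(10);
ExecutorService gcpChecks = Executors.newFixedThreadPool(10);

awsChecks.submit(() -> System.out.println("checking an AWS resource..."));
azureChecks.submit(() -> System.out.println("checking an Azure resource..."));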
If things get more complicated and/or you want to be more explicit about the intention, you can again defer to a resilience library.
Here is an example of a Failsafe bulkhead policy. The int value of 10 passed to the builder is the maximum number of concurrent executions the bulkhead permits.
// Wait up to 1 second for execution permission
Bulkhead<Object> bulkhead = Bulkhead.builder(10)
  .withMaxWaitTime(Duration.ofSeconds(1))
  .build();
source: https://failsafe.dev/bulkhead/
Failsafe's bulkhead can also be operated in a manual way, similar to Java’s Semaphore:
if (bulkhead.tryAcquirePermit()) {
  try {
    doSomething();
  } finally {
    bulkhead.releasePermit();
  }
}
source: https://failsafe.dev/bulkhead/
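Alternatively, instead of managing permits by hand, the bulkhead can wrap executions like any other policy. A small sketch with a placeholder body:
// Executions beyond the bulkhead’s capacity wait up to the configured max wait
// time and then fail, instead of piling up indefinitely
Failsafe.with(bulkhead)
  .run(() -> System.out.println("performing an isolated check..."));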
 

Defence 3: The next best thing ✉️

The use case

Imagine you’re working on a CloudCostOptimizationService.
Its job is to gather data that is later used to make cost-optimization suggestions about the organisation’s usage of cloud provider services. To get all the data it needs, the service first issues a call to get a list of active virtual machine instances, then asks each individual instance for more detailed data. One day, the call to list all the instances fails. Due to the lack of proper monitoring and alerting, the small service that is supposed to return the list stays down for a few days (!). Your service ends up with a gap of a few days’ worth of data.

Good practice to implement

🏅
Consider using defaults, caching or a fallback to stale data.
In the case of CloudCostOptimizationService, it would make sense to work with the (potentially) stale data. The list of virtual machines may be something that does not change that often, and some of the machines retrieved in a previous interval could likely still be found and queried for the detailed data. As a result, the gap would not be total: there would be at least partial data, which may be better than none. Here, context is king: you need to decide whether the same applies to your use case.

Means to an end

Using a default value, a cache or a fallback is a contextual decision. It may be an option:
  • for less critical services. For example, perhaps a RecommendationService can still return a hardcoded fallback list of recommendations even if it can’t get the personalised recommendations for a particular user.
  • for situations in which a partial result is better than no result
The implementation of this strategy will be very specific to your use case.
A useful abstraction may be Failsafe’s Fallback. It allows you to provide a default value for a failed computation, as well as implement other behaviours. To cite Failsafe’s docs:
They can […] be used to suppress exceptions and provide a default result:
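A minimal sketch of what that might look like - the default value and the deliberately failing supplier are made up for illustration:
import java.net.ConnectException;
import java.util.List;

// If the wrapped call fails, return a hardcoded default instead of propagating the error
Fallback<List<String>> fallback = Fallback.of(List.of("a default recommendation"));

List<String> recommendations = Failsafe.with(fallback)
  .get(() -> { throw new ConnectException("recommendation service is down"); });
// recommendations now holds the default list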
 

Do not make things worse

Being a good citizen 1: Is now a bad time? ⏰

Using error codes and handling different error codes appropriately is a non-negotiable basis for resiliency. Neglecting to do so is a surefire way to let a cascading failure bring down your system: if your service goes down because of a failure in its dependency, an unhandled exception or a rejection, it propagates the failure further through the system.
In addition to business logic constraints, the most important technical distinction to make when handling errors is whether the error is transient or not. Transient errors might go away on their own (HTTP 5xx), while others will keep happening because you, as a caller, are doing something wrong (HTTP 4xx). A special case is HTTP 429 (Too Many Requests), or its equivalents in other protocols, which we’ll get back to in the next point.
If someone tells you what is wrong, the best thing is to listen and act appropriately: in the case of error handling, this means retrying requests when there is a chance the request failed randomly. At scale, network failures start to happen more often, so as a programmer of a fault-tolerant microservice, you should not give up too easily. At the same time, being too persistent doesn’t make sense either: firstly, it may cause the total time of request handling to grow beyond acceptable limits; secondly, if you’re talking to a service which is already in trouble (for example, because of an unusually big load), sending more requests its way may only make things worse.

Good practice to implement

🏅
Where it makes sense from a business perspective, retry potentially transient errors a limited number of times, preferably with a backoff. One of the most popular strategies is exponential backoff, which means the pause between subsequent retries keeps getting longer.

Means to an end

When considering retry strategies, check the behaviour of the HTTP client in use. It may already retry requests - perhaps it will be sufficient to tweak this behaviour to your needs.
If you need to implement retries on your own, you can usually do this in a framework/client specific way - some frameworks will let you clone the request and fire it again, others will require you to use request interception.
However, as your application grows and becomes more complex, mixing API-handling code with retry handling will have negative consequences for readability and maintainability. Again, in the Java world, there are libraries that abstract that aspect and provide an API which makes it easy to reason about how we want to handle retries across code bases. Let’s go with an example from Failsafe, which offers a convenient RetryPolicy abstraction:
// Retry on a ConnectException up to 3 times with a 1 second delay between attempts
RetryPolicy<Object> retryPolicy = RetryPolicy.builder()
  .handle(ConnectException.class)
  .withDelay(Duration.ofSeconds(1))
  .withMaxRetries(3)
  .build();
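The fixed one-second delay above can be replaced with the exponential backoff mentioned earlier. The bounds below are only an illustration:
// Retry up to 5 times, waiting 1s, then 2s, 4s, 8s..., capped at 30 seconds
// (ChronoUnit comes from java.time.temporal)
RetryPolicy<Object> backoffPolicy = RetryPolicy.builder()
  .handle(ConnectException.class)
  .withBackoff(1, 30, ChronoUnit.SECONDS)
  .withMaxRetries(5)
  .build();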
It may be worth noting that RetryPolicies can be combined with a timeout configuration. In that case, the order of the policies matters: the timeout may apply to a single attempt or to all configured retries.
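For example, reusing the retryPolicy defined above, the two compositions below behave differently (a sketch of the idea, with placeholder work inside the lambdas):
// Timeout composed inside the retries: each individual attempt gets its own 1-second limit
Failsafe.with(retryPolicy)
  .compose(Timeout.of(Duration.ofSeconds(1)))
  .run(() -> System.out.println("calling the dependency..."));

// Timeout composed outside the retries: all attempts together must finish within 1 second
Failsafe.with(Timeout.of(Duration.ofSeconds(1)))
  .compose(retryPolicy)
  .run(() -> System.out.println("calling the dependency..."));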
💡
Extra Tip: For many 3rd-party service providers, retries with backoff are likely to be implemented in the SDK. This may be yet another argument to use the SDK instead of issuing raw requests on your own (assuming the SDK is available and you don’t have constraints preventing its usage). For example, that is the case with AWS: https://docs.aws.amazon.com/general/latest/gr/api-retries.html.

Being a good citizen 2: Don’t overwhelm them 🌋

The use case

The previous point mentioned HTTP error 429 “Too Many Requests”. The most common circumstance in which you’ll see it is a 3rd-party provider whose pricing tiers have a requests-per-time-unit constraint (e.g. max 100 req/min, max 10,000 req/hr, etc.).
Depending on the particular API, it may be a bit difficult to handle these kinds of errors with just the mechanisms we have discussed so far. You may be tempted to assess what delay would make you fall within the constraints of the external API and solve the problem with retries.

Good practice to implement

🏅
If you’re aware of a constraint (a rate limit) on an external resource, such as an API, and your usage of the resource gets near that constraint, it may be time to put rate limiting in place.

Means to an end

Short of implementing rate limiting yourself, you can again rely on a resilience library. In Failsafe:
// Permits execution every 10 ms
RateLimiter<Object> limiter = RateLimiter
  .smoothBuilder(Duration.ofMillis(10))
  .build();

// or

// Permits 10 executions every 1 second
RateLimiter<Object> limiter = RateLimiter.burstyBuilder(10, Duration.ofSeconds(1)).build();
source: https://failsafe.dev/rate-limiter/
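As with the other policies, the limiter can then wrap the calls it protects. When no permit is available (within the max wait time, if one is configured), the execution fails with RateLimitExceededException, which can be handled like any other failure. A minimal sketch with a placeholder body:
// Executions beyond the permitted rate fail instead of flooding the provider
Failsafe.with(limiter)
  .run(() -> System.out.println("calling the rate-limited API..."));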

Being a good citizen 3: Let them recover 🧑🏽‍⚕️

The use case

Let’s consider PersonalizedEmailService. It talks to a few other microservices to get different pieces of data about a retail site user - recommendations shown to them, orders so far, profile data, site usage statistics - and based on that sends a timely personalised marketing email to the user.
One day, the service that provides usage statistics starts throwing 500s. You know the team that works on it. You get the internal information that this usually happens when they are low on memory, which in turn happens when they load too big a chunk of data into memory. If you keep sending them requests while their service is vulnerable, it may not survive.
While you’re not quite happy with the architectural choices that led to this situation, you want to help keep the system running.

Good practice to implement

🏅
You should consider a circuit breaker. The pattern takes its name from an electrical circuit: the current flows through the circuit only when it is closed. If you break (open) the circuit, the current stops flowing. In software, it’s usually the requests that we want to stop from reaching a vulnerable service or component.

Means to an end

Here is a sample CircuitBreaker configuration in Failsafe. It lets you specify how many failed requests should cause the circuit breaker to open (the failure threshold), how long it should wait until it half-opens (the delay), and how many consecutive successful requests it takes for it to close again. The example sets a fixed failure threshold (5), but the threshold can also be specified as a ratio of failed requests (5 out of 10) or as a number of requests failed within a given period (10 in 3 minutes).
// Opens after 5 failures, half-opens after 1 minute, closes after 2 successes
CircuitBreaker<Object> breaker = CircuitBreaker.builder()
  .handle(ConnectException.class)
  .withFailureThreshold(5)
  .withDelay(Duration.ofMinutes(1))
  .withSuccessThreshold(2)
  .build();
source: https://failsafe.dev/circuit-breaker/
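To close the loop on the composition from the beginning of the post, the individual policies can now be filled in with concrete values. This is only a sketch: the thresholds are illustrative and fetchBillingData stands in for your own business call:
// Concrete policies composed into one executor (values are illustrative)
Fallback<Object> fallback = Fallback.of("no billing data available"); // last-resort default
RetryPolicy<Object> retryPolicy = RetryPolicy.builder()
  .handle(ConnectException.class)
  .withBackoff(1, 30, ChronoUnit.SECONDS)
  .withMaxRetries(3)
  .build();
CircuitBreaker<Object> circuitBreaker = CircuitBreaker.builder()
  .withFailureThreshold(5)
  .withDelay(Duration.ofMinutes(1))
  .withSuccessThreshold(2)
  .build();
Timeout<Object> timeout = Timeout.of(Duration.ofSeconds(1));

Object billingData = Failsafe.with(fallback)
  .compose(retryPolicy)
  .compose(circuitBreaker)
  .compose(timeout)
  .get(() -> fetchBillingData()); // hypothetical business call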
 
 

Other perspectives

Throughout this article, we talked about resiliency from the perspective of a developer working on a single microservice or a handful of them. To give the lesson a little bit of colour, I pictured them in opposition to the external world, which is full of failure and traps. This exaggeration was both intentional and a bit fictional.
In reality, many of these patterns can be applied at a higher level, or the need for them may be avoided entirely with the right architectural decisions. One example is implementing the isolation of resources and/or failure by splitting a microservice into a few separate ones with a smaller scope of responsibility (ideally a single responsibility). Another example, infrastructure permitting, would be implementing a circuit breaker at the level of a service mesh.
Finally, you will never be able to prevent all failures, and some of them will be the result of human error. Maintaining the ability to roll back a bad change and having the right processes in place for these situations is another topic we haven’t touched upon.

Summary

 
If you’re going to remember one thing from this post, it should be that calls to external services will fail in unexpected ways and in surprising patterns. Scale only makes failure more frequent and more certain.
 
If you neglect resiliency and build for the happy path only, there will come a moment in the evolution of your system when omissions in error handling and failure to account for failure lead to frustrated on-call teams in war rooms, trying to figure out how to stop a propagating cascade of mistakes. As an individual contributor working within the constraints of a single microservice, you should consider these aspects of your code:
  • Configure all external requests with timeouts and retries that make sense for your business case
  • Handle errors with respect to their type: is the error transient? Should the request be retried?
  • Isolate the failure: watch out for using shared resources (queues, pools) to serve different workflows
  • Consider if and to what extent a partial failure is an option
  • Try not to make things worse: use rate limiting and circuit breakers where applicable
Using a resilience library such as resilience4j or Failsafe may help decouple failure handling from business logic, reuse failure-handling code, highlight the intention, simplify the implementation and improve the readability of the code.
A software engineer’s life is about making compromises and choosing where to spend attention and effort so that they matter the most. Hopefully this post will make it easier to invest a little in resiliency at just the right time.