Protect AWS SDK calls with Bulkheads and Circuit Breakers

Michael Wittig – 17 Feb 2016

If you use one of the AWS SDKs to make calls to the AWS API, you need to prepare for network unreliability. One of the AWS services that requires heavy SDK usage is DynamoDB (NoSQL database as a service). I will use DynamoDB as an example in this post, but the same applies to other AWS services like S3 and SQS that are used via SDKs.

Motivation

Calls to AWS services can be slow for various reasons:

The network can be slow.
The AWS service can be slow.
Your virtual machine can be slow.

Whatever the reason is, your users will experience longer loading times. If things go wrong, the slowness of one service can propagate through your system: every caller that waits for the slow service holds on to its own resources (e.g. connections from a connection pool) until those are exhausted as well. In the end, your whole system can become slow just because one call to the AWS API is slowed down. This is especially important if you run a microservices architecture.

Literature

In his book Release It!, Michael T. Nygard describes effective techniques to increase the stability of a system that makes calls to other services:

Timeouts
Circuit Breakers
Bulkheads

You can use these techniques to protect your service while talking to other services that are not well-behaved.

Are AWS services well-behaved?

Yes, they are! I want to highlight that AWS services provide very high uptime and low latency. The protection that we are going to create is not needed most of the time. Nevertheless, if you use DynamoDB, you can run into a situation where DynamoDB throttles your requests by adding an artificial delay. This happens if your reads or writes exceed the provisioned capacity of the table. In this situation, SDK calls will be slower than usual.

Timeouts

If your appointment doesn’t show up, you will sooner or later stop waiting and do something else. The problem with computers is that they don’t get bored while they wait. If you tell them to wait forever, they are fine with that. But the user who wants to interact with your application is a human and gets bored pretty soon.

Depending on the SDK you use, you can set different timeouts. The Java SDK exposes four timeouts that you can define:

The connection timeout determines the time to wait when initially establishing a connection before giving up and timing out.
The socket timeout defines the time to wait for data to be transferred over an established, open connection before the connection times out and is closed.
The request timeout defines the time to wait for a single request to complete before giving up and timing out.
The client execution timeout defines the time to allow the client to complete the execution of an API call. This could include multiple requests in case of retries.

Let’s have a look at the defaults:

connection timeout: 50 seconds
socket timeout: 50 seconds
request timeout: disabled
client execution timeout: disabled

For DynamoDB, these timeouts are much too high for simple GetItem, Query or PutItem operations. No user will accept a page that takes 50 seconds to load. Also keep in mind that while you wait for something, you occupy resources that can’t be used for other requests.

I propose the following timeouts for DynamoDB:

final ClientConfiguration cc = new ClientConfiguration();
cc.setConnectionTimeout(500); // 500 milliseconds
cc.setSocketTimeout(1000); // 1 second
cc.setRequestTimeout(0); // 0 disables the per-request timeout
cc.setClientExecutionTimeout(2000); // 2 seconds in total, including retries
final AmazonDynamoDBClient c = new AmazonDynamoDBClient(cc);
c.setRegion(Region.getRegion(Regions.EU_WEST_1));
final DynamoDB ddb = new DynamoDB(c);
final Table t1 = ddb.getTable("table1");
final Item i1 = t1.getItem("id", "abc");

Now you will get a timeout after 2 seconds in total, or if the connection cannot be established within 500 milliseconds, or if the connection does not transfer data within 1 second. Depending on your use case, 2 seconds may still be too long.
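To make the idea of an overall deadline more concrete, here is a minimal sketch in plain Java. It does not use the AWS SDK and is not how the SDK implements its timeouts; DeadlineCall and callWithDeadline are hypothetical names made up for this illustration. The sketch only shows the principle: run the blocking call on another thread and stop waiting after a fixed amount of time.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the AWS SDK: puts an overall deadline on a
// blocking call, similar in spirit to the client execution timeout.
final class DeadlineCall {
  private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
    Thread t = new Thread(r);
    t.setDaemon(true); // abandoned calls must not keep the JVM alive
    return t;
  });

  // Returns the result, or an empty Optional if the deadline passed first.
  // Simplification: a call that returns null also yields an empty Optional.
  static <T> Optional<T> callWithDeadline(Callable<T> call, long timeoutMillis) {
    Future<T> future = POOL.submit(call);
    try {
      return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the worker so it stops occupying a thread
      return Optional.empty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      future.cancel(true);
      return Optional.empty();
    } catch (ExecutionException e) {
      throw new IllegalStateException("call failed", e.getCause());
    }
  }
}
```

A call such as DeadlineCall.callWithDeadline(() -> loadItem("abc"), 2000), where loadItem stands for any blocking lookup of your own, either returns the item or gives up after 2 seconds. With the AWS SDK you do not need such a wrapper, because the client configuration shown above already enforces the deadline.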

Bulkheads

The idea of a bulkhead is to separate the ship/system into parts, so that you don’t lose the whole ship/system if something goes wrong in one of them.

Translated to our DynamoDB AWS SDK, there is one thing that is shared and should be separated more wisely: the connection pool. By default, a client can make 50 connections in parallel.

Have a look at the following example:

final ClientConfiguration cc = new ClientConfiguration();
cc.setConnectionTimeout(500);
cc.setSocketTimeout(1000);
cc.setRequestTimeout(0);
cc.setClientExecutionTimeout(2000);
cc.setMaxConnections(50);
final AmazonDynamoDBClient c = new AmazonDynamoDBClient(cc);
final DynamoDB ddb = new DynamoDB(c);
final Table t1 = ddb.getTable("table1");
final Table t2 = ddb.getTable("table2");
final Item i1 = t1.getItem("id", "abc");
final Item i2 = t2.getItem("id", "abc");

What’s the problem? Let’s go back to the scenario where one of the tables (e.g. table1) runs out of capacity. DynamoDB will throttle requests to table1. If a request takes longer, it also blocks a connection for longer. Therefore, the overall throughput is reduced. The problem is that this also affects requests to table2. While all 50 connections are busy waiting for slowed-down answers from table1, your application cannot talk to table2 either. Increasing the maximum connection limit doesn’t help. It just takes a bit longer until you run out of connections. But what about creating an AmazonDynamoDBClient for every table? You get two separate connection pools, which is exactly what we need for a bulkhead.

final ClientConfiguration cc = new ClientConfiguration();
cc.setConnectionTimeout(500);
cc.setSocketTimeout(1000);
cc.setRequestTimeout(0);
cc.setClientExecutionTimeout(2000);
cc.setMaxConnections(50);
final AmazonDynamoDBClient c1 = new AmazonDynamoDBClient(cc);
final AmazonDynamoDBClient c2 = new AmazonDynamoDBClient(cc);
final DynamoDB ddb1 = new DynamoDB(c1);
final DynamoDB ddb2 = new DynamoDB(c2);
final Table t1 = ddb1.getTable("table1");
final Table t2 = ddb2.getTable("table2");
final Item i1 = t1.getItem("id", "abc");
final Item i2 = t2.getItem("id", "abc");

In the above code example, it is very explicit which client is used for which requests. This gets a bit trickier if you use dependency injection. Make sure that you understand which connection pool is used for which requests.
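The same isolation idea can be sketched without the AWS SDK. The following plain Java example uses a semaphore per dependency instead of a connection pool per client, so it is a different mechanism than the one shown above; Bulkhead and its methods are hypothetical names for this illustration. It caps the number of concurrent calls to one dependency and rejects additional calls immediately instead of letting them queue up.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper, not part of the AWS SDK: caps the number of concurrent
// calls to one dependency so it cannot use up capacity meant for others.
final class Bulkhead {
  private final Semaphore permits;

  Bulkhead(int maxConcurrentCalls) {
    this.permits = new Semaphore(maxConcurrentCalls);
  }

  // Runs the call if a permit is free; otherwise rejects it right away
  // instead of queueing behind slow calls.
  <T> T call(Supplier<T> supplier) {
    if (!permits.tryAcquire()) {
      throw new IllegalStateException("bulkhead full");
    }
    try {
      return supplier.get();
    } finally {
      permits.release(); // always hand the permit back, even on failure
    }
  }

  int availablePermits() {
    return permits.availablePermits();
  }
}
```

With one Bulkhead instance per table, a throttled table1 can at most use up its own permits. Calls to table2 go through a separate instance and are not affected.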

Circuit Breakers

A circuit breaker cuts the circuit if the current reaches a threshold that is not reached under normal conditions. By cutting the circuit, the circuit breaker prevents something from burning and may eventually save your life.

Translated to the software world, a circuit breaker can cut the normal path of your software when:

runtime exceeds a threshold
error rate exceeds a threshold
resource consumption exceeds a threshold

Instead of stopping everything when the circuit is interrupted, a software circuit breaker can fall back to another code path. After some time, the software circuit breaker sends a few requests down the normal path to see if it is working again. If it is, the circuit breaker switches back to the normal path.
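The behavior described above can be sketched as a small state machine in plain Java. This is a simplified illustration, not how any particular library works: SimpleCircuitBreaker is a hypothetical class, it counts consecutive failures instead of an error rate, and it sends a single trial call instead of a few. The clock is passed in so that the behavior can be tested without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker for illustration only.
final class SimpleCircuitBreaker {
  private final int failureThreshold; // consecutive failures before opening
  private final long openMillis;      // how long to stay open before a trial call
  private final LongSupplier clock;   // injected so tests can control time
  private int consecutiveFailures = 0;
  private long openedAt = -1;         // -1 means the circuit is closed

  SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
    this.failureThreshold = failureThreshold;
    this.openMillis = openMillis;
    this.clock = clock;
  }

  // synchronized keeps the sketch simple, but it also serializes all calls.
  synchronized <T> T call(Supplier<T> normalPath, Supplier<T> fallback) {
    boolean open = openedAt >= 0;
    if (open && clock.getAsLong() - openedAt < openMillis) {
      return fallback.get(); // circuit open: skip the normal path entirely
    }
    try {
      T result = normalPath.get(); // closed, or a trial call after the wait
      consecutiveFailures = 0;
      openedAt = -1; // success closes the circuit again
      return result;
    } catch (RuntimeException e) {
      consecutiveFailures++;
      if (open || consecutiveFailures >= failureThreshold) {
        openedAt = clock.getAsLong(); // open, or re-open after a failed trial
      }
      return fallback.get();
    }
  }

  synchronized boolean isOpen() {
    return openedAt >= 0;
  }
}
```

While the circuit is open, the normal path is not invoked at all. That is the point of the pattern: a struggling dependency gets time to recover, and callers get a fast answer from the fallback instead of waiting for a timeout.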

Translated to our DynamoDB AWS SDK, it makes sense to protect against high latencies during throttling by falling back to another code path. DynamoDB will not only throttle our requests if we exceed the capacity: if we make many more requests than we should, it will also start returning errors. Luckily, Netflix implemented a software circuit breaker called Hystrix. The following example demonstrates how you can use Hystrix and the DynamoDB AWS SDK together.

To implement a circuit breaker you need to:

extend HystrixCommand
implement the run() method

The run() method represents the normal path of your application. When you invoke execute() on your HystrixCommand, run() is invoked as long as it doesn’t take too long (2000 ms as specified) and doesn’t throw exceptions too often (50 % of the time by default).

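import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

// Nested inside one of your own classes.
public static class GetItemCommand extends HystrixCommand<Item> {
  private final Table table;
  private final String id;

  public GetItemCommand(Table table, String id) {
    // One command group per table; by default the group key also selects the thread pool.
    super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey(table.getTableName()))
      .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
        .withExecutionTimeoutInMilliseconds(2000)) // run() may take at most 2 seconds
      .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
        .withCoreSize(50)) // matches the 50 connections of the client
    );
    this.table = table;
    this.id = id;
  }

  // The normal path: read the item from DynamoDB.
  @Override
  protected Item run() {
    return this.table.getItem("id", this.id);
  }
}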

final ClientConfiguration cc = new ClientConfiguration();
cc.setConnectionTimeout(500);
cc.setSocketTimeout(1000);
cc.setRequestTimeout(0);
cc.setClientExecutionTimeout(2000);
cc.setMaxConnections(50);
final AmazonDynamoDBClient c = new AmazonDynamoDBClient(cc);
final DynamoDB ddb = new DynamoDB(c);
final Table t1 = ddb.getTable("table1");
final Item i1 = new GetItemCommand(t1, "abc").execute();

If you would like to provide a fallback implementation, you can also override the getFallback() method. In some cases, you could fall back to retrying the request and hoping for lower latency this time, querying an in-memory cache, or returning static data. You could also disable a feature entirely (e.g. not display recommendations to your users) but continue to display a details page. Read more about Hystrix if you are interested.
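Two of the fallback strategies mentioned above, querying an in-memory cache and returning static data, can be sketched in plain Java as follows. CachedFallback is a hypothetical helper for this illustration and does not use Hystrix. In a HystrixCommand, the lookup in the catch branch is roughly what you would put into getFallback().

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper: remembers the last good value per key and serves it
// when the normal path fails; returns static data if nothing is cached.
final class CachedFallback<K, V> {
  private final Map<K, V> lastGood = new ConcurrentHashMap<>();
  private final Function<K, V> normalPath;
  private final V staticDefault;

  CachedFallback(Function<K, V> normalPath, V staticDefault) {
    this.normalPath = normalPath;
    this.staticDefault = staticDefault;
  }

  V get(K key) {
    try {
      V value = normalPath.apply(key);
      if (value != null) {
        lastGood.put(key, value); // ConcurrentHashMap does not accept null values
      }
      return value;
    } catch (RuntimeException e) {
      // Fallback path: last known value, or static data if there is none.
      return lastGood.getOrDefault(key, staticDefault);
    }
  }
}
```

Keep in mind that the cached value may be stale. Whether that is acceptable depends on the feature: outdated recommendations are usually fine, an outdated account balance is not.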

Summary

If you use the DynamoDB AWS SDK for Java, I highly recommend that you lower the timeouts.
To prevent one throttled DynamoDB table from affecting other parts of your application, I recommend that you create an AmazonDynamoDBClient per table.
In case of failures like capacity exceeded exceptions or many exceeded timeouts, a circuit breaker like Hystrix will cut the normal path and switch to another path, so that no more requests are sent for a while.

Michael Wittig

I’ve been building on AWS since 2012 together with my brother Andreas. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, HyperEnv for GitHub Actions, and marbot.
