High availability is a no-brainer: EC2 auto-recovery

Andreas Wittig – 09 Nov 2015

Werner Vogels, CTO of AWS, is often quoted as saying “Everything fails all the time.” This does not mean AWS is an unreliable cloud provider. Quite the contrary: AWS plans for failure. All services are highly available or fault tolerant: some by default, while others offer tools to achieve this goal.

Problem

An EC2 instance (virtual machine) is not highly available by default. The underlying virtualization layer, the operating system of the host system, and the hardware of the host system are all possible points of failure. If one of these parts breaks, the EC2 instance becomes unavailable.

Solution

AWS offers tools to handle the failure of an EC2 instance. The following figure shows the easiest way to recover from a failure:

The EC2 instance fails for one of the previously described reasons.
A health check of the EC2 instance is performed automatically in the background and reported to CloudWatch, the monitoring service from AWS.
A CloudWatch alarm triggers the recovery of the EC2 instance if the health check detects a failure.
A new EC2 instance will be started automatically to replace the failed one.
The new EC2 instance is a clone of the failed EC2 instance. The instance ID and the private IP address stay the same, and so does the public IP address if you use an Elastic IP. As long as data is stored on EBS volumes, no data is lost.

EC2 auto-recovery process

The following components are needed to set up auto-recovery for EC2 instances:

EC2 instance from the C3, C4, M3, M4, R3, or T2 family
CloudWatch alarm based on the health check
Elastic IP if you want to keep the same public IP address after an auto-recovery
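
To make the alarm component more concrete, here is a minimal Python sketch of how such a CloudWatch alarm could be created with boto3. It is an illustration under stated assumptions, not the template referenced in this article: the alarm name, the 60-second period, and the five evaluation periods are assumed values, and the function names are hypothetical. The metric StatusCheckFailed_System and the recover action ARN are the standard AWS names for the system status check and the recovery action.

```python
"""Sketch: build the CloudWatch alarm that triggers EC2 auto-recovery."""

# Instance families listed in this article; AWS may support others.
SUPPORTED_FAMILIES = {"c3", "c4", "m3", "m4", "r3", "t2"}


def supports_auto_recovery(instance_type):
    """Return True if the type (e.g. 't2.micro') belongs to a listed family."""
    family = instance_type.split(".")[0].lower()
    return family in SUPPORTED_FAMILIES


def recovery_alarm_params(instance_id, region):
    """Build the keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"auto-recovery-{instance_id}",  # hypothetical naming scheme
        "AlarmDescription": "Recover the EC2 instance if the system status check fails",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,            # seconds (assumed value)
        "EvaluationPeriods": 5,  # assumed: check must fail five periods in a row
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # The recover action is addressed by a region-specific ARN.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recovery_alarm(instance_id, region):
    """Create the alarm; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**recovery_alarm_params(instance_id, region))
```

Keeping the parameters in a separate function makes it possible to inspect or test the alarm definition without touching an AWS account; only create_recovery_alarm talks to the API.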

Use CloudFormation template

I have written a template that you can use to launch an EC2 instance with auto-recovery. It uses Infrastructure as Code to create the needed components and links. You can use AWS CloudFormation to create your EC2 instance with auto-recovery in minutes. The GitHub repository widdix/aws-cf-templates contains the CloudFormation template for EC2 with auto-recovery and some more useful templates.
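
To give an idea of what such a template might contain, the following Python sketch assembles a small CloudFormation template as a dictionary and prints it as JSON. This is not the template from the widdix/aws-cf-templates repository; the logical resource names (Instance, ElasticIP, RecoveryAlarm), the placeholder AMI ID, and the alarm timing values are assumptions chosen for illustration, and a real template would also need networking and security settings.

```python
import json


def build_template(image_id, instance_type="t2.micro"):
    """Return a CloudFormation template (as a dict) for EC2 with auto-recovery."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "EC2 instance with auto-recovery (illustrative sketch)",
        "Resources": {
            # The virtual machine that should be recovered on failure.
            "Instance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {"ImageId": image_id, "InstanceType": instance_type},
            },
            # Elastic IP so the public IP address survives a recovery.
            "ElasticIP": {
                "Type": "AWS::EC2::EIP",
                "Properties": {"InstanceId": {"Ref": "Instance"}, "Domain": "vpc"},
            },
            # Alarm on the system status check that triggers the recover action.
            "RecoveryAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "AlarmDescription": "Recover the EC2 instance on system failure",
                    "Namespace": "AWS/EC2",
                    "MetricName": "StatusCheckFailed_System",
                    "Statistic": "Minimum",
                    "Period": 60,            # seconds (assumed value)
                    "EvaluationPeriods": 5,  # assumed value
                    "ComparisonOperator": "GreaterThanThreshold",
                    "Threshold": 0,
                    "AlarmActions": [
                        {"Fn::Sub": "arn:aws:automate:${AWS::Region}:ec2:recover"}
                    ],
                    "Dimensions": [
                        {"Name": "InstanceId", "Value": {"Ref": "Instance"}}
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # "ami-12345678" is a placeholder, not a real image ID.
    print(json.dumps(build_template("ami-12345678"), indent=2))
```

The printed JSON could be saved to a file and used to create a CloudFormation stack once the placeholder image ID is replaced with a real one for your region.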

Next steps

This solution can recover a failed EC2 instance, but only within the same availability zone (one or more isolated data centers within a region). If the whole availability zone is affected by an outage, your EC2 instance will fail and auto-recovery cannot bring it back. It is possible to plan for an outage of an availability zone, too. If you are interested, I recommend our book Amazon Web Services in Action or the AWS documentation about Auto Scaling and ELB.

Andreas Wittig

I’ve been building on AWS since 2012 together with my brother Michael. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, attachmentAV, HyperEnv, and marbot.
