Werner Vogels, CTO of Amazon, is often quoted as saying, “Everything fails all the time.” This does not mean AWS is an unreliable cloud provider. Quite the contrary: AWS plans for failure. Its services are either highly available or fault tolerant by default, or they offer tools to achieve this goal.
An EC2 instance (virtual machine) is not highly available by default. The underlying virtualization layer, the operating system of the host system, and the hardware of the host system are all possible points of failure. If one of these parts breaks, the EC2 instance becomes unavailable.
AWS offers tools to handle the failure of an EC2 instance. The following figure shows the easiest way to recover from a failure:
- The EC2 instance fails for one of the previously described reasons.
- A health check of the EC2 instance is performed automatically in the background and reported to CloudWatch, the monitoring service from AWS.
- A CloudWatch alarm triggers the recovery of the EC2 instance if the health check detects a failure.
- A new EC2 instance will be started automatically to replace the failed one.
- The new EC2 instance is a clone of the failed EC2 instance. The instance ID and the private and public IP addresses stay the same. As long as data is stored on EBS volumes, no data is lost.
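The steps above hinge on a CloudWatch alarm that watches the system status check and triggers the EC2 recover action. As a minimal sketch, the following Python function builds the parameters for such an alarm in the shape expected by the CloudWatch PutMetricAlarm API. The function name, the alarm name, and the period and threshold values are assumptions chosen for illustration; they are not taken from the template mentioned in this post.

```python
def recovery_alarm_params(instance_id, region, periods=5):
    """Build PutMetricAlarm parameters that recover an instance
    when its system status check keeps failing.

    The alarm name and timing values are illustrative assumptions.
    """
    return {
        "AlarmName": f"recover-{instance_id}",
        "Namespace": "AWS/EC2",
        # 1 while the system status check fails, 0 otherwise
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,  # seconds per evaluation period
        "EvaluationPeriods": periods,
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 0,
        # Built-in alarm action that recovers the instance
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }
```

Assuming boto3 is installed and credentials are configured, the result could be passed on with `boto3.client("cloudwatch").put_metric_alarm(**recovery_alarm_params(instance_id, region))`. Using the minimum over several consecutive periods means the alarm only fires when the check fails continuously, not on a single blip.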
The following components are needed to set up auto-recovery for EC2 instances:
- EC2 instance from C3, C4, M3, M4, R3, or T2 family
- CloudWatch alarm based on health check
- Elastic IP if you want to keep the same public IP address after an auto-recovery
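To show how these three components fit together, here is a rough sketch that assembles them into a CloudFormation template as a Python dictionary. It is not the template from the repository mentioned in this post: the function name, the logical resource names, the alarm timing values, and the placeholder IDs in the usage example are all assumptions made for illustration.

```python
import json

# Instance families named in the list above.
SUPPORTED_FAMILIES = {"c3", "c4", "m3", "m4", "r3", "t2"}


def build_template(instance_type, image_id, subnet_id):
    """Return a CloudFormation template (as a dict) with an EC2 instance,
    an Elastic IP, and a CloudWatch alarm that triggers recovery."""
    family = instance_type.split(".")[0].lower()
    if family not in SUPPORTED_FAMILIES:
        raise ValueError(f"family {family!r} is not in the supported list")
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "Instance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "InstanceType": instance_type,
                    "ImageId": image_id,
                    "SubnetId": subnet_id,
                },
            },
            # Keeps the public IP address stable across a recovery
            "ElasticIP": {
                "Type": "AWS::EC2::EIP",
                "Properties": {"Domain": "vpc", "InstanceId": {"Ref": "Instance"}},
            },
            # Alarm on the system status check; the action recovers the instance
            "RecoveryAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "StatusCheckFailed_System",
                    "Dimensions": [
                        {"Name": "InstanceId", "Value": {"Ref": "Instance"}}
                    ],
                    "Statistic": "Minimum",
                    "Period": 60,
                    "EvaluationPeriods": 5,
                    "ComparisonOperator": "GreaterThanThreshold",
                    "Threshold": 0,
                    "AlarmActions": [
                        {"Fn::Sub": "arn:aws:automate:${AWS::Region}:ec2:recover"}
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical placeholder IDs; replace with values from your account.
    template = build_template("t2.micro", "ami-12345678", "subnet-12345678")
    print(json.dumps(template, indent=2))
```

The printed JSON could be saved to a file and used to create a stack with CloudFormation. Generating the template from code is just one option; writing the same resources directly in a YAML or JSON file works equally well.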
I have written a template that you can use to launch an EC2 instance with auto-recovery. It uses Infrastructure as Code to create the needed components and links. You can use AWS CloudFormation to create your EC2 instance with auto-recovery in minutes. The GitHub repository widdix/aws-cf-templates contains the CloudFormation template for EC2 with auto-recovery and some more useful templates.
This solution can recover a failed EC2 instance, but only within the same availability zone (roughly speaking, a data center). If the whole availability zone is affected by an outage, your EC2 instance will fail and cannot be recovered this way. It is possible to plan for an outage of an availability zone, too. If you are interested, I recommend our book Amazon Web Services in Action or the AWS documentation about Auto Scaling and ELB.