EBS Snapshot Pitfalls: Does your backup withstand reality?
Does your disaster recovery plan deliver what it promises? Here are three reasons why your plan won’t stand up to reality. Learn about common pitfalls when backing up EC2 instances with the help of EBS snapshots.
AWS describes EBS snapshots as crash-consistent. What does that mean? Imagine that a machine suddenly breaks. In what condition is the data on the hard disk? We don’t know whether the application and operating system wrote a consistent state to disk before the interruption. In other words: a crash-consistent snapshot offers no guarantee for disaster recovery. You might restore corrupt or inconsistent data.
Why is that? All EBS cares about are data blocks; it reads and writes ones and zeros. The operating system running on EC2 is responsible for persisting data on behalf of the application.
Be aware that even official tools like AWS Backup do not offer a solution to this problem - not to mention the countless third-party providers.
Solution: Before creating a snapshot, tell the application and operating system to write a consistent state to disk. For example, on Linux, use AWS Systems Manager Automation to halt the application and ask the operating system to flush its caches before creating an EBS snapshot. Running on Windows? Check out "Create a VSS application-consistent snapshot" in the AWS documentation.
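The Linux steps could be sketched as the commands an automation might run on the instance. This is a minimal sketch, not a finished runbook: the service name `myapp`, the mount point `/data`, and the volume ID are hypothetical, the `run` helper with its `DRY_RUN` switch is made up for illustration, and the sketch assumes the data lives on its own filesystem (freezing the root filesystem is not advisable) and that the commands run as root.

```shell
#!/usr/bin/env bash
# Sketch only: service name, mount point, and volume ID are hypothetical.

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

consistent_snapshot() {
  local service="$1" mount_point="$2" volume_id="$3" status=0

  run systemctl stop "$service" || return 1   # halt the application
  run sync                                    # flush dirty pages to disk
  run fsfreeze --freeze "$mount_point" || status=$?
  if [ "$status" -eq 0 ]; then
    # Start the snapshot while the filesystem is frozen.
    run aws ec2 create-snapshot --volume-id "$volume_id" \
      --description "consistent snapshot of $mount_point" || status=$?
    # Always unfreeze, even if the snapshot call failed.
    run fsfreeze --unfreeze "$mount_point"
  fi
  run systemctl start "$service"              # resume the application
  return "$status"
}

# Dry run: prints the six commands without executing them.
DRY_RUN=1 consistent_snapshot myapp /data vol-0123456789abcdef0
```

The sketch unfreezes the filesystem as soon as the `create-snapshot` call returns. That relies on the snapshot capturing the volume at the moment it is initiated, so the application does not have to stay down until the snapshot completes.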
Creating a volume from an EBS snapshot typically takes only a few seconds. However, the data is not fully available from the beginning. Instead, EBS restores the blocks asynchronously in the background, and reading a block that has not been loaded yet is slow. You will notice high latencies during that period.
If latency matters to your system, you should initialize a restored volume before ramping up traffic. To do so, read all blocks from the volume once. See Initialize Amazon EBS volumes for more detailed information. The following command does the trick on Linux; replace /dev/xvda with the device name of the restored volume.
fio --filename=/dev/xvda --rw=read --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --name=ebs-restore
Depending on the volume size, the volume type, and the EC2 instance type, initializing the volume might take a while. For example, it will take about 23 minutes to initialize an EBS volume of type gp3 with 500 GB connected to an EC2 instance of type m5a.large. So make sure to add the initialization phase when planning the Recovery Time Objective (RTO). The bottleneck is the maximum throughput from EC2 instance to EBS volume of 360 MB/s. See Amazon EBS–optimized instances for more details.
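The 23 minutes follow from dividing the volume size by the throughput: 500 GB at 360 MB/s is roughly 1,389 seconds. A small helper makes it easy to repeat the estimate for other volumes. The function name `estimate_init_minutes` is made up for this sketch, and it assumes 1 GB = 1000 MB and that the throughput is sustained for the whole run.

```shell
#!/usr/bin/env bash
# Estimate how long reading every block of a restored volume takes.
# Arguments: volume size in GB, sustained throughput in MB/s.
estimate_init_minutes() {
  local size_gb="$1" throughput_mbs="$2"
  # 1 GB is treated as 1000 MB; integer division rounds down.
  echo $(( size_gb * 1000 / throughput_mbs / 60 ))
}

estimate_init_minutes 500 360   # prints 23
```

If your instance cannot sustain the maximum throughput for the whole initialization, plug in the throughput you actually measure. The estimate grows accordingly.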
AWS also offers a feature called Amazon EBS fast snapshot restore. With fast snapshot restore enabled, it is no longer necessary to initialize a restored volume as described above. The volume is ready for latency-critical workloads from the start. However, the pricing for fast snapshot restore makes clear that this feature is not intended for this use case: roughly $500 per month for each snapshot and availability zone.
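To see how quickly that adds up, here is a rough cost estimate. The hourly price of 0.75 USD per snapshot and availability zone is an assumption for illustration, not a figure from this article, so check the current pricing page for your region. The function name `fsr_monthly_cost` is hypothetical as well.

```shell
#!/usr/bin/env bash
# Estimate the monthly cost of fast snapshot restore.
# Arguments: number of snapshots, number of availability zones,
#            price in USD per snapshot, zone, and hour (assumed value).
fsr_monthly_cost() {
  local snapshots="$1" zones="$2" hourly_price="$3"
  # 730 is the average number of hours in a month.
  awk -v s="$snapshots" -v z="$zones" -v p="$hourly_price" \
    'BEGIN { printf "%.2f\n", s * z * p * 730 }'
}

fsr_monthly_cost 1 1 0.75   # prints 547.50
```

Keeping the feature enabled for the latest snapshot of several volumes in more than one availability zone multiplies that amount, which is why it rarely fits a routine backup schedule.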
Solution: Consider the time it takes to initialize a recovered volume when planning for disaster recovery. While creating an EBS volume based on a snapshot usually takes only a few seconds, it is necessary to read all blocks of the restored volume once to ensure low latency.
When someone else is operating our infrastructure, we need to rely on SLAs to ensure the provider complies with our requirements. Therefore, most AWS services come with a well-defined SLA.
Unfortunately, the SLA for EC2 and EBS is very unspecific. AWS does not specify any objectives when it comes to restoring EBS snapshots. Therefore, it isn’t easy to evaluate to what extent the recovery process can be relied upon. For example, what will happen during a significant outage when many customers decide to restore their machines from snapshots in another availability zone or region at the same time?
Solution: There is no technical solution to this problem. Talk to your AWS representative about this and ask them to be more specific about their SLA.
To ensure that you can recover all data in an emergency, there are a few stumbling blocks to avoid.
- By default, creating an EBS snapshot results in a crash-consistent backup. Restoring the snapshot might lead to corrupt or inconsistent data. To avoid that, make sure to halt the application and flush caches before creating a snapshot.
- EBS restores data from snapshots asynchronously. Therefore, you should initialize the volume by reading all blocks before ramping up your workload.
- Unfortunately, AWS does not publish any information about the extent to which we can rely on the recovery process of EBS snapshots. Especially if you want to plan for significant outages on AWS, this is very unsatisfactory. AWS needs to improve here.
By the way, are you interested in an example of how to use AWS Systems Manager Automation to create application-consistent snapshots on Linux? Please let me know! I’m considering writing a blog post about that.