Fallback to on-demand EC2 instances if spot capacity is unavailable
In recent months, I was reminded once again that EC2 spot capacity is not always available. For years, I have been looking for a safety net for my spot-based Auto Scaling Groups (ASGs). The idea: if spot capacity is unavailable, launch on-demand EC2 instances and replace them with spot instances as soon as spot capacity is back. After many proofs of concept, I want to share my approach to the problem.
I assume your existing ASG is configured to spread the load across as many availability zones and instance types as possible. I also encourage you to enable Capacity Rebalancing to handle spot interruptions. On top of that, add the following resources to implement the on-demand safety net:
- Fallback ASG to launch on-demand EC2 instances
- Two step scaling policies to scale up/down the fallback ASG
- Two CloudWatch alarms to trigger the scaling policies
Enable your existing ASG to emit the CloudWatch metrics
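As a rough sketch, group metrics could be enabled with boto3 along the following lines. The metric names are an assumption on my part: `GroupDesiredCapacity` and `GroupInServiceInstances` are what the formula further down appears to need. The helper names and the ASG name `my-spot-asg` are hypothetical.

```python
def metrics_collection_request(asg_name: str) -> dict:
    """Build the arguments for the Auto Scaling EnableMetricsCollection call."""
    return {
        "AutoScalingGroupName": asg_name,
        # Assumption: these two group metrics are enough for the formula.
        "Metrics": ["GroupDesiredCapacity", "GroupInServiceInstances"],
        # Group metrics are only offered at 1-minute granularity.
        "Granularity": "1Minute",
    }


def enable_group_metrics(asg_name: str) -> None:
    """Send the request. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("autoscaling")
    client.enable_metrics_collection(**metrics_collection_request(asg_name))


if __name__ == "__main__":
    # "my-spot-asg" is a placeholder name.
    print(metrics_collection_request("my-spot-asg"))
```

The fallback ASG would need the same treatment, because its desired capacity is part of the formula as well.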
Add a new ASG to spin up on-demand capacity. Use the same launch template/configuration as your spot ASG.
The trick is to use the following formula to calculate the number of instances that need to be added/removed from the fallback ASG:
result = desired spot - running spot - desired fallback
The following table helps you to understand the formula with some examples:
| example | desired spot | running spot | desired fallback | result |
|---|---|---|---|---|
| all good, spot capacity is available | 4 | 4 | 0 | 0 |
| spot capacity is missing | 4 | 3 | 0 | 1 |
| spot capacity is missing, but fallback capacity is already started | 4 | 3 | 1 | 0 |
| spot capacity is available; fallback capacity can be removed | 4 | 4 | 1 | -1 |
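To make the arithmetic concrete, here is a small sketch that evaluates the formula for the four rows of the table. The function name `fallback_delta` is made up for illustration.

```python
def fallback_delta(desired_spot: int, running_spot: int, desired_fallback: int) -> int:
    """Instances to add (positive) or remove (negative) from the fallback ASG."""
    return desired_spot - running_spot - desired_fallback


# (description, desired spot, running spot, desired fallback) from the table
EXAMPLES = [
    ("all good, spot capacity is available", 4, 4, 0),
    ("spot capacity is missing", 4, 3, 0),
    ("spot missing, fallback already started", 4, 3, 1),
    ("spot available, fallback can be removed", 4, 4, 1),
]

if __name__ == "__main__":
    for description, desired, running, fallback in EXAMPLES:
        print(f"{description}: {fallback_delta(desired, running, fallback)}")
```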
The following logic is needed to work with the result of the formula:
result > 0: increase the desired capacity of the fallback ASG by result.
result < 0: decrease the desired capacity of the fallback ASG by the absolute value of result.
Else: do nothing.
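Expressed as plain code, the decision could look like the sketch below. The function name is hypothetical, and clamping at zero is my own assumption (an ASG cannot have a negative desired capacity); the actual setup relies on alarms and scaling policies rather than a script.

```python
def next_fallback_capacity(current_fallback: int, result: int) -> int:
    """Return the new desired capacity of the fallback ASG for a formula result."""
    if result > 0:
        # Spot capacity is missing: add on-demand instances.
        return current_fallback + result
    if result < 0:
        # Spot capacity is back: remove on-demand instances.
        # Assumption: never go below zero instances.
        return max(0, current_fallback + result)
    # result == 0: spot and fallback together already cover the demand.
    return current_fallback


if __name__ == "__main__":
    print(next_fallback_capacity(0, 1))   # spot capacity is missing
    print(next_fallback_capacity(1, 0))   # nothing to do
    print(next_fallback_capacity(1, -1))  # fallback can be removed
```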
The logic can be implemented with CloudWatch alarms and step scaling policies.
CloudWatch alarms trigger the step scaling policies to scale up/down the fallback ASG. To reduce noise caused by auto-scaling activities in the spot ASG, I configured the alarms to fire only if the formula result is positive (or negative) three times in a row. The two CloudWatch alarms are mostly identical, except for the direction of the comparison against zero and the scaling policy each one triggers.
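A minimal sketch of how such an alarm could be described for the CloudWatch `PutMetricAlarm` API, using metric math to evaluate the formula. Several details are assumptions rather than the original configuration: the `Average` statistic, the query IDs, the alarm names, and the helper `build_fallback_alarm` itself. The 60-second period and the three evaluation periods follow the "three times in a row" rule described above.

```python
def _asg_metric(query_id: str, asg_name: str, metric_name: str) -> dict:
    """One input metric of the metric math expression (not returned itself)."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/AutoScaling",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
            },
            "Period": 60,
            "Stat": "Average",  # assumption: any statistic works for a 1-minute gauge
        },
        "ReturnData": False,
    }


def build_fallback_alarm(spot_asg: str, fallback_asg: str, direction: str, policy_arn: str) -> dict:
    """Build put_metric_alarm arguments; direction is 'up' or 'down'."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    operator = "GreaterThanThreshold" if direction == "up" else "LessThanThreshold"
    return {
        "AlarmName": f"{fallback_asg}-fallback-{direction}",  # hypothetical naming scheme
        "Metrics": [
            _asg_metric("desired_spot", spot_asg, "GroupDesiredCapacity"),
            _asg_metric("running_spot", spot_asg, "GroupInServiceInstances"),
            _asg_metric("desired_fallback", fallback_asg, "GroupDesiredCapacity"),
            {
                "Id": "result",
                "Expression": "desired_spot - running_spot - desired_fallback",
                "Label": "fallback instances to add or remove",
                "ReturnData": True,
            },
        ],
        "ComparisonOperator": operator,
        "Threshold": 0,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "AlarmActions": [policy_arn],
    }
```

With boto3 installed and credentials configured, the dictionary could be passed on as `boto3.client("cloudwatch").put_metric_alarm(**params)`.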
In an ideal world, we could use the result of the formula to change the desired capacity directly. Remember, the formula calculates the instances that need to be added (positive values)/removed (negative values) from the fallback ASG. Unfortunately, we must take a slight detour via a step scaling policy.
- The CloudWatch alarm triggers the step scaling policy with the formula result.
- The step scaling policy translates the received value into a change in capacity (adjustment)…
- …and updates the desired count of the ASG.
You can configure how the step scaling policy transforms the value from CloudWatch into a change in capacity by defining step adjustments. A step is defined by a lower and upper bound and a change in capacity.
I use the following steps to translate from the formula result to a change in desired capacity:
| policy | range | change in desired capacity |
|---|---|---|
| up | 0 <= result < 2 | +1 |
| up | 2 <= result < 3 | +2 |
| up | 3 <= result < 4 | +3 |
| up | 4 <= result < 5 | +4 |
| up | 5 <= result < 10 | +5 |
| up | 10 <= result < 25 | +10 |
| up | 25 <= result < +infinity | +25 |
| down | 0 >= result > -2 | -1 |
| down | -2 >= result > -3 | -2 |
| down | -3 >= result > -4 | -3 |
| down | -4 >= result > -5 | -4 |
| down | -5 >= result > -infinity | -5 |
You can define up to 20 adjustments per step scaling policy.
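The sketch below encodes the steps from the table and shows two things: how a formula result maps to a change in capacity, and how the steps could be written in the `StepAdjustments` shape that the Auto Scaling `PutScalingPolicy` API accepts. The step bounds in that API are relative to the alarm threshold, which I assume to be 0 here, so they match the table directly. The names `UP_STEPS`, `DOWN_STEPS`, `capacity_change`, and `to_step_adjustments` are made up for this example.

```python
# (lower bound, upper bound, change in capacity); None means unbounded.
UP_STEPS = [(0, 2, 1), (2, 3, 2), (3, 4, 3), (4, 5, 4), (5, 10, 5), (10, 25, 10), (25, None, 25)]
DOWN_STEPS = [(-2, 0, -1), (-3, -2, -2), (-4, -3, -3), (-5, -4, -4), (None, -5, -5)]


def capacity_change(result: float) -> int:
    """Translate a formula result into the adjustment of the matching step."""
    if result > 0:
        # Above the threshold: lower bound inclusive, upper bound exclusive.
        for lower, upper, change in UP_STEPS:
            if result >= lower and (upper is None or result < upper):
                return change
    elif result < 0:
        # Below the threshold: upper bound inclusive, lower bound exclusive.
        for lower, upper, change in DOWN_STEPS:
            if result <= upper and (lower is None or result > lower):
                return change
    return 0  # the alarm does not fire for a result of exactly 0


def to_step_adjustments(steps) -> list:
    """Convert the tuples into the StepAdjustments list of PutScalingPolicy."""
    adjustments = []
    for lower, upper, change in steps:
        step = {"ScalingAdjustment": change}
        if lower is not None:
            step["MetricIntervalLowerBound"] = lower
        if upper is not None:
            step["MetricIntervalUpperBound"] = upper
        adjustments.append(step)
    return adjustments


if __name__ == "__main__":
    for value in (1, 3, 7, 30, -1, -2, -9):
        print(value, capacity_change(value))
```

For these steps to behave as changes in capacity, the policy would use the `StepScaling` policy type with the `ChangeInCapacity` adjustment type.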
The following graph shows the fallback in action:
The red line shows the desired spot, the orange line shows the running spot, and the green line shows the running fallback.
- 9:25 two spot instances are desired and running (desired spot = 2; running spot = 2).
- 9:27 one additional spot instance is requested (desired spot = 3).
- 9:32 spot capacity is not available; one fallback instance is running (desired spot = 3; running spot = 2; running fallback = 1).
- 9:35 one additional spot instance is requested (desired spot = 4).
- 9:40 spot capacity is not available; two fallback instances are running (desired spot = 4; running spot = 2; running fallback = 2).
As you can see, it takes around 5 minutes for on-demand capacity to replace the missing spot capacity. This is caused by the 3 x 1-minute delay added by the CloudWatch alarm configuration and the delay introduced by starting an EC2 instance before it influences the GroupInServiceInstances metric. You could remove up to 2 minutes of delay by adjusting the CloudWatch alarms to only wait for one or two threshold violations before triggering the scaling action.