Fallback to on-demand EC2 instances if spot capacity is unavailable
In recent months, I was again reminded that EC2 spot capacity is not always available. For years, I have been looking for a safety net for my spot-based Auto Scaling Groups (ASGs): if spot capacity is unavailable, launch on-demand EC2 instances and replace them with spot instances as soon as spot capacity is back. After many proofs of concept, I want to share my approach to the problem.
I assume your existing ASG is configured to spread the load across as many availability zones and instance types as possible. I also encourage you to enable Capacity Rebalancing to handle spot interruptions. On top of that, add the following resources to implement the on-demand safety net:
- Fallback ASG to launch on-demand EC2 instances
- Two step scaling policies to scale up/down the fallback ASG
- Two CloudWatch alarms to trigger the scaling policies
Configure existing ASG
Enable your existing ASG to emit the CloudWatch metrics GroupInServiceInstances and GroupDesiredCapacity.
In CloudFormation:
SpotAutoScalingGroup: |
Configure additional fallback ASG
Add a new ASG to spin up on-demand capacity. Use the same launch template/configuration as your spot ASG.
FallbackAutoScalingGroup: |
Create CloudWatch alarms to trigger auto-scaling
The trick is to use the following formula to calculate the number of instances that need to be added/removed from the fallback ASG:
result = desired spot - running spot - desired fallback
The following table helps you to understand the formula with some examples:
example | desired spot | running spot | desired fallback | result |
---|---|---|---|---|
all good, spot capacity is available | 4 | 4 | 0 | 0 |
spot capacity is missing | 4 | 3 | 0 | 1 |
spot capacity is missing, but fallback capacity is already started | 4 | 3 | 1 | 0 |
spot capacity is available; fallback capacity can be removed | 4 | 4 | 1 | -1 |
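To double-check the table, here is a minimal Python sketch of the formula. The function name `fallback_delta` and its parameter names are my own choices for illustration; they do not correspond to any AWS API.

```python
def fallback_delta(desired_spot: int, running_spot: int, desired_fallback: int) -> int:
    """Instances to add to (positive) or remove from (negative) the fallback ASG."""
    return desired_spot - running_spot - desired_fallback


# The four rows of the table: (desired spot, running spot, desired fallback)
examples = [
    (4, 4, 0),  # all good, spot capacity is available
    (4, 3, 0),  # spot capacity is missing
    (4, 3, 1),  # spot capacity is missing, fallback already started
    (4, 4, 1),  # spot capacity is back, fallback can be removed
]

results = [fallback_delta(*row) for row in examples]
print(results)  # [0, 1, 0, -1]
```

The printed values match the result column of the table.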
The following logic is needed to work with the result of the formula:
If result > 0: increase the desired capacity of the fallback ASG by result.
Else if result < 0: decrease the desired capacity of the fallback ASG by the absolute value of result.
Else: do nothing.
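As a rough sketch, the decision logic could look like this in Python. The function `next_fallback_capacity` is hypothetical, and clamping at zero is an assumption on my part, since an ASG cannot have a negative desired capacity.

```python
def next_fallback_capacity(current: int, result: int) -> int:
    """Return the new desired capacity of the fallback ASG for a formula result."""
    if result > 0:
        # Spot capacity is missing: add on-demand instances.
        return current + result
    if result < 0:
        # Spot capacity is back: result is negative, so this removes
        # abs(result) instances. Clamping at zero is an assumption.
        return max(0, current + result)
    # result == 0: nothing to do.
    return current


print(next_fallback_capacity(0, 1))   # 1
print(next_fallback_capacity(1, -1))  # 0
print(next_fallback_capacity(1, 0))   # 1
```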
The logic can be implemented with CloudWatch alarms and step scaling policies.
CloudWatch alarms trigger the step scaling policies to scale up/down the fallback ASG. To reduce noise caused by auto-scaling activities in the spot ASG, I configured the alarms to fire only if the formula result is positive/negative three times in a row. The following two CloudWatch alarms are mostly identical, except for the ComparisonOperator.
FallbackScaleUpAlarm: |
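To illustrate the "three times in a row" rule, here is a simplified Python model of the two alarms. It is not how CloudWatch is implemented; the function `alarm_fires`, the threshold of 0, and the one-datapoint-per-minute assumption are mine, and details such as missing-data handling are ignored.

```python
from typing import Sequence


def alarm_fires(results: Sequence[float], periods: int = 3, scale_up: bool = True) -> bool:
    """True if the last `periods` formula results all breach a threshold of 0.

    scale_up=True models the "greater than" alarm, scale_up=False the
    "less than" alarm. Both the threshold and the names are assumptions.
    """
    if len(results) < periods:
        return False  # not enough datapoints yet
    recent = results[-periods:]
    if scale_up:
        return all(r > 0 for r in recent)
    return all(r < 0 for r in recent)


print(alarm_fires([0, 1, 1, 1]))                   # True
print(alarm_fires([1, 0, 1]))                      # False
print(alarm_fires([-1, -1, -1], scale_up=False))   # True
```

A single zero in the window resets the streak, which is what filters out short-lived differences while the spot ASG is still launching instances.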
In an ideal world, we could use the result of the formula to change the desired capacity directly. Remember, the formula calculates the instances that need to be added (positive values)/removed (negative values) from the fallback ASG. Unfortunately, we must take a slight detour via a step scaling policy.
- The CloudWatch alarm triggers the step scaling policy with the formula result.
- The step scaling policy translates the received value into a change in capacity (adjustment)…
- …and updates the desired count of the ASG.
You can configure how the step scaling policy transforms the value from CloudWatch into a change in capacity by defining step adjustments. A step is defined by a lower and upper bound and a change in capacity.
I use the following steps to translate from the formula result to a change in desired capacity:
policy | range | change in desired capacity |
---|---|---|
up | 0 <= result < 2 | +1 |
up | 2 <= result < 3 | +2 |
up | 3 <= result < 4 | +3 |
up | 4 <= result < 5 | +4 |
up | 5 <= result < 10 | +5 |
up | 10 <= result < 25 | +10 |
up | 25 <= result < +infinity | +25 |
down | 0 >= result > -2 | -1 |
down | -2 >= result > -3 | -2 |
down | -3 >= result > -4 | -3 |
down | -4 >= result > -5 | -4 |
down | -5 >= result > -infinity | -5 |
You can define up to 20 adjustments per step scaling policy.
FallbackScaleUp: |
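The step table can be sketched as a small lookup in Python. This only models the translation from formula result to capacity change; the names `UP_STEPS`, `DOWN_STEPS`, and `capacity_change` are hypothetical. In AWS, step bounds are offsets from the alarm threshold, so the sketch assumes a threshold of 0, which makes the offsets equal to the raw result.

```python
import math

# Scale-up steps: (lower bound inclusive, upper bound exclusive, adjustment)
UP_STEPS = [
    (0, 2, 1), (2, 3, 2), (3, 4, 3), (4, 5, 4),
    (5, 10, 5), (10, 25, 10), (25, math.inf, 25),
]

# Scale-down steps: (upper bound inclusive, lower bound exclusive, adjustment)
DOWN_STEPS = [
    (0, -2, -1), (-2, -3, -2), (-3, -4, -3), (-4, -5, -4),
    (-5, -math.inf, -5),
]


def capacity_change(result: float) -> int:
    """Translate a formula result into a change in desired capacity."""
    if result > 0:
        for lower, upper, adjustment in UP_STEPS:
            if lower <= result < upper:
                return adjustment
    elif result < 0:
        for upper, lower, adjustment in DOWN_STEPS:
            if upper >= result > lower:
                return adjustment
    return 0  # result == 0: neither alarm fires, no change


for value in (1, 2, 7, 30, -1, -9):
    print(value, capacity_change(value))
```

Because the steps get coarser for larger values, a shortfall of 7 adds only 5 instances at first; the remaining shortfall of 2 still shows up in the formula on later evaluations.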
Summary
The following graph shows the fallback in action:
The red line shows the desired spot, the orange line shows the running spot, and the green line shows the running fallback.
- 9:25 two spot instances are desired and running (desired spot = 2; running spot = 2).
- 9:27 one additional spot instance is requested (desired spot = 3).
- 9:32 spot capacity is not available; one fallback instance is running (desired spot = 3; running spot = 2; running fallback = 1).
- 9:35 one additional spot instance is requested (desired spot = 4).
- 9:40 spot capacity is not available; two fallback instances are running (desired spot = 4; running spot = 2; running fallback = 2).
As you can see, it takes around 5 minutes for on-demand capacity to replace the missing spot capacity. This is caused by the 3 x 1-minute delay added by the CloudWatch alarm configuration and the delay introduced by starting an EC2 instance before it influences the GroupInServiceInstances metric. You could remove up to 2 minutes of delay by adjusting the CloudWatch alarms to only wait for one or two threshold violations before triggering the scaling action.