Fallback to on-demand EC2 instances if spot capacity is unavailable

Michael Wittig – 21 Mar 2023

In recent months, I was reminded once again that EC2 spot capacity is not always available. For years, I have been looking for a safety net for my spot-based Auto Scaling Groups (ASGs): if spot capacity is unavailable, launch on-demand EC2 instances and replace them with spot instances as soon as spot capacity is back. After many proofs of concept, I want to share my approach to the problem.

Safety net

I assume your existing ASG is configured to spread the load across as many availability zones and instance types as possible. I also encourage you to enable Capacity Rebalancing to handle spot interruptions. On top of that, add the following resources to implement the on-demand safety net:

  • Fallback ASG to launch on-demand EC2 instances
  • Two step scaling policies to scale up/down the fallback ASG
  • Two CloudWatch alarms to trigger the scaling policies

Configure existing ASG

Enable your existing ASG to emit the CloudWatch metrics GroupInServiceInstances and GroupDesiredCapacity.

In CloudFormation:

SpotAutoScalingGroup:
  Type: 'AWS::AutoScaling::AutoScalingGroup'
  Properties:
    # [...]
    CapacityRebalance: true
    MaxSize: 10
    MinSize: 2
    MixedInstancesPolicy:
      # [...]
      InstancesDistribution:
        OnDemandAllocationStrategy: prioritized
        OnDemandBaseCapacity: 0
        OnDemandPercentageAboveBaseCapacity: 0
        SpotAllocationStrategy: 'capacity-optimized-prioritized'
    MetricsCollection:
    - Granularity: 1Minute
      Metrics:
      - GroupInServiceInstances
      - GroupDesiredCapacity

Configure additional fallback ASG

Add a new ASG to spin up on-demand capacity. Use the same launch template/configuration as your spot ASG.

FallbackAutoScalingGroup:
  Type: 'AWS::AutoScaling::AutoScalingGroup'
  Properties:
    # [...]
    MetricsCollection:
    - Granularity: 1Minute
      Metrics:
      - GroupInServiceInstances
      - GroupDesiredCapacity
    MaxSize: 10 # set this to the same value as your spot max size
    MinSize: 0

Create CloudWatch alarms to trigger auto-scaling

The trick is to use the following formula to calculate the number of instances that need to be added/removed from the fallback ASG:

result = desired spot - running spot - desired fallback

The following table helps you to understand the formula with some examples:

example                                                            | desired spot | running spot | desired fallback | result
all good, spot capacity is available                               | 4            | 4            | 0                | 0
spot capacity is missing                                           | 4            | 3            | 0                | 1
spot capacity is missing, but fallback capacity is already started | 4            | 3            | 1                | 0
spot capacity is available; fallback capacity can be removed       | 4            | 4            | 1                | -1
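
To make the arithmetic concrete, here is a minimal Python sketch that evaluates the formula for the four example rows. The function name `fallback_delta` is hypothetical and only used for illustration; nothing like it exists in the CloudFormation setup itself.

```python
def fallback_delta(desired_spot: int, running_spot: int, desired_fallback: int) -> int:
    """Instances to add to (positive) or remove from (negative) the fallback ASG."""
    return desired_spot - running_spot - desired_fallback


# The four example rows: (desired spot, running spot, desired fallback)
examples = [(4, 4, 0), (4, 3, 0), (4, 3, 1), (4, 4, 1)]
results = [fallback_delta(*row) for row in examples]
print(results)  # [0, 1, 0, -1]
```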

The following logic is needed to work with the result of the formula:

If result > 0: increase the desired capacity of the fallback ASG by result.
Else if result < 0: decrease the desired capacity of the fallback ASG by the absolute value of result.
Else: do nothing.
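
The same logic as a small Python sketch. The function `apply_result` is a hypothetical helper; the clamping to `min_size` and `max_size` is an assumption that mirrors the MinSize/MaxSize values of the fallback ASG shown earlier, since an ASG does not scale beyond its configured bounds.

```python
def apply_result(fallback_desired: int, result: int,
                 min_size: int = 0, max_size: int = 10) -> int:
    """Return the new desired capacity of the fallback ASG for a formula result."""
    if result > 0:
        new_desired = fallback_desired + result
    elif result < 0:
        # result is negative, so decrease by its absolute value
        new_desired = fallback_desired - abs(result)
    else:
        return fallback_desired
    # Assumption: stay within the ASG's MinSize/MaxSize
    return max(min_size, min(max_size, new_desired))


print(apply_result(0, 1))   # 1
print(apply_result(1, -1))  # 0
print(apply_result(2, 0))   # 2
```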

The logic can be implemented with CloudWatch alarms and step scaling policies.

CloudWatch alarms trigger the step scaling policies to scale up/down the fallback ASG. To reduce noise caused by auto-scaling activities in the spot ASG, I configured the alarms to fire only if the formula result is positive (scale up) or negative (scale down) three times in a row. The following two CloudWatch alarms are mostly identical, except for the AlarmActions and the ComparisonOperator.

FallbackScaleUpAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmActions:
    - !Ref FallbackScaleUp
    ComparisonOperator: GreaterThanThreshold
    EvaluationPeriods: 3 # if for three times in a row...
    Threshold: 0 # ...the formula result is > 0, trigger alarm
    TreatMissingData: notBreaching
    Metrics:
    - Id: running # get the value for running spot
      Label: running
      MetricStat:
        Metric:
          Namespace: 'AWS/AutoScaling'
          MetricName: GroupInServiceInstances
          Dimensions:
          - Name: AutoScalingGroupName
            Value: !Ref SpotAutoScalingGroup
        Period: 60
        Stat: Maximum
      ReturnData: false
    - Id: desired # get the value for desired spot
      Label: desired
      MetricStat:
        Metric:
          Namespace: 'AWS/AutoScaling'
          MetricName: GroupDesiredCapacity
          Dimensions:
          - Name: AutoScalingGroupName
            Value: !Ref SpotAutoScalingGroup
        Period: 60
        Stat: Maximum
      ReturnData: false
    - Id: desiredfallback # get the value for desired fallback
      Label: desiredfallback
      MetricStat:
        Metric:
          Namespace: 'AWS/AutoScaling'
          MetricName: GroupDesiredCapacity
          Dimensions:
          - Name: AutoScalingGroupName
            Value: !Ref FallbackAutoScalingGroup
        Period: 60
        Stat: Maximum
      ReturnData: false
    - Expression: 'desired-running-desiredfallback' # this is the formula presented earlier
      Id: e1
      Label: 'fallback'
      ReturnData: true
FallbackScaleDownAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmActions:
    - !Ref FallbackScaleDown
    ComparisonOperator: LessThanThreshold
    EvaluationPeriods: 3 # if for three times in a row...
    Threshold: 0 # ...the formula result is < 0, trigger alarm
    TreatMissingData: notBreaching
    Metrics:
    # [...] same as in FallbackScaleUpAlarm
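
The "three times in a row" behavior of both alarms can be modeled roughly as follows. This is a simplified Python sketch, not how CloudWatch evaluates alarms internally; `alarm_fires` and its parameters are hypothetical names, and a missing datapoint is represented by `None` and treated as not breaching, in line with `TreatMissingData: notBreaching`.

```python
import operator


def alarm_fires(datapoints, threshold=0, evaluation_periods=3, compare=operator.gt):
    """True if the most recent `evaluation_periods` datapoints all breach the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False
    # None models a missing datapoint, which counts as not breaching
    return all(v is not None and compare(v, threshold) for v in recent)


# Scale-up alarm: formula result > 0 for three consecutive minutes
print(alarm_fires([0, 1, 1, 1]))                       # True
print(alarm_fires([1, 1, 0]))                          # False
# Scale-down alarm: formula result < 0 for three consecutive minutes
print(alarm_fires([-1, -1, -1], compare=operator.lt))  # True
```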

In an ideal world, we could use the result of the formula to change the desired capacity directly. Remember, the formula calculates the instances that need to be added (positive values)/removed (negative values) from the fallback ASG. Unfortunately, we must take a slight detour via a step scaling policy.

  1. The CloudWatch alarm triggers the step scaling policy with the formula result.
  2. The step scaling policy translates the received value into a change in capacity (adjustment)…
  3. …and updates the desired count of the ASG.

You can configure how the step scaling policy transforms the value from CloudWatch into a change in capacity by defining step adjustments. A step is defined by a lower and upper bound and a change in capacity.

I use the following steps to translate from the formula result to a change in desired capacity:

policy | range                     | change in desired capacity
up     | 0 <= result < 2           | +1
up     | 2 <= result < 3           | +2
up     | 3 <= result < 4           | +3
up     | 4 <= result < 5           | +4
up     | 5 <= result < 10          | +5
up     | 10 <= result < 25         | +10
up     | 25 <= result < +infinity  | +25
down   | 0 >= result > -2          | -1
down   | -2 >= result > -3         | -2
down   | -3 >= result > -4         | -3
down   | -4 >= result > -5         | -4
down   | -5 >= result > -infinity  | -5

You can define up to 20 adjustments per step scaling policy.

FallbackScaleUp:
  Type: 'AWS::AutoScaling::ScalingPolicy'
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref FallbackAutoScalingGroup
    EstimatedInstanceWarmup: 300
    MetricAggregationType: Average
    PolicyType: StepScaling
    StepAdjustments: # the lower bound is inclusive and the upper bound is exclusive
    - MetricIntervalLowerBound: 0
      MetricIntervalUpperBound: 2
      ScalingAdjustment: 1
    - MetricIntervalLowerBound: 2
      MetricIntervalUpperBound: 3
      ScalingAdjustment: 2
    - MetricIntervalLowerBound: 3
      MetricIntervalUpperBound: 4
      ScalingAdjustment: 3
    - MetricIntervalLowerBound: 4
      MetricIntervalUpperBound: 5
      ScalingAdjustment: 4
    - MetricIntervalLowerBound: 5
      MetricIntervalUpperBound: 10
      ScalingAdjustment: 5
    - MetricIntervalLowerBound: 10
      MetricIntervalUpperBound: 25
      ScalingAdjustment: 10
    - MetricIntervalLowerBound: 25
      ScalingAdjustment: 25
FallbackScaleDown:
  Type: 'AWS::AutoScaling::ScalingPolicy'
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref FallbackAutoScalingGroup
    EstimatedInstanceWarmup: 300
    MetricAggregationType: Average
    PolicyType: StepScaling
    StepAdjustments: # the lower bound is exclusive and the upper bound is inclusive
    - MetricIntervalUpperBound: 0
      MetricIntervalLowerBound: -2
      ScalingAdjustment: -1
    - MetricIntervalUpperBound: -2
      MetricIntervalLowerBound: -3
      ScalingAdjustment: -2
    - MetricIntervalUpperBound: -3
      MetricIntervalLowerBound: -4
      ScalingAdjustment: -3
    - MetricIntervalUpperBound: -4
      MetricIntervalLowerBound: -5
      ScalingAdjustment: -4
    - MetricIntervalUpperBound: -5
      ScalingAdjustment: -5
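
To check how a formula result maps to a capacity change, the step adjustments of both policies can be mirrored in a few lines of Python. This is only an illustrative sketch: `adjustment_for` and the two step lists are hypothetical names, and the bounds are compared directly with the formula result, which works here because both alarms use a threshold of 0.

```python
import math

# (lower bound, upper bound, adjustment); lower inclusive, upper exclusive
SCALE_UP_STEPS = [(0, 2, 1), (2, 3, 2), (3, 4, 3), (4, 5, 4),
                  (5, 10, 5), (10, 25, 10), (25, math.inf, 25)]
# (lower bound, upper bound, adjustment); lower exclusive, upper inclusive
SCALE_DOWN_STEPS = [(-2, 0, -1), (-3, -2, -2), (-4, -3, -3),
                    (-5, -4, -4), (-math.inf, -5, -5)]


def adjustment_for(result: float) -> int:
    """Change in desired capacity for a formula result; 0 if no alarm would fire."""
    if result > 0:
        for lower, upper, adjustment in SCALE_UP_STEPS:
            if lower <= result < upper:
                return adjustment
    elif result < 0:
        for lower, upper, adjustment in SCALE_DOWN_STEPS:
            if lower < result <= upper:
                return adjustment
    return 0


print([adjustment_for(r) for r in (1, 2, 7, 12, 30)])  # [1, 2, 5, 10, 25]
print([adjustment_for(r) for r in (-1, -2, -5, -8)])   # [-1, -2, -5, -5]
```

Note that a result between 6 and 9 maps to +5 only, so a single adjustment does not always close the whole gap.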

Summary

The following graph shows the fallback in action:

[Figure: Fallback in action]

The red line shows the desired spot, the orange line shows the running spot, and the green line shows the running fallback.

  • 9:25 two spot instances are desired and running (desired spot = 2; running spot = 2).
  • 9:27 one additional spot instance is requested (desired spot = 3).
  • 9:32 spot capacity not available; one fallback instance is running (desired spot = 3; running spot = 2; running fallback = 1)
  • 9:35 one additional spot instance is requested (desired spot = 4)
  • 9:40 spot capacity not available; two fallback instances are running (desired spot = 4; running spot = 2; running fallback = 2)

As you can see, it takes around 5 minutes for on-demand capacity to replace the missing spot capacity. This is caused by the 3 x 1-minute delay added by the CloudWatch alarm configuration and the delay introduced by starting an EC2 instance before it influences the GroupInServiceInstances metric. You could remove up to 2 minutes of delay by adjusting the CloudWatch alarms to only wait for one or two threshold violations before triggering the scaling action.
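
A back-of-the-envelope sketch of that delay in Python. The function name and the `launch_seconds` default of 120 are assumptions: the launch time is simply what remains of the observed 5 minutes after subtracting the 3 x 1-minute alarm delay, and it will differ for other AMIs and instance types.

```python
def fallback_delay_seconds(evaluation_periods: int,
                           period_seconds: int = 60,
                           launch_seconds: int = 120) -> int:
    """Rough time until on-demand capacity replaces missing spot capacity."""
    alarm_delay = evaluation_periods * period_seconds
    return alarm_delay + launch_seconds  # launch_seconds is an assumed value


print(fallback_delay_seconds(3))  # 300 seconds, the roughly 5 minutes seen above
print(fallback_delay_seconds(1))  # 180 seconds, up to 2 minutes faster
```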

Michael Wittig

I’ve been building on AWS since 2012 together with my brother Andreas. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, HyperEnv for GitHub Actions, and marbot.
