Fallback to on-demand EC2 instances if spot capacity is unavailable

Michael Wittig – 21 Mar 2023

In recent months, I was reminded once again that EC2 spot capacity is not always available. For years, I have been looking for a safety net for my spot-based Auto Scaling Groups (ASGs): if spot capacity is unavailable, launch on-demand EC2 instances and replace them with spot instances as soon as spot capacity is back. After many proofs of concept, I want to share my approach to the problem.

Safety net

I assume your existing ASG is configured to spread the load across as many availability zones and instance types as possible. I also encourage you to enable Capacity Rebalancing to handle spot interruptions. On top of that, add the following resources to implement the on-demand safety net:

Fallback ASG to launch on-demand EC2 instances
Two step scaling policies to scale up/down the fallback ASG
Two CloudWatch alarms to trigger the scaling policies

Configure existing ASG

Enable your existing ASG to emit the CloudWatch metrics GroupInServiceInstances and GroupDesiredCapacity.

In CloudFormation:

SpotAutoScalingGroup:
  Type: 'AWS::AutoScaling::AutoScalingGroup'
  Properties:
    # [...]
    CapacityRebalance: true
    MaxSize: 10
    MinSize: 2
    MixedInstancesPolicy:
      # [...]
      InstancesDistribution:
        OnDemandAllocationStrategy: prioritized
        OnDemandBaseCapacity: 0
        OnDemandPercentageAboveBaseCapacity: 0
        SpotAllocationStrategy: 'capacity-optimized-prioritized'
    MetricsCollection:
    - Granularity: 1Minute
      Metrics:
      - GroupInServiceInstances
      - GroupDesiredCapacity

Configure additional fallback ASG

Add a new ASG to spin up on-demand capacity. Use the same launch template/configuration as your spot ASG.

FallbackAutoScalingGroup:
  Type: 'AWS::AutoScaling::AutoScalingGroup'
  Properties:
    # [...]
    MetricsCollection:
    - Granularity: 1Minute
      Metrics:
      - GroupInServiceInstances
      - GroupDesiredCapacity
    MaxSize: 10 # set this to the same value as your spot max size
    MinSize: 0

Create CloudWatch alarms to trigger auto-scaling

The trick is to use the following formula to calculate the number of instances that need to be added/removed from the fallback ASG:

result = desired spot - running spot - desired fallback

The following table helps you to understand the formula with some examples:

example	desired spot	running spot	desired fallback	result
all good, spot capacity is available	4	4	0	0
spot capacity is missing	4	3	0	1
spot capacity is missing, but fallback capacity is already started	4	3	1	0
spot capacity is available; fallback capacity can be removed	4	4	1	-1

The following logic is needed to work with the result of the formula:

If result > 0: increase the desired capacity of the fallback ASG by result.
Else if result < 0: decrease the desired capacity of the fallback ASG by the absolute value of result.
Else: do nothing.

The logic can be implemented with CloudWatch alarms and step scaling policies.

CloudWatch alarms trigger the step scaling policies to scale up/down the fallback ASG. To reduce noise caused by auto-scaling activities in the spot ASG, I configured the alarms to fire only if the formula result is positive (scale up) or negative (scale down) for three consecutive one-minute periods. The following two CloudWatch alarms are identical except for the AlarmActions and the ComparisonOperator.

FallbackScaleUpAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmActions:
    - !Ref FallbackScaleUp
    ComparisonOperator: GreaterThanThreshold
    EvaluationPeriods: 3 # if for three times in a row...
    Threshold: 0         # ...the formula result is > 0, trigger alarm
    TreatMissingData: notBreaching
    Metrics:
    - Id: running # get the value for running spot
      Label: running
      MetricStat:
        Metric:
          Namespace: 'AWS/AutoScaling'
          MetricName: GroupInServiceInstances
          Dimensions:
          - Name: AutoScalingGroupName
            Value: !Ref SpotAutoScalingGroup
        Period: 60
        Stat: Maximum
      ReturnData: false
    - Id: desired  # get the value for desired spot
      Label: desired
      MetricStat:
        Metric:
          Namespace: 'AWS/AutoScaling'
          MetricName: GroupDesiredCapacity
          Dimensions:
          - Name: AutoScalingGroupName
            Value: !Ref SpotAutoScalingGroup
        Period: 60
        Stat: Maximum
      ReturnData: false
    - Id: desiredfallback # get the value for desired fallback
      Label: desiredfallback
      MetricStat:
        Metric:
          Namespace: 'AWS/AutoScaling'
          MetricName: GroupDesiredCapacity
          Dimensions:
          - Name: AutoScalingGroupName
            Value: !Ref FallbackAutoScalingGroup
        Period: 60
        Stat: Maximum
      ReturnData: false
    - Expression: 'desired-running-desiredfallback' # this is the formula presented earlier
      Id: e1
      Label: 'fallback'
      ReturnData: true
FallbackScaleDownAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmActions:
    - !Ref FallbackScaleDown
    ComparisonOperator: LessThanThreshold
    EvaluationPeriods: 3 # if for three times in a row...
    Threshold: 0         # ...the formula result is < 0, trigger alarm
    TreatMissingData: notBreaching
    Metrics:
      # [...] same as in FallbackScaleUpAlarm

In an ideal world, we could use the result of the formula to change the desired capacity directly. Remember, the formula calculates the instances that need to be added (positive values)/removed (negative values) from the fallback ASG. Unfortunately, we must take a slight detour via a step scaling policy.

The CloudWatch alarm triggers the step scaling policy with the formula result.
The step scaling policy translates the received value into a change in capacity (adjustment)…
…and updates the desired count of the ASG.

You can configure how the step scaling policy transforms the value from CloudWatch into a change in capacity by defining step adjustments. A step is defined by a lower and upper bound and a change in capacity.

I use the following steps to translate from the formula result to a change in desired capacity:

policy	range	change in desired capacity
up	0 <= result < 2	+1
up	2 <= result < 3	+2
up	3 <= result < 4	+3
up	4 <= result < 5	+4
up	5 <= result < 10	+5
up	10 <= result < 25	+10
up	25 <= result < +infinity	+25
down	0 >= result > -2	-1
down	-2 >= result > -3	-2
down	-3 >= result > -4	-3
down	-4 >= result > -5	-4
down	-5 >= result > -infinity	-5

You can define up to 20 adjustments per step scaling policy.

FallbackScaleUp:
  Type: 'AWS::AutoScaling::ScalingPolicy'
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref FallbackAutoScalingGroup
    EstimatedInstanceWarmup: 300
    MetricAggregationType: Average
    PolicyType: StepScaling
    StepAdjustments: # the lower bound is inclusive and the upper bound is exclusive
    - MetricIntervalLowerBound: 0
      MetricIntervalUpperBound: 2
      ScalingAdjustment: 1
    - MetricIntervalLowerBound: 2
      MetricIntervalUpperBound: 3
      ScalingAdjustment: 2
    - MetricIntervalLowerBound: 3
      MetricIntervalUpperBound: 4
      ScalingAdjustment: 3
    - MetricIntervalLowerBound: 4
      MetricIntervalUpperBound: 5
      ScalingAdjustment: 4
    - MetricIntervalLowerBound: 5
      MetricIntervalUpperBound: 10
      ScalingAdjustment: 5
    - MetricIntervalLowerBound: 10
      MetricIntervalUpperBound: 25
      ScalingAdjustment: 10
    - MetricIntervalLowerBound: 25
      ScalingAdjustment: 25
FallbackScaleDown:
  Type: 'AWS::AutoScaling::ScalingPolicy'
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref FallbackAutoScalingGroup
    EstimatedInstanceWarmup: 300
    MetricAggregationType: Average
    PolicyType: StepScaling
    StepAdjustments: # the lower bound is exclusive and the upper bound is inclusive
    - MetricIntervalUpperBound: 0
      MetricIntervalLowerBound: -2
      ScalingAdjustment: -1
    - MetricIntervalUpperBound: -2
      MetricIntervalLowerBound: -3
      ScalingAdjustment: -2
    - MetricIntervalUpperBound: -3
      MetricIntervalLowerBound: -4
      ScalingAdjustment: -3
    - MetricIntervalUpperBound: -4
      MetricIntervalLowerBound: -5
      ScalingAdjustment: -4
    - MetricIntervalUpperBound: -5
      ScalingAdjustment: -5

Summary

The following graph shows the fallback in action:

Fallback in action

The red line shows the desired spot, the orange line shows the running spot, and the green line shows the running fallback.

9:25 two spot instances are desired and running (desired spot = 2; running spot = 2).
9:27 one additional spot instance is requested (desired spot = 3).
9:32 spot capacity not available; one fallback instance is running (desired spot = 3; running spot = 2; running fallback = 1)
9:35 one additional spot instance is requested (desired spot = 4)
9:40 spot capacity not available; two fallback instances are running (desired spot = 4; running spot = 2; running fallback = 2)

As you can see, it takes around 5 minutes for on-demand capacity to replace the missing spot capacity. This is caused by the 3 x 1-minute delay added by the CloudWatch alarm configuration and the delay introduced by starting an EC2 instance before it influences the GroupInServiceInstances metric. You could remove up to 2 minutes of delay by adjusting the CloudWatch alarms to only wait for one or two threshold violations before triggering the scaling action.
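
A faster-reacting scale-up alarm could look like the sketch below. It is the same alarm as before with only EvaluationPeriods changed, so treat it as an illustration of the trade-off and not as a tested recommendation: a single breaching datapoint is more likely to be caused by regular scaling activity in the spot ASG.

```yaml
FallbackScaleUpAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmActions:
    - !Ref FallbackScaleUp
    ComparisonOperator: GreaterThanThreshold
    EvaluationPeriods: 1 # react after one breaching minute instead of three
    Threshold: 0
    TreatMissingData: notBreaching
    Metrics:
      # [...] same as in the original FallbackScaleUpAlarm
```

Setting EvaluationPeriods to 2 is a middle ground that saves about one minute while still filtering out single-datapoint spikes.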

Michael Wittig

I’ve been building on AWS since 2012 together with my brother Andreas. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, HyperEnv for GitHub Actions, and marbot.
