Resilient task scheduling with ECS Fargate

Andreas Wittig – 05 Nov 2019

Many applications use scheduled jobs to automate recurring tasks, such as:

Generating and sending a monthly report.
Disabling users who haven’t logged in for more than 365 days.
Deleting stale data from the database.

Doing so is simple as long as an application runs on a single machine. In that case, cron, a time-based job scheduler, does the trick. But what do you do when your workload is running on ECS and Fargate?

Read on to learn why using cron or ECS scheduled tasks is not an option. On top of that, you will get to know an advanced solution for scheduling jobs on ECS and Fargate.

Don’t: Use cron inside a container

Using cron to trigger scheduled jobs is an obvious but wrong choice, for three reasons:

Running more than one process per container is an anti-pattern. Therefore, running your application and cron in the same container is a no-go.
ECS does not guarantee that exactly one copy of your container is running 24/7. Therefore, using cron inside a container may cause jobs to run multiple times or not at all.
It is not very cost-effective to run a container on Fargate 24/7 to execute a job a few times per day/week/month.

Don’t: Use ECS scheduled tasks

AWS proposes the following solution in their documentation:

Open the AWS Management Console.
Select your ECS cluster.
Create a Scheduled Task based on a fixed interval or cron-like expression.

Behind the scenes, AWS creates a CloudWatch Events Rule which starts an ECS task based on the defined schedule, as shown in the following figure.

A CloudWatch Events Rule starts an ECS task

I strongly recommend against this approach because it is missing an important aspect: monitoring the scheduled job. The approach provides no way to define a timeout for the scheduled job, and it does not retry a failed job. Instead, ECS scheduled tasks operate in fire-and-forget mode. In summary, this approach is not resilient.

Do: Use Step Functions to start and monitor an ECS task

Fortunately, there is a more resilient way to schedule jobs with ECS and Fargate. As shown in the figure below, three components work together:

CloudWatch Events Rule: triggers the state machine based on a schedule.
Step Functions: a state machine orchestrating simple or complex workflows. In this scenario, the state machine starts a container and waits until the container exits.
Fargate: the computing engine for the container executing the scheduled job.

A CloudWatch Events Rule triggers a Step Function which starts an ECS task

The state machine monitors the scheduled job and, if necessary, retries a failed job. Additionally, you should define a timeout for a scheduled job. The state machine then stops a scheduled job once the timeout is reached, which prevents jobs from running endlessly, for example, because of a misconfiguration.

Next, you will learn how to create a CloudWatch Events Rule, Step Functions state machine, and Fargate task definition with the help of CloudFormation.

First, you need to create the basic ECS and Fargate infrastructure consisting of an ECS Cluster, task definition, and security group. I’m skipping the details of how to do so here.

Cluster:
  Type: 'AWS::ECS::Cluster'
  Properties:
    # [...]
TaskDefinition:
  Type: 'AWS::ECS::TaskDefinition'
  Properties:
    # [...]
SecurityGroup:
  Type: 'AWS::EC2::SecurityGroup'
  Properties:
    # [...]

Next, create the state machine. What does the state machine do?

Create a Fargate task based on the task definition defined before.
Restart the Fargate task up to 3 times in case of a failure.
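Stop the Fargate task if the execution has not finished after 600 seconds. The timeout is defined for the whole state machine, so it covers all attempts, including retries.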

StateMachine:
  Type: 'AWS::StepFunctions::StateMachine'
  Properties:
    DefinitionString: !Sub
    - |
      {
        "Version": "1.0",
        "Comment": "Run ECS/Fargate tasks",
        "TimeoutSeconds": ${Timeout},
        "StartAt": "RunTask",
        "States": {
          "RunTask": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
              "LaunchType": "FARGATE",
              "Cluster": "${Cluster}",
              "TaskDefinition": "${TaskDefinition}",
              "NetworkConfiguration": {
                "AwsvpcConfiguration": {
                  "Subnets": ["${SubnetA}", "${SubnetB}"],
                  "AssignPublicIp": "${AssignPublicIp}",
                  "SecurityGroups": ${SecurityGroups}
                }
              }
            },
            "Retry": [
              {
                "ErrorEquals": [
                  "States.TaskFailed"
                ],
                "IntervalSeconds": 10,
                "MaxAttempts": 3,
                "BackoffRate": 2
              }
            ],
            "End": true
          }
        }
      }
    - Cluster: !GetAtt Cluster.Arn
      TaskDefinition: !Ref TaskDefinition
      SubnetA: 'subnet-00680a8cf7dff9021' # Replace with your subnet
      SubnetB: 'subnet-023d074f51ea412d0' # Replace with your subnet
      AssignPublicIp: 'ENABLED'
      SecurityGroups: !Sub '["${SecurityGroup.GroupId}"]' # Rendered as a JSON array, hence no quotes around ${SecurityGroups} in the definition
      Timeout: 600 # 10 minutes
    RoleArn: !GetAtt 'StateMachineRole.Arn'
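
To get a feel for how the Retry settings and the timeout in the state machine above interact, the wait times can be worked out with a few lines of Python. This is only a sketch of the arithmetic, assuming Step Functions waits IntervalSeconds before the first retry and multiplies the wait by BackoffRate for every further retry. The function name retry_waits is made up for this illustration and is not part of any AWS SDK.

```python
def retry_waits(interval_seconds, max_attempts, backoff_rate):
    """Return the wait in seconds before each retry (hypothetical helper)."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# Values taken from the Retry block and the Timeout parameter above.
waits = retry_waits(interval_seconds=10, max_attempts=3, backoff_rate=2)
total_wait = sum(waits)
timeout = 600
runs = 1 + len(waits)  # the initial run plus every retry

print(waits)                 # [10, 20, 40]
print(total_wait)            # 70
print(runs)                  # 4
print(timeout - total_wait)  # 530
```

In other words, the job may run up to four times. Because TimeoutSeconds is set at the top level of the state machine, all four runs and the 70 seconds of waiting share the same 600-second budget, so a job that regularly needs several minutes probably calls for a larger Timeout value.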

Everything needs an IAM role. The following code snippet shows the IAM role for the state machine.

StateMachineRole:
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
      - Effect: Allow
        Principal:
          Service: 'states.amazonaws.com'
        Action: 'sts:AssumeRole'
    Policies:
    - PolicyName: StateMachine
      PolicyDocument:
        Statement:
        - Effect: Allow
          Action: 'iam:PassRole'
          Resource:
          - !GetAtt TaskExecutionRole.Arn
          - !GetAtt TaskRole.Arn
        - Effect: Allow
          Action: 'ecs:RunTask'
          Resource: !Ref TaskDefinition
          Condition:
            ArnEquals:
              'ecs:cluster': !GetAtt Cluster.Arn
        - Effect: Allow
          Action:
          - 'ecs:StopTask'
          - 'ecs:DescribeTasks'
          Resource: '*'
          Condition:
            ArnEquals:
              'ecs:cluster': !GetAtt Cluster.Arn
        - Effect: Allow
          Action:
          - 'events:PutTargets'
          - 'events:PutRule'
          - 'events:DescribeRule'
          Resource: !Sub 'arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/StepFunctionsGetEventsForECSTaskRule'

There is one crucial part missing: the CloudWatch Events Rule, which triggers the state machine based on a schedule. The following snippet shows the rule, which triggers the state machine every hour (see ScheduleExpression).

Rule:
  Type: 'AWS::Events::Rule'
  Properties:
    ScheduleExpression: 'rate(1 hour)'
    State: ENABLED
    Targets:
    - Arn: !Ref StateMachine
      Id: statemachine
      RoleArn: !GetAtt 'RuleRole.Arn'
RuleRole:
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument:
      Statement:
      - Effect: Allow
        Principal:
          Service: 'events.amazonaws.com'
        Action: 'sts:AssumeRole'
    Policies:
    - PolicyName: EventRulePolicy
      PolicyDocument:
        Statement:
        - Effect: Allow
          Action: 'states:StartExecution'
          Resource: !Ref StateMachine

Define a schedule expression in cron or rate style.

A few examples of schedule expressions in cron style:

Schedule Expression	Explanation
`cron(0 6 * * ? *)`	06:00 am (UTC) every day
`cron(0 12 ? * MON-FRI *)`	12:00 pm (UTC) from Monday to Friday
`cron(0 8 1 * ? *)`	08:00 am (UTC) on the first day of every month

And a few examples of schedule expressions in rate style:

Schedule Expression	Explanation
`rate(15 minutes)`	Every 15 minutes
`rate(1 hour)`	Every hour
`rate(14 days)`	Every 14 days

See Schedule Expressions for Rules for more detailed explanations.
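
Two rules of the schedule expression syntax are easy to get wrong: a cron expression needs a ? in exactly one of the day-of-month and day-of-week fields, and a rate expression uses a singular unit for the value 1 and a plural unit otherwise. The following Python sketch checks only these two rules plus the overall shape. It is not an official validator, and check_schedule_expression is a hypothetical helper name chosen for this example.

```python
import re

RATE_PATTERN = re.compile(r"^rate\((\d+) (minute|minutes|hour|hours|day|days)\)$")
CRON_PATTERN = re.compile(r"^cron\((.+)\)$")


def check_schedule_expression(expression):
    """Return a list of problems; an empty list means none were spotted."""
    rate = RATE_PATTERN.match(expression)
    if rate:
        value, unit = int(rate.group(1)), rate.group(2)
        if value < 1:
            return ["rate value must be a positive number"]
        # Singular unit for 1, plural unit for everything else.
        if (value == 1) == unit.endswith("s"):
            return ["use a singular unit for 1 and a plural unit otherwise"]
        return []

    cron = CRON_PATTERN.match(expression)
    if cron:
        fields = cron.group(1).split()
        if len(fields) != 6:
            return ["cron needs six fields: minutes hours day-of-month month day-of-week year"]
        day_of_month, day_of_week = fields[2], fields[4]
        # Exactly one of the two day fields has to be '?'.
        if (day_of_month == "?") == (day_of_week == "?"):
            return ["exactly one of day-of-month and day-of-week must be '?'"]
        return []

    return ["expression must look like rate(...) or cron(...)"]


print(check_schedule_expression("rate(1 hour)"))          # []
print(check_schedule_expression("rate(1 hours)"))         # one problem reported
print(check_schedule_expression("cron(30 9 ? * TUE *)"))  # []
print(check_schedule_expression("cron(30 9 * * TUE *)"))  # one problem reported
```

Running such a check before deploying can catch a typo early, but the rule creation in CloudFormation remains the authoritative validation.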

One more thing before we are done: I highly recommend creating two CloudWatch alarms to monitor failed executions and timeouts.

ExecutionsFailedAlarm:
  Condition: HasAlertingModule # Condition is not defined in this snippet; define it in your template or remove this line
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Failure while executing scheduled task.'
    Namespace: 'AWS/States'
    MetricName: ExecutionsFailed
    Dimensions:
    - Name: StateMachineArn
      Value: !Ref StateMachine
    Statistic: Sum
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 1
    Threshold: 0
    TreatMissingData: notBreaching
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
    - 'arn:aws:sns:eu-west-1:111111111111:alerting' # Replace with SNS topic ARN
ExecutionsTimeoutAlarm:
  Condition: HasAlertingModule # Condition is not defined in this snippet; define it in your template or remove this line
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Executing scheduled task timed out.'
    Namespace: 'AWS/States'
    MetricName: ExecutionsTimedOut
    Dimensions:
    - Name: StateMachineArn
      Value: !Ref StateMachine
    Statistic: Sum
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 1
    Threshold: 0
    TreatMissingData: notBreaching
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
    - 'arn:aws:sns:eu-west-1:111111111111:alerting' # Replace with SNS topic ARN
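
The alarm settings above are compact, so here is a simplified Python model of how a single 300-second period is evaluated. It is a sketch under the assumption of one evaluation period and one datapoint to alarm, as configured above, and it ignores the finer details of how CloudWatch evaluates alarms. The function name alarm_state and the sample data are made up for illustration.

```python
def alarm_state(period_sum, threshold=0):
    """Evaluate one period the way the alarms above are configured.

    period_sum is the Sum of the metric for the period, or None when
    no datapoint was reported at all.
    """
    if period_sum is None:
        # TreatMissingData: notBreaching, so a period without data counts as healthy.
        return "OK"
    # ComparisonOperator: GreaterThanThreshold with Threshold: 0
    return "ALARM" if period_sum > threshold else "OK"


# Hypothetical sequence of periods: one failed execution, then no data, then zero failures.
periods = [1, None, None, 0]
print([alarm_state(p) for p in periods])  # ['ALARM', 'OK', 'OK', 'OK']
```

Since the rule fires only once per hour, many 300-second periods are likely to contain no datapoint at all, which is presumably why TreatMissingData is set to notBreaching. A single failed or timed-out execution is enough to trigger the alarm.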

Summary

That’s all! Run scheduled tasks resiliently with the help of a CloudWatch Events Rule, a Step Functions state machine, and Fargate. Using ECS Scheduled Tasks or, even worse, cron is not an option.

Andreas Wittig

I’ve been building on AWS since 2012 together with my brother Michael. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, HyperEnv for GitHub Actions, and marbot.
