Resilient task scheduling with ECS Fargate

Andreas Wittig – 05 Nov 2019

Many applications use scheduled jobs to automate recurring tasks, such as:

  • Generating and sending a monthly report.
  • Disabling users who haven’t logged in for more than 365 days.
  • Deleting stale data from the database.

Scheduled Jobs

Doing so is simple as long as an application runs on a single machine. In that case, cron, a time-based job scheduler, does the trick. But what should you do when your workload is running on ECS and Fargate?

Read on to learn why neither cron nor ECS scheduled tasks are an option. On top of that, you will get to know a more resilient solution for scheduling jobs on ECS and Fargate.

Don’t: Use cron inside a container

Using cron to trigger scheduled jobs is the obvious choice, but the wrong one, for three reasons:

  1. Running more than one process per container is an anti-pattern. Therefore, running your application and cron in one container is a no-go.
  2. ECS does not guarantee that exactly one copy of your container is running 24/7. Therefore, using cron inside a container may cause jobs to run multiple times or not at all.
  3. Running a container on Fargate 24/7 just to execute a job a few times per day, week, or month is not cost-effective.

Don’t: Use ECS scheduled tasks

AWS proposes the following solution in its documentation:

  1. Open the AWS Management Console.
  2. Select your ECS cluster.
  3. Create a Scheduled Task based on a fixed interval or cron-like expression.

Behind the scenes, AWS creates a CloudWatch Events rule that starts an ECS task based on the defined schedule, as shown in the following figure.

A CloudWatch Events Rule starts an ECS task

I strongly recommend against this approach because it is missing an important aspect: monitoring of the scheduled job. The approach provides neither a way to define a timeout for the scheduled job nor a way to retry a failed job. Instead, ECS scheduled tasks operate in fire-and-forget mode. In summary, this approach is not resilient.
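
For illustration, the rule behind an ECS scheduled task corresponds roughly to the following CloudFormation sketch. The logical IDs ScheduledTaskRule and ScheduledTaskRole and the subnet ID are placeholders that do not appear elsewhere in this post; Cluster, TaskDefinition, and SecurityGroup stand for your ECS cluster, task definition, and security group. The target only describes how to start the task. It has no place to say how long the task may run or what should happen when it fails.

```yaml
# Sketch only: logical IDs and the subnet ID are placeholders.
ScheduledTaskRule:
  Type: 'AWS::Events::Rule'
  Properties:
    ScheduleExpression: 'rate(1 hour)'
    State: ENABLED
    Targets:
      - Arn: !GetAtt Cluster.Arn # the target is the ECS cluster itself
        Id: scheduled-task
        # Hypothetical role that allows ecs:RunTask and iam:PassRole
        RoleArn: !GetAtt 'ScheduledTaskRole.Arn'
        EcsParameters:
          TaskDefinitionArn: !Ref TaskDefinition
          TaskCount: 1
          LaunchType: FARGATE
          NetworkConfiguration:
            AwsVpcConfiguration:
              AssignPublicIp: ENABLED
              SecurityGroups:
                - !GetAtt SecurityGroup.GroupId
              Subnets:
                - 'subnet-00000000000000000' # Replace with your subnet
```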

Do: Use Step Functions to start and monitor an ECS task

Fortunately, there is a more resilient way to schedule jobs with ECS and Fargate. As shown in the figure below, three components work together:

  • CloudWatch Events Rule: triggers the state machine based on a schedule.
  • Step Functions: a state machine orchestrating simple or complex workflows. In this scenario, the state machine starts a container and waits until the container exits.
  • Fargate: the computing engine for the container executing the scheduled job.

A CloudWatch Events Rule triggers a Step Function which starts an ECS task

The state machine monitors the scheduled job and, if necessary, retries a failed job. Additionally, you should define a timeout for a scheduled job. The state machine will then stop the job after reaching the timeout, which prevents jobs from running endlessly, for example, because of a misconfiguration.

Next, you will learn how to create a CloudWatch Events Rule, Step Functions state machine, and Fargate task definition with the help of CloudFormation.

First, you need to create the basic ECS and Fargate infrastructure consisting of an ECS Cluster, task definition, and security group. I’m skipping the details of how to do so here.

Cluster:
  Type: 'AWS::ECS::Cluster'
  Properties:
    # [...]
TaskDefinition:
  Type: 'AWS::ECS::TaskDefinition'
  Properties:
    # [...]
SecurityGroup:
  Type: 'AWS::EC2::SecurityGroup'
  Properties:
    # [...]

Next, create the state machine. What does the state machine do?

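  • Start a Fargate task based on the task definition defined before and wait until it exits.
  • Retry a failed Fargate task up to 3 times, waiting 10, 20, and 40 seconds before the respective retry.
  • Stop the Fargate task if the execution has not finished after 600 seconds. The timeout applies to the whole execution, retries included.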

StateMachine:
  Type: 'AWS::StepFunctions::StateMachine'
  Properties:
    DefinitionString: !Sub
      - |
        {
          "Version": "1.0",
          "Comment": "Run ECS/Fargate tasks",
          "TimeoutSeconds": ${Timeout},
          "StartAt": "RunTask",
          "States": {
            "RunTask": {
              "Type": "Task",
              "Resource": "arn:aws:states:::ecs:runTask.sync",
              "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "${Cluster}",
                "TaskDefinition": "${TaskDefinition}",
                "NetworkConfiguration": {
                  "AwsvpcConfiguration": {
                    "Subnets": ["${SubnetA}", "${SubnetB}"],
                    "AssignPublicIp": "${AssignPublicIp}",
                    "SecurityGroups": ${SecurityGroups}
                  }
                }
              },
              "Retry": [
                {
                  "ErrorEquals": [
                    "States.TaskFailed"
                  ],
                  "IntervalSeconds": 10,
                  "MaxAttempts": 3,
                  "BackoffRate": 2
                }
              ],
              "End": true
            }
          }
        }
      - Cluster: !GetAtt Cluster.Arn
        TaskDefinition: !Ref TaskDefinition
        SubnetA: 'subnet-00680a8cf7dff9021' # Replace with your subnet
        SubnetB: 'subnet-023d074f51ea412d0' # Replace with your subnet
        AssignPublicIp: 'ENABLED'
        SecurityGroups: !Sub '["${SecurityGroup.GroupId}"]'
        Timeout: 600 # 10 minutes
    RoleArn: !GetAtt 'StateMachineRole.Arn'

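The state machine needs an IAM role. The following code snippet shows that role. It allows the state machine to pass the task's IAM roles to ECS, to run, stop, and describe tasks on the cluster, and to manage the CloudWatch Events rule that Step Functions uses to track the task when waiting for it to exit.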

StateMachineRole:
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: 'states.amazonaws.com'
          Action: 'sts:AssumeRole'
    Policies:
      - PolicyName: StateMachine
        PolicyDocument:
          Statement:
            - Effect: Allow
              Action: 'iam:PassRole'
              Resource:
                - !GetAtt TaskExecutionRole.Arn
                - !GetAtt TaskRole.Arn
            - Effect: Allow
              Action: 'ecs:RunTask'
              Resource: !Ref TaskDefinition
              Condition:
                ArnEquals:
                  'ecs:cluster': !GetAtt Cluster.Arn
            - Effect: Allow
              Action:
                - 'ecs:StopTask'
                - 'ecs:DescribeTasks'
              Resource: '*'
              Condition:
                ArnEquals:
                  'ecs:cluster': !GetAtt Cluster.Arn
            - Effect: Allow
              Action:
                - 'events:PutTargets'
                - 'events:PutRule'
                - 'events:DescribeRule'
              Resource: !Sub 'arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/StepFunctionsGetEventsForECSTaskRule'
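
The policy above references two roles, TaskExecutionRole and TaskRole, that are not shown in this post. The following is a minimal sketch of how they might look, not the original definitions. It assumes that the task only needs to pull its image and write logs, which the AWS managed policy AmazonECSTaskExecutionRolePolicy covers, and that the job itself needs no AWS permissions yet.

```yaml
# Sketch only: adjust both roles to what your job actually needs.
TaskExecutionRole: # used by ECS to pull the image and write logs
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: 'ecs-tasks.amazonaws.com'
          Action: 'sts:AssumeRole'
    ManagedPolicyArns:
      - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'
TaskRole: # assumed by the job running inside the container
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: 'ecs-tasks.amazonaws.com'
          Action: 'sts:AssumeRole'
    # Add a Policies section here with the permissions your job needs.
```

Both roles would then be wired into the task definition through its ExecutionRoleArn and TaskRoleArn properties.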

There is one crucial part missing: the CloudWatch Events Rule, which triggers the state machine based on a schedule. The following snippet shows the rule, which triggers the state machine every hour (see ScheduleExpression).

Rule:
  Type: 'AWS::Events::Rule'
  Properties:
    ScheduleExpression: 'rate(1 hour)'
    State: ENABLED
    Targets:
      - Arn: !Ref StateMachine
        Id: statemachine
        RoleArn: !GetAtt 'RuleRole.Arn'
RuleRole:
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument:
      Statement:
        - Effect: Allow
          Principal:
            Service: 'events.amazonaws.com'
          Action: 'sts:AssumeRole'
    Policies:
      - PolicyName: EventRulePolicy
        PolicyDocument:
          Statement:
            - Effect: Allow
              Action: 'states:StartExecution'
              Resource: !Ref StateMachine

Define the schedule expression in cron or rate style.

A few examples for schedule expressions in cron style:

Schedule Expression         Explanation
cron(0 6 * * ? *)           06:00 am (UTC) every day
cron(0 12 ? * MON-FRI *)    12:00 pm (UTC) from Monday to Friday
cron(0 8 1 * ? *)           08:00 am (UTC) every first day of the month

And more examples for schedule expressions in rate style:

Schedule Expression    Explanation
rate(15 minutes)       Every 15 minutes
rate(1 hour)           Every hour
rate(14 days)          Every 14 days

See Schedule Expressions for Rules for more detailed explanations.
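
To change the schedule without editing the rule itself, one option is to pass the expression in as a template parameter. The following sketch assumes a parameter named ScheduleExpression, which is a made-up name and not part of the snippets above. StateMachine and RuleRole refer to the resources defined earlier.

```yaml
# Sketch only: the parameter name and its default are assumptions.
Parameters:
  ScheduleExpression:
    Description: 'Schedule in cron or rate style.'
    Type: String
    Default: 'cron(0 6 * * ? *)' # 06:00 am (UTC) every day
Resources:
  Rule:
    Type: 'AWS::Events::Rule'
    Properties:
      ScheduleExpression: !Ref ScheduleExpression
      State: ENABLED
      Targets:
        - Arn: !Ref StateMachine
          Id: statemachine
          RoleArn: !GetAtt 'RuleRole.Arn'
```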

One more thing before we are done: I highly recommend creating two CloudWatch alarms to monitor failed executions and timeouts.

ExecutionsFailedAlarm:
  Condition: HasAlertingModule
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Failure while executing scheduled task.'
    Namespace: 'AWS/States'
    MetricName: ExecutionsFailed
    Dimensions:
      - Name: StateMachineArn
        Value: !Ref StateMachine
    Statistic: Sum
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 1
    Threshold: 0
    TreatMissingData: notBreaching
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - 'arn:aws:sns:eu-west-1:111111111111:alerting' # Replace with SNS topic ARN
ExecutionsTimeoutAlarm:
  Condition: HasAlertingModule
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Executing scheduled task timed out.'
    Namespace: 'AWS/States'
    MetricName: ExecutionsTimedOut
    Dimensions:
      - Name: StateMachineArn
        Value: !Ref StateMachine
    Statistic: Sum
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 1
    Threshold: 0
    TreatMissingData: notBreaching
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - 'arn:aws:sns:eu-west-1:111111111111:alerting' # Replace with SNS topic ARN
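
Both alarms depend on a condition named HasAlertingModule, which is not defined in the snippets of this post. One possible way to define it, assuming a hypothetical parameter AlertingTopicArn that is left empty when no SNS topic for alerts exists:

```yaml
# Sketch only: the parameter AlertingTopicArn is an assumption.
Parameters:
  AlertingTopicArn:
    Description: 'Optional ARN of the SNS topic receiving alarms.'
    Type: String
    Default: ''
Conditions:
  # True only if a topic ARN was passed to the stack.
  HasAlertingModule: !Not [!Equals [!Ref AlertingTopicArn, '']]
```

With such a parameter in place, the hard-coded topic ARN under AlarmActions could be replaced with !Ref AlertingTopicArn.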

Summary

That’s all! Run scheduled tasks resiliently with the help of a CloudWatch Events Rule, a Step Functions state machine, and Fargate. Using ECS scheduled tasks or, even worse, cron is not an option.

Are you looking for an in-depth code example? Our book and online seminar Rapid Docker on AWS cover scheduling jobs on Fargate as well. Get your copy of the ebook or register for the online seminar now.
