Resilient task scheduling with ECS Fargate
Many applications use scheduled jobs to automate recurring tasks, such as:
- Generating and sending a monthly report.
- Disabling users who haven’t logged in for more than 365 days.
- Deleting stale data from the database.
Doing so is simple as long as an application runs on a single machine. In that case, cron, a time-based job scheduler, does the trick. But what should you do when your workload is running on ECS and Fargate?
Read on to learn why using cron or ECS scheduled tasks is not an option. On top of that, you will get to know an advanced solution for scheduling jobs on ECS and Fargate.
Don’t: Use cron inside a container
It is an obvious but wrong choice to use cron to trigger scheduled jobs.
- It is an anti-pattern to run more than one process per container. Therefore, running your application and cron in one container is a no-go.
- There is no guarantee that ECS is running your container 24/7 exactly once. Therefore, using cron inside a container may cause jobs to run multiple times or not at all.
- It is not cost-effective to run a container on Fargate 24/7 just to execute a job a few times per day, week, or month.
Don’t: Use ECS scheduled tasks
AWS proposes the following solution in their documentation:
- Open the AWS Management Console.
- Select your ECS cluster.
- Create a Scheduled Task based on a fixed interval or cron-like expression.
Behind the scenes, AWS is creating a CloudWatch event rule which starts an ECS task based on the defined schedule, as shown in the following figure.
I strongly recommend against this approach because it is missing an important aspect: monitoring of the scheduled job. It provides no way to define a timeout for the scheduled job, and it does not retry a failed job. Instead, ECS scheduled tasks operate in fire-and-forget mode. In summary, this approach is not resilient.
Do: Use Step Functions to start and monitor an ECS task
Fortunately, there is a more resilient solution to schedule jobs with ECS and Fargate. As shown in the figure below, three components work together to schedule jobs:
- CloudWatch Events Rule: triggers the state machine based on a schedule.
- Step Functions: a state machine orchestrating simple or complex workflows. In this scenario, the state machine starts a container and waits until the container exits.
- Fargate: the computing engine for the container executing the scheduled job.
The state machine is monitoring the health of the scheduled job. If necessary, the state machine will retry a failed job. Additionally, you should define a timeout for a scheduled job. When doing so, the state machine will stop a scheduled job after reaching the timeout to avoid running jobs endlessly, for example, in case of a misconfiguration.
Next, you will learn how to create a CloudWatch Events Rule, Step Functions state machine, and Fargate task definition with the help of CloudFormation.
First, you need to create the basic ECS and Fargate infrastructure consisting of an ECS Cluster, task definition, and security group. I’m skipping the details of how to do so here.
Cluster: |
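A minimal sketch of such a base infrastructure could look like the following. The parameter `VpcId`, the container name `job`, the placeholder image `alpine:3`, and the CPU and memory sizes are assumptions made for illustration, not values from the original template. A real job that pulls from ECR or writes to CloudWatch Logs would additionally need an execution role.

```yaml
Parameters:
  VpcId: # assumption: the VPC is passed in as a parameter
    Type: 'AWS::EC2::VPC::Id'
Resources:
  Cluster:
    Type: 'AWS::ECS::Cluster'
    Properties: {}
  TaskDefinition:
    Type: 'AWS::ECS::TaskDefinition'
    Properties:
      RequiresCompatibilities: [FARGATE]
      NetworkMode: awsvpc # Fargate requires the awsvpc network mode
      Cpu: '256' # assumption: smallest Fargate size
      Memory: '512'
      ContainerDefinitions:
      - Name: job # hypothetical container name
        Image: 'alpine:3' # placeholder image, replace with your job
        Command: ['sh', '-c', 'echo "running scheduled job"']
        Essential: true
  SecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: 'Security group for the scheduled job'
      VpcId: !Ref VpcId
```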
Next, create the state machine. What does the state machine do?
- Create a Fargate task based on the task definition defined before.
- Restart the Fargate task up to 3 times in case of a failure.
- Stop the Fargate task if it has not exited after 600 seconds.
StateMachine: |
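The three steps above can be sketched as a single Task state that uses the `ecs:runTask.sync` service integration, which waits until the Fargate task exits. This is a sketch under assumptions: the resources `Cluster`, `TaskDefinition`, `SecurityGroup`, and `StateMachineRole` as well as the parameter `SubnetId` are expected to exist elsewhere in the template, the state name `RunJob` is made up, and the public IP setting assumes a public subnet.

```yaml
Resources:
  StateMachine:
    Type: 'AWS::StepFunctions::StateMachine'
    Properties:
      RoleArn: !GetAtt 'StateMachineRole.Arn'
      DefinitionString: !Sub |
        {
          "StartAt": "RunJob",
          "States": {
            "RunJob": {
              "Type": "Task",
              "Resource": "arn:${AWS::Partition}:states:::ecs:runTask.sync",
              "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "${Cluster.Arn}",
                "TaskDefinition": "${TaskDefinition}",
                "NetworkConfiguration": {
                  "AwsvpcConfiguration": {
                    "Subnets": ["${SubnetId}"],
                    "SecurityGroups": ["${SecurityGroup.GroupId}"],
                    "AssignPublicIp": "ENABLED"
                  }
                }
              },
              "TimeoutSeconds": 600,
              "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 10,
                "MaxAttempts": 3,
                "BackoffRate": 2.0
              }],
              "End": true
            }
          }
        }
```

`MaxAttempts` covers the "up to 3 times" retry, and `TimeoutSeconds` covers the 600-second limit. Retrying only on `States.TaskFailed` is a design choice in this sketch: a job that hits the timeout is not started again.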
Everything needs an IAM role. The following code snippet shows the IAM role for the state machine.
StateMachineRole: |
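As a rough sketch, such a role could grant the permissions that the `ecs:runTask.sync` integration typically needs: starting, describing, and stopping the task, plus managing the EventBridge rule that Step Functions uses to track task completion. The policy name `run-task` is made up, and `TaskDefinition` is assumed to be defined elsewhere in the template. If the task definition references a task role or execution role, `iam:PassRole` for those roles would be needed as well.

```yaml
Resources:
  StateMachineRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service: 'states.amazonaws.com'
          Action: 'sts:AssumeRole'
      Policies:
      - PolicyName: 'run-task' # hypothetical policy name
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
          - Effect: Allow
            Action: 'ecs:RunTask'
            Resource: !Ref TaskDefinition # Ref returns the task definition ARN
          - Effect: Allow
            Action:
            - 'ecs:StopTask'
            - 'ecs:DescribeTasks'
            Resource: '*'
          - Effect: Allow # lets Step Functions track when the task has stopped
            Action:
            - 'events:PutTargets'
            - 'events:PutRule'
            - 'events:DescribeRule'
            Resource: !Sub 'arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/StepFunctionsGetEventsForECSTaskRule'
```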
There is one crucial part missing: the CloudWatch Events Rule, which triggers the state machine based on a schedule. The following snippet shows the rule which will trigger the state machine every hour (see ScheduleExpression).
Rule: |
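A rule of this kind might be sketched as follows, assuming a `StateMachine` resource defined elsewhere in the template. The role `RuleRole`, the target id `statemachine`, and the policy name are hypothetical names introduced here. The rule needs that role to be allowed to start executions of the state machine.

```yaml
Resources:
  Rule:
    Type: 'AWS::Events::Rule'
    Properties:
      ScheduleExpression: 'rate(1 hour)' # trigger every hour
      State: ENABLED
      Targets:
      - Arn: !Ref StateMachine # Ref returns the state machine ARN
        Id: statemachine # hypothetical target id
        RoleArn: !GetAtt 'RuleRole.Arn'
  RuleRole: # hypothetical role assumed by the rule
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service: 'events.amazonaws.com'
          Action: 'sts:AssumeRole'
      Policies:
      - PolicyName: 'start-execution' # hypothetical policy name
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
          - Effect: Allow
            Action: 'states:StartExecution'
            Resource: !Ref StateMachine
```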
Define a schedule expression in cron or rate style.
A few examples of schedule expressions in cron style:
Schedule Expression | Explanation |
---|---|
cron(0 6 * * ? *) | 06:00 am (UTC) every day |
cron(0 12 ? * MON-FRI *) | 12:00 pm (UTC) from Monday to Friday |
cron(0 8 1 * ? *) | 08:00 am (UTC) every first day of the month |
Note that a cron expression cannot specify both the day-of-month and the day-of-week field: when one of them holds a value or `*`, the other must be `?`.
And more examples of schedule expressions in rate style:
Schedule Expression | Explanation |
---|---|
rate(15 minutes) | Every 15 minutes |
rate(1 hour) | Every hour |
rate(14 days) | Every 14 days |
See Schedule Expressions for Rules for more detailed explanations.
One more thing before we are done: I highly recommend creating two CloudWatch alarms to monitor failed executions and timeouts.
ExecutionsFailedAlarm: |
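The two alarms could be sketched like this, based on the `ExecutionsFailed` and `ExecutionsTimedOut` metrics that Step Functions publishes in the `AWS/States` namespace. The name `ExecutionsTimedOutAlarm`, the 5-minute period, and the threshold are assumptions, and `StateMachine` is expected to be defined elsewhere in the template. No `AlarmActions` are set here, so in practice you would add an SNS topic or similar to get notified.

```yaml
Resources:
  ExecutionsFailedAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmDescription: 'Scheduled job failed, even after retries'
      Namespace: 'AWS/States'
      MetricName: ExecutionsFailed
      Dimensions:
      - Name: StateMachineArn
        Value: !Ref StateMachine
      Statistic: Sum
      Period: 300 # assumption: evaluate in 5-minute windows
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold # alarm on the first failure
      TreatMissingData: notBreaching # no data means no execution failed
  ExecutionsTimedOutAlarm: # hypothetical name for the second alarm
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmDescription: 'Scheduled job ran into the timeout'
      Namespace: 'AWS/States'
      MetricName: ExecutionsTimedOut
      Dimensions:
      - Name: StateMachineArn
        Value: !Ref StateMachine
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
```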
Summary
That’s all! Run scheduled tasks resiliently with the help of a CloudWatch Events Rule, a Step Functions state machine, and Fargate. Using ECS Scheduled Tasks or, even worse, cron is not an option.