Resilient task scheduling with ECS Fargate
Many applications use scheduled jobs to automate recurring tasks, such as:
- Generating and sending a monthly report.
- Disabling users who haven’t logged in for more than 365 days.
- Deleting stale data from the database.
Doing so is simple as long as an application runs on a single machine. In that case, cron, a time-based job scheduler, does the trick. But what should you do when your workload is running on ECS and Fargate?
Read on to learn why using cron or ECS scheduled tasks is not an option. On top of that, you will get to know an advanced solution for scheduling jobs on ECS and Fargate.
Using cron to trigger scheduled jobs is an obvious but wrong choice.
- It is an anti-pattern to run more than one process per container. Therefore, running your application and cron in one container is a no-go.
- There is no guarantee that ECS runs exactly one copy of your container 24/7. Therefore, using cron inside a container may cause jobs to run multiple times or not at all.
- It is not very cost-effective to run a container on Fargate 24/7 to execute a job a few times per day/week/month.
AWS proposes the following solution in their documentation:
- Open the AWS Management Console.
- Select your ECS cluster.
- Create a Scheduled Task based on a fixed interval or cron-like expression.
Behind the scenes, AWS is creating a CloudWatch event rule which starts an ECS task based on the defined schedule, as shown in the following figure.
I highly recommend against this approach because it lacks an important aspect: monitoring of the scheduled job. It provides neither a way to define a timeout for the scheduled job nor a way to retry a failed job. Instead, ECS scheduled tasks operate in fire-and-forget mode. In summary, this approach is not resilient.
Fortunately, there is a more resilient way to schedule jobs with ECS and Fargate. As shown in the figure below, three components work together to schedule jobs:
- CloudWatch Events Rule: triggers the state machine based on a schedule.
- Step Functions: a state machine orchestrating simple or complex workflows. In this scenario, the state machine starts a container and waits until the container exits.
- Fargate: the computing engine for the container executing the scheduled job.
The state machine monitors the health of the scheduled job. If necessary, the state machine retries a failed job. Additionally, you should define a timeout for a scheduled job. When you do, the state machine stops a scheduled job after it reaches the timeout, which prevents jobs from running endlessly, for example, because of a misconfiguration.
Next, you will learn how to create a CloudWatch Events Rule, Step Functions state machine, and Fargate task definition with the help of CloudFormation.
First, you need to create the basic ECS and Fargate infrastructure consisting of an ECS Cluster, task definition, and security group. I’m skipping the details of how to do so here.
Next, create the state machine. What does the state machine do?
- Create a Fargate task based on the task definition defined before.
- Restart the Fargate task up to 3 times in case of a failure.
- Stop the Fargate task if it has not exited after 600 seconds.
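The three steps above can be sketched as an Amazon States Language definition. The following Python snippet is a minimal sketch, not the exact template from this article: the function name, the state name `RunTask`, the retry interval, the backoff rate, and the `AssignPublicIp` setting are assumptions made for illustration. It relies on the `ecs:runTask.sync` service integration, which waits until the task has stopped.

```python
import json


def build_state_machine_definition(cluster_arn, task_definition_arn,
                                   subnet_ids, security_group_ids):
    """Return a state machine definition (as a dict) that runs one Fargate task.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "Comment": "Runs a scheduled job on Fargate",
        "StartAt": "RunTask",
        "States": {
            "RunTask": {
                "Type": "Task",
                # The .sync suffix makes the state wait until the ECS task stops.
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster_arn,
                    "TaskDefinition": task_definition_arn,
                    "NetworkConfiguration": {
                        "AwsvpcConfiguration": {
                            "Subnets": subnet_ids,
                            "SecurityGroups": security_group_ids,
                            # Assumption: the task runs in a public subnet.
                            "AssignPublicIp": "ENABLED",
                        }
                    },
                },
                # Give up on an attempt that has not exited after 600 seconds.
                "TimeoutSeconds": 600,
                # Retry a failed task up to 3 times.
                # Interval and backoff rate are assumed values.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 10,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                "End": True,
            }
        },
    }


if __name__ == "__main__":
    definition = build_state_machine_definition(
        "arn:aws:ecs:eu-west-1:123456789012:cluster/example",
        "arn:aws:ecs:eu-west-1:123456789012:task-definition/example:1",
        ["subnet-11111111"],
        ["sg-22222222"],
    )
    print(json.dumps(definition, indent=2))
```

The printed JSON is what you would pass as the definition of the state machine, for example in the `DefinitionString` property of an `AWS::StepFunctions::StateMachine` resource in CloudFormation.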
Everything needs an IAM role. The following code snippet shows the IAM role for the state machine.
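As a rough sketch of what such a role could contain, the following Python snippet builds a CloudFormation `AWS::IAM::Role` resource as a dict. The function name, the policy name, and all ARNs are placeholders chosen for illustration. The permissions follow what Step Functions typically needs for the `ecs:runTask.sync` integration: starting, describing, and stopping the task, passing the task's IAM roles, and managing the EventBridge rule that Step Functions uses to track the task.

```python
import json


def build_state_machine_role(task_definition_arn, passable_role_arns,
                             region, account_id):
    """Return an AWS::IAM::Role resource (as a dict) for the state machine."""
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only Step Functions may assume this role.
                "Principal": {"Service": "states.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    permissions = {
        "Version": "2012-10-17",
        "Statement": [
            # Start the Fargate task from the given task definition.
            {"Effect": "Allow", "Action": "ecs:RunTask",
             "Resource": task_definition_arn},
            # Watch the task and stop it, for example after a timeout.
            {"Effect": "Allow",
             "Action": ["ecs:StopTask", "ecs:DescribeTasks"],
             "Resource": "*"},
            # Hand the task role and execution role over to ECS.
            {"Effect": "Allow", "Action": "iam:PassRole",
             "Resource": passable_role_arns},
            # Managed rule that Step Functions uses to learn when the task stops.
            {"Effect": "Allow",
             "Action": ["events:PutTargets", "events:PutRule",
                        "events:DescribeRule"],
             "Resource": (f"arn:aws:events:{region}:{account_id}"
                          ":rule/StepFunctionsGetEventsForECSTaskRule")},
        ],
    }
    return {
        "Type": "AWS::IAM::Role",
        "Properties": {
            "AssumeRolePolicyDocument": trust_policy,
            "Policies": [
                # Hypothetical policy name.
                {"PolicyName": "run-scheduled-job",
                 "PolicyDocument": permissions}
            ],
        },
    }


if __name__ == "__main__":
    role = build_state_machine_role(
        "arn:aws:ecs:eu-west-1:123456789012:task-definition/example:1",
        ["arn:aws:iam::123456789012:role/example-task-role"],
        "eu-west-1",
        "123456789012",
    )
    print(json.dumps(role, indent=2))
```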
There is one crucial part missing: the CloudWatch Events Rule, which triggers the state machine based on a schedule. The following snippet shows the rule, which triggers the state machine every hour (see
Define a schedule expression in cron or rate style.
A few examples of schedule expressions in cron style:
||06:00 am (UTC) every day|
||12:00 am (UTC) from Monday to Friday|
||08:00 am (UTC) every first day of the month|
And more examples of schedule expressions in rate style:
||Every 15 minutes|
||Every 14 days|
See Schedule Expressions for Rules for more detailed explanations.
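To make the rule more concrete, here is a minimal sketch in Python that builds a CloudFormation `AWS::Events::Rule` resource as a dict. The function name, the target ID, and the ARNs are placeholders, and the expressions in `EXAMPLE_EXPRESSIONS` are my own rendering of the descriptions listed above in the six-field cron syntax and the rate syntax that AWS uses. The rule also needs an IAM role that the events service can assume and that allows `states:StartExecution` on the state machine.

```python
import json

# Hypothetical expressions matching the descriptions listed above.
# Cron fields: minutes, hours, day of month, month, day of week, year.
EXAMPLE_EXPRESSIONS = {
    "06:00 am (UTC) every day": "cron(0 6 * * ? *)",
    # Reads "12:00 am" as midnight; use hour 12 instead of 0 for noon.
    "12:00 am (UTC) from Monday to Friday": "cron(0 0 ? * MON-FRI *)",
    "08:00 am (UTC) every first day of the month": "cron(0 8 1 * ? *)",
    "Every 15 minutes": "rate(15 minutes)",
    "Every 14 days": "rate(14 days)",
}


def build_schedule_rule(schedule_expression, state_machine_arn, events_role_arn):
    """Return an AWS::Events::Rule resource (as a dict) that starts the state machine."""
    return {
        "Type": "AWS::Events::Rule",
        "Properties": {
            "ScheduleExpression": schedule_expression,
            "State": "ENABLED",
            "Targets": [
                {
                    "Id": "scheduled-job",  # hypothetical target ID
                    "Arn": state_machine_arn,
                    # Role allowing the events service to start an execution.
                    "RoleArn": events_role_arn,
                }
            ],
        },
    }


if __name__ == "__main__":
    rule = build_schedule_rule(
        "rate(1 hour)",  # every hour; the unit is singular for a value of 1
        "arn:aws:states:eu-west-1:123456789012:stateMachine:example",
        "arn:aws:iam::123456789012:role/example-events-role",
    )
    print(json.dumps(rule, indent=2))
```

To use a different schedule, pass one of the expressions from `EXAMPLE_EXPRESSIONS` instead of `rate(1 hour)`.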
One more thing before we are done. I highly recommend creating two CloudWatch alarms to monitor failed executions and timeouts.
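The two alarms could look roughly like the following sketch, which builds CloudFormation `AWS::CloudWatch::Alarm` resources as Python dicts. It assumes the `ExecutionsFailed` and `ExecutionsTimedOut` metrics that Step Functions publishes in the `AWS/States` namespace. The function name, the five-minute period, the threshold, and the SNS topic used for notifications are illustrative choices, not values taken from this article.

```python
import json


def build_execution_alarm(metric_name, state_machine_arn, topic_arn):
    """Return an AWS::CloudWatch::Alarm resource (as a dict) for one metric."""
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmDescription": f"{metric_name} is above 0 for the scheduled job",
            "Namespace": "AWS/States",
            "MetricName": metric_name,
            "Dimensions": [
                {"Name": "StateMachineArn", "Value": state_machine_arn}
            ],
            "Statistic": "Sum",
            "Period": 300,  # assumed evaluation window in seconds
            "EvaluationPeriods": 1,
            "Threshold": 0,
            # Alarm as soon as a single execution is counted.
            "ComparisonOperator": "GreaterThanThreshold",
            # No data points means no executions, which is not a problem.
            "TreatMissingData": "notBreaching",
            "AlarmActions": [topic_arn],  # hypothetical SNS topic
        },
    }


def build_alarms(state_machine_arn, topic_arn):
    """Return both alarms, keyed by a hypothetical logical resource name."""
    return {
        "FailedAlarm": build_execution_alarm(
            "ExecutionsFailed", state_machine_arn, topic_arn),
        "TimedOutAlarm": build_execution_alarm(
            "ExecutionsTimedOut", state_machine_arn, topic_arn),
    }


if __name__ == "__main__":
    alarms = build_alarms(
        "arn:aws:states:eu-west-1:123456789012:stateMachine:example",
        "arn:aws:sns:eu-west-1:123456789012:example-alerts",
    )
    print(json.dumps(alarms, indent=2))
```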
That’s all! Run scheduled tasks resiliently with the help of a CloudWatch Events Rule, a Step Functions state machine, and Fargate. Using ECS Scheduled Tasks or, even worse, cron is not an option.