AWS Step Functions: How to Orchestrate Workflows Waiting for 3rd Parties
My go-to service for automating workflows in serverless applications is AWS Step Functions. Recently, I was working on an enhancement for HyperEnv, our solution to deploy self-hosted GitHub runners on AWS with ease. The challenge was to track the status of GitHub jobs with the serverless backend of HyperEnv. In this post, I share an architecture that you can use to orchestrate workflows that need to wait for 3rd parties.

The challenge
GitHub webhooks enable applications to subscribe to events, such as the workflow_job event, which reports when a job is created, starts running, or completes. To ensure HyperEnv provides a self-hosted runner for every job, even in case of failures, HyperEnv needs to keep track of every GitHub job. Doing so allows HyperEnv, for example, to retry launching a self-hosted runner if a job does not start running within 5 minutes.

The solution
The following diagram illustrates the solution. Let me walk you through the architecture diagram from left to right.

First, an API Gateway receives the GitHub webhook events via HTTP and invokes the Lambda function Webhook.
Next, the Lambda function Webhook starts the execution of the state machine Job State Machine.
The following screenshot shows the state machine in more detail.
- Start is where the state machine starts.
- Queued is the initial state of a GitHub job.
- InProgress indicates that the GitHub job is running.
- Completed means that the GitHub job finished.
- End is the end of the state machine.

Then, the state machine Job State Machine invokes the Lambda function Job State by using the Wait for Callback integration (see Wait for a Callback with Task Token).
The Lambda function Job State creates an item in the DynamoDB table Job State and persists the task token needed to resolve the callback.
When GitHub later sends a webhook event with a status update, the process continues.
The API Gateway receives the event. Then, the Lambda function Webhook queries the DynamoDB table Job State and fetches the task token. With the task token, the Lambda function Webhook sends a task success signal to the state machine Job State Machine, which resolves the wait in the Queued task and lets the execution continue with the next task, InProgress.
In the next section, I will share implementation details of the architecture.
Deep Dive: Step Functions and Lambda with Wait for Callback
The following listing shows an excerpt of the state machine’s definition.
The Lambda function is called with Wait for Callback (see arn:aws:states:::lambda:invoke.waitForTaskToken).
The task token is needed to send a success signal as soon as the state is done. Therefore, $states.context.Task.Token is added to the payload of the Lambda function invocation.
{
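As a minimal sketch of what such a definition could look like, the following helper builds the three states as a plain object that can be serialized to JSON. It assumes the JSONata query language (which is where $states.context.Task.Token comes from). The payload fields task, task_token, and job_id, the helper names, and the choice to invoke the Completed task without a callback are assumptions for illustration, not HyperEnv's actual definition.

```javascript
// Hypothetical helper: builds an Amazon States Language definition (JSONata mode).
// The payload field names and the shape of the Completed task are assumptions.
function buildJobStateMachineDefinition(jobStateFunctionArn) {
  const invokeJobState = ({ task, waitForCallback, next }) => {
    const payload = {
      task,
      // Read the job ID from the execution input so it is available in every state.
      job_id: '{% $states.context.Execution.Input.job_id %}',
    };
    if (waitForCallback) {
      // The task token is only available when the task waits for a callback.
      payload.task_token = '{% $states.context.Task.Token %}';
    }
    return {
      Type: 'Task',
      Resource: waitForCallback
        ? 'arn:aws:states:::lambda:invoke.waitForTaskToken'
        : 'arn:aws:states:::lambda:invoke',
      Arguments: { FunctionName: jobStateFunctionArn, Payload: payload },
      ...(next ? { Next: next } : { End: true }),
    };
  };

  return {
    QueryLanguage: 'JSONata',
    StartAt: 'Queued',
    States: {
      Queued: invokeJobState({ task: 'queued', waitForCallback: true, next: 'InProgress' }),
      InProgress: invokeJobState({ task: 'in_progress', waitForCallback: true, next: 'Completed' }),
      Completed: invokeJobState({ task: 'completed', waitForCallback: false, next: null }),
    },
  };
}
```

Calling JSON.stringify on the returned object yields a definition string that could be passed to the tool you use to deploy the state machine.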
Next, let’s take a look at a possible implementation of the Lambda function Job State.
Depending on the event.task property, the function updates an item in the DynamoDB table.
The expire_at attribute is used to define a time-to-live for the DynamoDB item to ensure the stored state gets deleted from the table automatically.
The task token is stored in the item’s attribute task_token_queued or task_token_in_progress.
import { DynamoDBClient, UpdateItemCommand } from '@aws-sdk/client-dynamodb';
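To make the update more concrete, here is a minimal sketch of how the input for UpdateItemCommand could be assembled. The key attribute job_id, the status attribute, the event fields job_id and task_token, and the one-day retention are assumptions; only expire_at, task_token_queued, and task_token_in_progress are taken from the description above.

```javascript
// Assumed retention: delete the item roughly one day after the last update.
const TTL_SECONDS = 24 * 60 * 60;

// Maps the task name from the event to the attribute that stores its token.
const TOKEN_ATTRIBUTES = new Map([
  ['queued', 'task_token_queued'],
  ['in_progress', 'task_token_in_progress'],
]);

// Hypothetical helper: returns the input object for an UpdateItemCommand.
function buildUpdateItemInput(tableName, event, nowMs = Date.now()) {
  const tokenAttribute = TOKEN_ATTRIBUTES.get(event.task);
  if (!tokenAttribute) {
    throw new Error(`Unsupported task: ${event.task}`);
  }
  // DynamoDB TTL expects a Number attribute holding epoch seconds.
  const expireAt = Math.floor(nowMs / 1000) + TTL_SECONDS;
  return {
    TableName: tableName,
    Key: { job_id: { S: String(event.job_id) } },
    UpdateExpression: 'SET #status = :status, #token = :token, #expire_at = :expire_at',
    // Placeholders avoid clashes with DynamoDB reserved words such as "status".
    ExpressionAttributeNames: {
      '#status': 'status',
      '#token': tokenAttribute,
      '#expire_at': 'expire_at',
    },
    ExpressionAttributeValues: {
      ':status': { S: event.task },
      ':token': { S: event.task_token },
      ':expire_at': { N: String(expireAt) },
    },
  };
}
```

In the handler, the result would be wrapped in new UpdateItemCommand(...) and sent with a DynamoDBClient. The function returns without calling SendTaskSuccess, so the state machine keeps waiting until the Webhook function resolves the callback.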
Finally, let’s take a look at the implementation of the Lambda function Webhook.
The following snippet shows how the Lambda function Webhook starts the execution of the state machine Job State Machine for each new GitHub job.
if (body.action === 'queued') {
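A minimal sketch of this step could build the input for the Step Functions StartExecution call from the webhook body. The helper name, the job_id input field, and the idea of naming the execution after the job ID are assumptions; body.action and body.workflow_job.id come from GitHub's workflow_job webhook payload.

```javascript
// Hypothetical helper: returns the StartExecution input for a newly queued job,
// or null when the webhook event is not about a new job.
function buildStartExecutionInput(stateMachineArn, body) {
  if (body.action !== 'queued') {
    return null;
  }
  const jobId = String(body.workflow_job.id);
  return {
    stateMachineArn,
    // Assumption: one execution per job, named after the job ID.
    name: `job-${jobId}`,
    // The execution input must be a JSON string.
    input: JSON.stringify({ job_id: jobId }),
  };
}
```

Naming the execution after the job is a design choice, not a requirement. For standard workflows, starting an execution with the same name and input as a running one does not create a duplicate, which can help when GitHub redelivers a webhook.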
Next, when GitHub sends a webhook indicating that the job changed its status from queued to in_progress, the Lambda function executes the following code.
- Fetch the state of the GitHub job from the DynamoDB table Job State.
- Use the task token task_token_queued to send a task success notification to the state machine.
if (body.action === 'in_progress') {
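The two steps can be sketched as follows, assuming the item was fetched with GetItemCommand and is therefore in DynamoDB's attribute-value format. The helper name and the shape of the output are assumptions; taskToken and output are the parameters of the Step Functions SendTaskSuccess API.

```javascript
// Hypothetical helper: builds the SendTaskSuccess input that resolves the Queued task.
function buildSendTaskSuccessInput(item, body) {
  // The token was stored by the Job State function when the Queued task started.
  const taskToken = item?.task_token_queued?.S;
  if (!taskToken) {
    throw new Error('No task token stored for the Queued task');
  }
  return {
    taskToken,
    // The output must be a JSON string and becomes the result of the waiting task.
    output: JSON.stringify({
      job_id: String(body.workflow_job.id),
      status: body.action,
    }),
  };
}
```

The result would be wrapped in a SendTaskSuccessCommand and sent with an SFNClient from @aws-sdk/client-sfn. Throwing when the token is missing is one option. Depending on how webhook deliveries are ordered, retrying later may be more appropriate.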
The process is similar for the transition from in_progress to completed.
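Since both transitions follow the same pattern, one way to avoid duplicating the logic is a small lookup from the webhook action to the attribute holding the token of the task that is currently waiting. This is a sketch under the assumption that the item is in DynamoDB's attribute-value format; the names are hypothetical.

```javascript
// Each incoming action resolves the task that was started by the previous status.
const WAITING_TOKEN_BY_ACTION = new Map([
  ['in_progress', 'task_token_queued'],
  ['completed', 'task_token_in_progress'],
]);

// Hypothetical helper: returns the task token to resolve, or null if none applies.
function findWaitingTaskToken(action, item) {
  const attribute = WAITING_TOKEN_BY_ACTION.get(action);
  if (!attribute) {
    return null; // e.g. "queued" starts a new execution instead
  }
  return item?.[attribute]?.S ?? null;
}
```

Returning null leaves it to the caller to decide how to handle events that have no matching token, such as an unknown action or an item that has not been written yet.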
Summary & Feedback
To track the progress of a workflow depending on 3rd parties, the following patterns are helpful when using AWS Step Functions.
- Use Wait for Callback mode to invoke Lambda functions. Store the task token in DynamoDB. Optionally call the 3rd party. Then wait for a response from the 3rd party.
- As AWS Step Functions does not provide a way to get information about the current task (a.k.a. state), store information about the current state in DynamoDB.
I hope following along with my solution helps you to come up with a suitable approach for your use case. In case you found another approach, please share it with me. I’m happy to learn from you!
