How to monitor container workloads running on ECS and Fargate?

Andreas Wittig – 18 Oct 2023

How do you monitor a container workload running on ECS (Elastic Container Service) and Fargate using only the tools that AWS provides? Here are the aspects of container monitoring on AWS, in order of priority.

- Event-driven monitoring with EventBridge
- Monitoring entry points like ALB, SQS, and Kinesis
- Monitoring inter-service communication (Service Connect)
- Observing container utilization
- Collecting and analyzing container logs

Event-driven monitoring with EventBridge

Most importantly, ensure that you are not missing ECS failure events. Like many AWS services, ECS sends events to EventBridge. Monitoring those events by creating EventBridge rules is crucial to get informed about container-related issues.

For example, the following pattern filters events indicating that an ECS task stopped because one of the essential containers exited with an error. The group filter limits the pattern to tasks that do not belong to an ECS service.

{
  "source": [ 
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Task State Change"
  ],
  "detail": {
    "group": [{"anything-but": {"prefix": "service:"}}],
    "lastStatus": ["STOPPED"],
    "stopCode": ["EssentialContainerExited"],
    "containers": {
      "exitCode": [{"anything-but": 0}]
    }
  }
}
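Before deploying a rule, it can help to check the pattern against a sample event. The following Python sketch assumes that boto3 is installed and AWS credentials are configured. The sample event is hand-written and abbreviated (the account ID, region, and task group are placeholders), and `matches_pattern` is a hypothetical helper that wraps the EventBridge TestEventPattern API.

```python
import json

# The event pattern from above, unchanged.
TASK_FAILED_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "group": [{"anything-but": {"prefix": "service:"}}],
        "lastStatus": ["STOPPED"],
        "stopCode": ["EssentialContainerExited"],
        "containers": {"exitCode": [{"anything-but": 0}]},
    },
}


def sample_event(exit_code: int) -> dict:
    """Build an abbreviated, hand-written ECS task event (placeholder IDs)."""
    return {
        "id": "00000000-0000-0000-0000-000000000000",
        "account": "123456789012",
        "source": "aws.ecs",
        "time": "2023-10-18T00:00:00Z",
        "region": "eu-west-1",
        "resources": [],
        "detail-type": "ECS Task State Change",
        "detail": {
            "group": "family:example-task",
            "lastStatus": "STOPPED",
            "stopCode": "EssentialContainerExited",
            "containers": [{"name": "app", "exitCode": exit_code}],
        },
    }


def matches_pattern(pattern: dict, event: dict) -> bool:
    """Ask EventBridge whether the pattern matches (needs AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs offline

    response = boto3.client("events").test_event_pattern(
        EventPattern=json.dumps(pattern),
        Event=json.dumps(event),
    )
    return response["Result"]
```

Calling `matches_pattern(TASK_FAILED_PATTERN, sample_event(1))` is expected to report a match, while an event with exit code 0 is expected not to match.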

Besides that, an EventBridge rule with the following pattern will watch for failed ECS deployments.

{
  "source": [ 
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Deployment State Change"
  ],
  "detail": {
    "eventName": [
      "SERVICE_DEPLOYMENT_FAILED"
    ]
  }
}
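A pattern alone does not notify anyone; it needs a rule and a target. The following Python sketch shows one way to create both with boto3, assuming an existing SNS topic as the target. The rule name, the target ID, and the topic ARN are placeholders, and `rule_requests` and `deploy_rule` are hypothetical helper names.

```python
import json

# The deployment failure pattern from above, unchanged.
DEPLOYMENT_FAILED_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Deployment State Change"],
    "detail": {"eventName": ["SERVICE_DEPLOYMENT_FAILED"]},
}


def rule_requests(rule_name: str, sns_topic_arn: str) -> tuple[dict, dict]:
    """Build the arguments for put_rule and put_targets without calling AWS."""
    put_rule = {
        "Name": rule_name,
        "Description": "Alert on failed ECS deployments",
        "EventPattern": json.dumps(DEPLOYMENT_FAILED_PATTERN),
        "State": "ENABLED",
    }
    put_targets = {
        "Rule": rule_name,
        "Targets": [{"Id": "notify-sns", "Arn": sns_topic_arn}],
    }
    return put_rule, put_targets


def deploy_rule(rule_name: str, sns_topic_arn: str) -> None:
    """Create the rule and attach the SNS topic (needs AWS credentials)."""
    import boto3  # imported lazily so rule_requests can be used offline

    events = boto3.client("events")
    put_rule, put_targets = rule_requests(rule_name, sns_topic_arn)
    events.put_rule(**put_rule)
    events.put_targets(**put_targets)
```

Note that the SNS topic's access policy has to allow events.amazonaws.com to publish; otherwise the rule matches but no notification is delivered.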

On top of that, use EventBridge rules to monitor tasks that fail to start, failed ECS service actions, and ECS tasks stopped by a Fargate Spot interruption.
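The patterns for these additional cases could look roughly like the following Python sketch. The field values (`TaskFailedToStart`, `SpotInterruption`, and the `eventType` levels of service action events) reflect my reading of the ECS event documentation and are not taken from this article, so verify them against real events before relying on them. The dictionary keys are hypothetical rule names.

```python
import json

# Assumed patterns; field values should be verified against real ECS events.
ADDITIONAL_PATTERNS = {
    "ecs-task-failed-to-start": {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {
            "lastStatus": ["STOPPED"],
            "stopCode": ["TaskFailedToStart"],
        },
    },
    "ecs-service-action-failed": {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Service Action"],
        "detail": {"eventType": ["ERROR", "WARN"]},
    },
    "ecs-fargate-spot-interruption": {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {"stopCode": ["SpotInterruption"]},
    },
}


def serialized_patterns() -> dict:
    """Return each pattern as the JSON string that EventBridge rules expect."""
    return {name: json.dumps(pattern) for name, pattern in ADDITIONAL_PATTERNS.items()}
```

Keeping one rule per failure mode makes it easier to route, mute, or tune each kind of alert separately.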

marbot - our AWS monitoring solution - deploys the necessary EventBridge rules to all your AWS accounts automatically and delivers alerts or notifications to Slack or Microsoft Teams.

Monitoring entry points like ALB, SQS, and Kinesis

Monitoring ECS events in real time is a good start, but monitoring entry points like the ALB (Application Load Balancer) or SQS (Simple Queue Service) is essential as well.

To do so, create CloudWatch alarms monitoring the following metrics.

ALB
- HTTPCode_ELB_5XX_Count to monitor 5XX errors that the ALB itself generates and sends to the client.
- TargetResponseTime to monitor the response latency.
SQS
- ApproximateAgeOfOldestMessage to monitor for messages that are not getting processed.
- ApproximateNumberOfMessagesVisible to monitor for messages piling up in the queue.
Kinesis Data Streams
- GetRecords.IteratorAgeMilliseconds to monitor for shards not getting processed.

Monitoring those metrics ensures that you get notified as soon as users experience issues, without generating so many notifications that they cause alert fatigue.
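As a rough illustration, the alarms for the metrics listed above could be described as data and turned into CloudWatch `put_metric_alarm` requests. This is a minimal Python sketch, not a recommended configuration: the thresholds, the period, the alarm names, and the resource names are placeholders, and `AlarmSpec`, `alarm_request`, and `create_alarms` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmSpec:
    name: str
    namespace: str
    metric: str
    dimension: tuple  # (dimension name, placeholder resource value)
    statistic: str
    threshold: float  # placeholder, tune per workload


EXAMPLE_ALARMS = [
    AlarmSpec("alb-5xx", "AWS/ApplicationELB", "HTTPCode_ELB_5XX_Count",
              ("LoadBalancer", "app/example/0123456789abcdef"), "Sum", 0),
    AlarmSpec("alb-latency", "AWS/ApplicationELB", "TargetResponseTime",
              ("LoadBalancer", "app/example/0123456789abcdef"), "Average", 1.0),  # seconds
    AlarmSpec("sqs-age", "AWS/SQS", "ApproximateAgeOfOldestMessage",
              ("QueueName", "example-queue"), "Maximum", 300),  # seconds
    AlarmSpec("sqs-backlog", "AWS/SQS", "ApproximateNumberOfMessagesVisible",
              ("QueueName", "example-queue"), "Maximum", 1000),
    AlarmSpec("kinesis-iterator-age", "AWS/Kinesis", "GetRecords.IteratorAgeMilliseconds",
              ("StreamName", "example-stream"), "Maximum", 60000),  # milliseconds
]


def alarm_request(spec: AlarmSpec, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for one spec without calling AWS."""
    dimension_name, dimension_value = spec.dimension
    return {
        "AlarmName": spec.name,
        "Namespace": spec.namespace,
        "MetricName": spec.metric,
        "Dimensions": [{"Name": dimension_name, "Value": dimension_value}],
        "Statistic": spec.statistic,
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": spec.threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data is not treated as a breach
        "AlarmActions": [sns_topic_arn],
    }


def create_alarms(sns_topic_arn: str) -> None:
    """Create all example alarms (needs AWS credentials)."""
    import boto3  # imported lazily so alarm_request can be used offline

    cloudwatch = boto3.client("cloudwatch")
    for spec in EXAMPLE_ALARMS:
        cloudwatch.put_metric_alarm(**alarm_request(spec, sns_topic_arn))
```

Requiring several evaluation periods before the alarm fires is one way to trade a little detection delay for fewer notifications.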

Monitoring inter-service communication (Service Connect)

ECS has three different inter-service communication options: Service Discovery, Service Connect, and App Mesh.

With Service Discovery, there is no built-in mechanism for monitoring. Service Connect provides CloudWatch metrics. App Mesh uses Envoy under the hood, which provides metrics but does not integrate with CloudWatch by default.

If Service Connect is used for inter-service communication within an ECS cluster, monitor the following CloudWatch metrics.

- HTTPCode_Target_5XX_Count: The number of responses with a 5XX error code.
- TargetResponseTime: The time elapsed (in milliseconds) after the request reached the Service Connect proxy in the target task until the proxy receives a response from the target container.

Observing container utilization

By default, ECS provides the following utilization metrics for an ECS service.

- CPUUtilization: The CPU utilization among all tasks belonging to the service.
- MemoryUtilization: The memory utilization among all tasks belonging to the service.

ECS records additional metrics after enabling Container Insights for a cluster. Among them are the following utilization metrics.

- EphemeralStorageReserved and EphemeralStorageUtilized to get insights into the storage utilization (only available for Fargate tasks/containers).
- StorageReadBytes and StorageWriteBytes to get insights into the storage throughput.
- NetworkRxBytes and NetworkTxBytes to get insights into the networking throughput.
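Container Insights is a cluster setting. The following Python sketch shows one way to switch it on for an existing cluster with boto3; the cluster name is a placeholder, and `container_insights_request` and `enable_container_insights` are hypothetical helper names.

```python
def container_insights_request(cluster_name: str, enabled: bool = True) -> dict:
    """Build update_cluster_settings arguments without calling AWS."""
    return {
        "cluster": cluster_name,
        "settings": [
            {
                "name": "containerInsights",
                "value": "enabled" if enabled else "disabled",
            }
        ],
    }


def enable_container_insights(cluster_name: str) -> None:
    """Enable Container Insights for one cluster (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works offline

    boto3.client("ecs").update_cluster_settings(
        **container_insights_request(cluster_name)
    )
```

Once enabled, the additional metrics appear in the ECS/ContainerInsights namespace in CloudWatch. Keep in mind that they are billed as custom metrics.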

Collecting and analyzing container logs

By default, ECS ships log messages to CloudWatch Logs. Compared to other solutions, CloudWatch Logs comes with zero operations and maintenance effort. With CloudWatch Logs Insights, the capabilities for analyzing log messages while debugging come close to those of other solutions like the Elastic Stack.

Figure: Analyzing log messages with CloudWatch Logs Insights
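To give an idea of what such an analysis looks like outside the console, here is a minimal Python sketch that runs a Logs Insights query with boto3. The query itself, the log group name, and the time window are illustrative assumptions, and `start_query_request` and `run_query` are hypothetical helper names.

```python
import time

# Illustrative query: the 50 most recent messages containing "error".
ERROR_QUERY = """
fields @timestamp, @logStream, @message
| filter @message like /(?i)error/
| sort @timestamp desc
| limit 50
""".strip()


def start_query_request(log_group: str, minutes: int = 60, now=None) -> dict:
    """Build start_query arguments covering the last `minutes` minutes."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": ERROR_QUERY,
    }


def run_query(log_group: str, minutes: int = 60, poll_seconds: float = 1.0) -> list:
    """Run the query and poll until it leaves the Scheduled/Running states.

    Needs AWS credentials. A failed or cancelled query returns whatever
    (possibly empty) results CloudWatch Logs reports.
    """
    import boto3  # imported lazily so the request builder works offline

    logs = boto3.client("logs")
    query_id = logs.start_query(**start_query_request(log_group, minutes))["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] not in ("Scheduled", "Running"):
            return result["results"]
        time.sleep(poll_seconds)
```

For example, `run_query("/ecs/example-service")` would search the last hour of a log group with that (placeholder) name.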

Summary

To avoid blind spots when monitoring container workloads running on ECS and Fargate, consider the following aspects:

- Event-driven monitoring with EventBridge
- Monitoring entry points like ALB, SQS, and Kinesis
- Monitoring inter-service communication (Service Connect)
- Observing container utilization
- Collecting and analyzing container logs

Andreas Wittig

I’ve been building on AWS since 2012 together with my brother Michael. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, attachmentAV, HyperEnv, and marbot.
