CloudWatch Metrics & Alarms reloaded

Michael Wittig – 26 Mar 2020

Amazon CloudWatch has improved significantly over the years. It's time to look at its monitoring capabilities again. CloudWatch is an excellent starting point to implement enhanced monitoring on AWS. In this blog post, I demonstrate what you can do with CloudWatch metrics and alarms. Metrics provide a time-series database for telemetry (e.g., CPU utilization of an EC2 instance). Alarms watch a metric and trigger actions if a threshold is reached.

Metrics

A metric stores your telemetry data for 🆕 15 months. The highest resolution is 🆕 one second. The resolution of the data is automatically reduced over time, as the following figure demonstrates.

CloudWatch Metric aging

Let me give you an example: If you have data points with a 1-minute resolution that become older than 15 days, CloudWatch will combine 5 data points into one as the following tables show:

| Time  | Value |
| ----- | ----- |
| 08:09 | 81    |
| 08:08 | 39    |
| 08:07 | 32    |
| 08:06 | 36    |
| 08:05 | 31    |

The result is no longer a table of raw values; instead, CloudWatch stores statistics that describe the values:

| Time  | Samples | Min | Max | Sum |
| ----- | ------- | --- | --- | --- |
| 08:09 | 5       | 31  | 81  | 219 |
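The roll-up shown in the two tables can be reproduced in a few lines of Python. This is only an illustrative sketch: `rollup` is a hypothetical helper, not part of any AWS SDK, and it says nothing about how CloudWatch implements the aging internally.

```python
from typing import Dict, List, Tuple, Union

def rollup(points: List[Tuple[str, float]]) -> Dict[str, Union[str, float]]:
    """Collapse raw (time, value) data points into one set of statistics."""
    values = [value for _, value in points]
    return {
        # The table above labels the combined data point with the latest time.
        "time": max(time for time, _ in points),
        "samples": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
    }

raw = [("08:09", 81), ("08:08", 39), ("08:07", 32), ("08:06", 36), ("08:05", 31)]
stats = rollup(raw)
print(stats)
```

The average is not stored separately in this sketch because it can be derived from the sum and the number of samples (219 / 5 = 43.8).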

You can compute time window statistics (aka aggregations) over your metric data. The following functions are supported: average, minimum, maximum, sum, and 🆕 percentiles. Let me give you an example. The following data is stored in CloudWatch.

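| Time  | Value |
| ----- | ----- |
| 10:17 | 84    |
| 10:16 | 73    |
| 10:15 | 90    |
| 10:14 | 85    |
| 10:13 | 74    |
| 10:12 | 83    |
| 10:11 | 45    |
| 10:10 | 65    |
| 10:09 | 81    |
| 10:08 | 39    |
| 10:07 | 32    |
| 10:06 | 36    |
| 10:05 | 31    |
| 10:04 | 45    |
| 10:03 | 56    |
| 10:02 | 71    |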

Let me give you an aggregation example: for all values between 10:05 (inclusive) and 10:15 (exclusive), look at 5-minute time windows (aka periods), and compute the maximum.

| Time Window    | Values             |
| -------------- | ------------------ |
| [10:10, 10:15[ | 65, 45, 83, 74, 85 |
| [10:05, 10:10[ | 31, 36, 32, 39, 81 |

The result will look like this:

| Time Window    | Maximum |
| -------------- | ------- |
| [10:10, 10:15[ | 85      |
| [10:05, 10:10[ | 81      |
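The windowed aggregation above can be sketched in plain Python. The helpers `to_minutes`, `to_hhmm`, and `aggregate` are hypothetical names chosen for this illustration; the sketch aligns windows to the start time of the query, which is an assumption made to keep the example short.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' into minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def to_hhmm(total: int) -> str:
    """Convert minutes since midnight back into 'HH:MM'."""
    return f"{total // 60:02d}:{total % 60:02d}"

def aggregate(
    points: Iterable[Tuple[str, float]],
    start: str,
    end: str,
    period: int,
    statistic: Callable[[List[float]], float] = max,
) -> Dict[str, float]:
    """Group points in [start, end[ into windows of `period` minutes
    and apply `statistic` to the values of each window."""
    lo, hi = to_minutes(start), to_minutes(end)
    windows: Dict[int, List[float]] = defaultdict(list)
    for time, value in points:
        t = to_minutes(time)
        if lo <= t < hi:  # start is inclusive, end is exclusive
            window_start = lo + ((t - lo) // period) * period
            windows[window_start].append(value)
    return {to_hhmm(w): statistic(v) for w, v in sorted(windows.items())}

data = [
    ("10:17", 84), ("10:16", 73), ("10:15", 90), ("10:14", 85),
    ("10:13", 74), ("10:12", 83), ("10:11", 45), ("10:10", 65),
    ("10:09", 81), ("10:08", 39), ("10:07", 32), ("10:06", 36),
    ("10:05", 31), ("10:04", 45), ("10:03", 56), ("10:02", 71),
]
result = aggregate(data, "10:05", "10:15", period=5, statistic=max)
print(result)  # {'10:05': 81, '10:10': 85}
```

Passing `min` or `sum` as the `statistic` argument computes the other aggregations over the same windows.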

🆕 Metric Math allows you to combine multiple metrics. We have already covered several metric math use cases on our blog.
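As a generic illustration of the idea (not one of the use cases from the blog), the following sketch computes an error rate from two metrics, period by period. The metric names and values are made up, and `error_rate` is a hypothetical helper; in CloudWatch itself you would write an arithmetic expression over metric IDs instead of Python.

```python
from typing import Dict, Optional

def error_rate(
    errors: Dict[str, float], requests: Dict[str, float]
) -> Dict[str, Optional[float]]:
    """Combine two metrics into a third one: the error rate in percent."""
    rate: Dict[str, Optional[float]] = {}
    for time in sorted(requests):
        total = requests[time]
        failed = errors.get(time, 0)  # no data point means no errors here
        # Avoid a division by zero for periods without any requests.
        rate[time] = 100 * failed / total if total else None
    return rate

requests = {"10:05": 200, "10:10": 250, "10:15": 0}
errors = {"10:05": 4, "10:10": 10}
rates = error_rate(errors, requests)
print(rates)
```

An alarm on the combined metric reacts to the share of failing requests rather than to the absolute number of errors.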

Most AWS services (e.g., EC2, RDS, and many more) report telemetry data to CloudWatch Metrics out of the box. On top of that, you can also send your own data (aka custom metrics).
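Sending a custom metric could look roughly like the following sketch, which uses the `put_metric_data` call of boto3. The namespace `MyApp` and the metric name `QueueDepth` are invented for this example, and `build_custom_metric` and `publish` are hypothetical helpers. Calling `publish` requires the boto3 package and AWS credentials.

```python
from datetime import datetime, timezone
from typing import Any, Dict

def build_custom_metric(
    name: str, value: float, unit: str = "Count", high_resolution: bool = False
) -> Dict[str, Any]:
    """Build the keyword arguments for CloudWatch's PutMetricData API."""
    return {
        "Namespace": "MyApp",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": name,
                "Timestamp": datetime.now(timezone.utc),
                "Value": value,
                "Unit": unit,
                # 1 = one-second resolution, 60 = standard resolution
                "StorageResolution": 1 if high_resolution else 60,
            }
        ],
    }

def publish(request: Dict[str, Any]) -> None:
    """Send the data point to CloudWatch (needs boto3 and credentials)."""
    import boto3  # imported lazily so the sketch runs without boto3

    boto3.client("cloudwatch").put_metric_data(**request)

request = build_custom_metric("QueueDepth", 42, high_resolution=True)
print(request["MetricData"][0]["MetricName"])
# publish(request)  # uncomment to send the data point for real
```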

Last but not least, creating dashboards that show multiple metrics in one place is a handy feature. The following figure shows a CloudWatch dashboard of our product bucketAV - Antivirus for Amazon S3.

bucketAV Dashboard

Alarms

A metric alarm (previously just alarm) continually runs an aggregation over the latest period(s) and checks the result against a threshold. A 🆕 composite alarm continually checks other alarms.

Composite alarms are charged twice: you pay for the metric alarms and for the composite alarm. You likely don't need composite alarms; use metric math instead!

An alarm can be in three states:

  • OK: Threshold is not reached
  • ALARM: Threshold is reached
  • INSUFFICIENT_DATA: No data available

Whenever the state of an alarm changes, it triggers an action:

  • Send a message to SNS
  • Execute an EC2 auto-scaling action
  • Execute an EC2 recovery action

🆕 You can configure how an alarm deals with missing data:

  • ignore it altogether
  • treat it as OK
  • treat it as ALARM
  • old behavior: go to state INSUFFICIENT_DATA
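The four options correspond to the `TreatMissingData` values `ignore`, `notBreaching`, `breaching`, and `missing` of the PutMetricAlarm API. The following sketch models their effect for a single period. `next_state` is a hypothetical helper and a deliberate simplification: the real evaluation over multiple periods is more nuanced.

```python
from typing import Optional

def next_state(
    current_state: str,
    value: Optional[float],
    threshold: float,
    treat_missing_data: str = "missing",
) -> str:
    """Return the alarm state after evaluating one period.

    `value` is None when no data arrived during the period.
    """
    if value is not None:
        return "ALARM" if value > threshold else "OK"
    if treat_missing_data == "ignore":
        return current_state  # keep whatever state the alarm was in
    if treat_missing_data == "notBreaching":
        return "OK"
    if treat_missing_data == "breaching":
        return "ALARM"
    return "INSUFFICIENT_DATA"  # "missing", the old behavior

for option in ("ignore", "notBreaching", "breaching", "missing"):
    print(option, "->", next_state("OK", None, 80, option))
```

Treating missing data as ALARM is useful for heartbeat-style metrics, where silence itself is the problem.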

A simple alarm could be:

If the average CPU utilization over the last 5-minute period is greater than 80, then send a message to an SNS topic.

| placeholder | value                          |
| ----------- | ------------------------------ |
| statistic   | average                        |
| metric      | CPU utilization                |
| period      | 5 minutes                      |
| comparator  | greater than                   |
| threshold   | 80                             |
| action      | send a message to an SNS topic |

Formula: If the $statistic $metric over the last $period period is $comparator $threshold, then $action.
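The placeholders map quite directly to the parameters of the PutMetricAlarm API. The sketch below builds the arguments for boto3's `put_metric_alarm`. The alarm name, the instance ID, and the topic ARN are placeholder values invented for this example, and `build_simple_alarm` is a hypothetical helper.

```python
from typing import Any, Dict

def build_simple_alarm(instance_id: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Build the keyword arguments for a single-period CPU alarm."""
    return {
        "AlarmName": "cpu-high",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",          # $metric
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",                  # $statistic
        "Period": 300,                           # $period in seconds
        "EvaluationPeriods": 1,                  # only the last period
        "ComparisonOperator": "GreaterThanThreshold",  # $comparator
        "Threshold": 80.0,                       # $threshold
        "AlarmActions": [sns_topic_arn],         # $action
    }

alarm = build_simple_alarm(
    instance_id="i-0123456789abcdef0",  # placeholder instance ID
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:alerts",  # placeholder
)
print(alarm["Statistic"], alarm["Period"], alarm["Threshold"])
# With boto3 and credentials available, the alarm could be created with:
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```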

To make alarms more stable, you can also look at the last N periods instead of only the last period:

If the average CPU utilization over the last three 5-minute periods is greater than 80, then send a message to an SNS topic.

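| placeholder        | value                          |
| ------------------ | ------------------------------ |
| statistic          | average                        |
| metric             | CPU utilization                |
| period             | 5 minutes                      |
| evaluation-periods | 3                              |
| comparator         | greater than                   |
| threshold          | 80                             |
| action             | send a message to an SNS topic |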

Formula: If the $statistic $metric over the last $evaluation-periods $period periods is $comparator $threshold, then $action.
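To see why looking at several periods makes an alarm more stable, here is a small sketch of the rule. It assumes that each list entry is the already aggregated average of one 5-minute period, oldest first. `breaches_all_periods` is a hypothetical helper.

```python
from typing import Sequence

def breaches_all_periods(
    averages: Sequence[float], threshold: float, evaluation_periods: int
) -> bool:
    """True if each of the last `evaluation_periods` values exceeds the threshold."""
    recent = averages[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough periods to evaluate yet
    return all(value > threshold for value in recent)

sustained_load = [70, 85, 90, 82]
single_spike = [60, 60, 95]

print(breaches_all_periods(sustained_load, 80, 3))  # True
print(breaches_all_periods(single_spike, 80, 3))    # False
```

With only one evaluation period, the single spike of 95 would have triggered the action.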

🆕 M out of N logic is less prone to short spikes:

If the average CPU utilization over the last four 5-minute periods is greater than 80 at least 2 times, then send a message to an SNS topic.

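| placeholder         | value                          |
| ------------------- | ------------------------------ |
| statistic           | average                        |
| metric              | CPU utilization                |
| period              | 5 minutes                      |
| evaluation-periods  | 4                              |
| datapoints-to-alarm | 2                              |
| comparator          | greater than                   |
| threshold           | 80                             |
| action              | send a message to an SNS topic |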

Formula: If the $statistic $metric over the last $evaluation-periods $period periods is $comparator $threshold for at least $datapoints-to-alarm times, then $action.
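The M out of N rule can be sketched as follows, again assuming that each list entry is the aggregated average of one 5-minute period, oldest first. `m_out_of_n` is a hypothetical helper. In the PutMetricAlarm API, the two placeholders correspond to the parameters `EvaluationPeriods` and `DatapointsToAlarm`.

```python
from typing import Sequence

def m_out_of_n(
    averages: Sequence[float],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> bool:
    """True if at least M of the last N values exceed the threshold."""
    recent = averages[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm

two_breaches = [60, 95, 70, 85]
one_breach = [60, 95, 70, 75]

print(m_out_of_n(two_breaches, 80, evaluation_periods=4, datapoints_to_alarm=2))  # True
print(m_out_of_n(one_breach, 80, evaluation_periods=4, datapoints_to_alarm=2))    # False
```

Unlike the rule that requires every period to breach, the breaching periods do not have to be consecutive here.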

One problem with the approach so far is that you have to set the threshold yourself. With 🆕 anomaly detection, CloudWatch trains a model to predict the threshold based on the past. Anomaly detection recognizes trends of the past two weeks. It is aware of hourly, daily, and weekly patterns. Keep in mind that anomaly detection only compares the present with the past two weeks. If your latency is worse every evening, that will look just fine to anomaly detection!
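An alarm based on anomaly detection references a band expression instead of a static threshold. The sketch below builds the arguments for boto3's `put_metric_alarm` with the metric math function `ANOMALY_DETECTION_BAND`. The alarm name, the instance ID, the topic ARN, and the band width of 2 are assumptions made for this example, and `build_anomaly_alarm` is a hypothetical helper.

```python
from typing import Any, Dict

def build_anomaly_alarm(
    instance_id: str, sns_topic_arn: str, band_width: int = 2
) -> Dict[str, Any]:
    """Build the keyword arguments for an alarm without a static threshold."""
    return {
        "AlarmName": "cpu-anomaly",  # hypothetical name
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 1,
        "ThresholdMetricId": "ad1",  # points to the band expression below
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [
                            {"Name": "InstanceId", "Value": instance_id}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "ad1",
                # The second argument controls the width of the expected band.
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "CPUUtilization (expected)",
                "ReturnData": True,
            },
        ],
        "AlarmActions": [sns_topic_arn],
    }

anomaly_alarm = build_anomaly_alarm(
    instance_id="i-0123456789abcdef0",  # placeholder instance ID
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:alerts",  # placeholder
)
print(anomaly_alarm["Metrics"][1]["Expression"])
# With boto3 and credentials available, the alarm could be created with:
# boto3.client("cloudwatch").put_metric_alarm(**anomaly_alarm)
```

Note that there is no `Threshold` parameter: the upper edge of the band takes its place.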

Summary

CloudWatch Metrics and Alarms are getting more powerful every year. Data can now be tracked at a 1-second resolution and is stored for 15 months. The new percentile statistic is an excellent fit for monitoring latencies. Metric math provides rich capabilities to work with multiple metrics at once.

The new composite alarms group multiple alarms together (easy, but more expensive than metric math). With the M out of N logic, you can ensure that your alarms can better deal with short spikes. Finally, anomaly detection can replace a static threshold.

Michael Wittig

I've been building on AWS since 2012 together with my brother Andreas. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we're currently working on bucketAV, attachmentAV, HyperEnv, and marbot.