The Lambda monitoring blind spot

Andreas Wittig – 04 Oct 2023

After a customer complained that a feature of marbot, our monitoring solution for AWS, was not working as expected, I started debugging the issue. First, I checked the CloudWatch alarms we use to monitor all our Lambda functions. All alarms were in state OK, and we had not received any alerts via Slack either. Next, I analyzed the CloudWatch logs. To my surprise, I found that one of our Lambda functions failed from time to time. I was shocked by the blind spot in our monitoring configuration.

Are you using CloudWatch alarms for Lambda function monitoring as well? Read on to ensure you avoid making the same mistake we did.

Problem

For some reason, the CloudWatch alarms we configured to get notified about failed executions of Lambda functions did not work correctly. Here is an excerpt from our CloudFormation code to configure CloudWatch alarms.

The ErrorsAlarm monitors the Errors metric of the LambdaFunction. As soon as the sum of errors within the past 5 minutes exceeds 0, the alarm flips to state ALARM.

LambdaFunction:
  Type: 'AWS::Lambda::Function'
  Properties:
    Architectures: ['arm64']
    Handler: 'index.handler'
    Runtime: 'nodejs18.x'
    MemorySize: 1536
    Timeout: 900 # 15 min
    # ...
ErrorsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'An error occurred while executing the Lambda function.'
    Namespace: 'AWS/Lambda'
    MetricName: Errors
    Dimensions:
    - Name: FunctionName
      Value: !Ref LambdaFunction
    Statistic: Sum
    Period: 300 # 5 min
    EvaluationPeriods: 1
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold

Sounds fine. Here is the catch.

“The timestamp on a metric reflects when the function was invoked. Depending on the duration of the invocation, this can be several minutes before the metric is emitted. For example, if your function has a 10-minute timeout, then look more than 10 minutes in the past for accurate metrics.” (see Working with Lambda function metrics)

The following figure illustrates that when Lambda writes metric data, it uses the timestamp of the function invocation (start).

CloudWatch alarm monitoring a Lambda function: CloudWatch Evaluation Period must cover at least the Function Timeout Period

In our case, we set the timeout of the LambdaFunction to the maximum of 15 minutes. But the CloudWatch alarm looks back only 5 minutes. Because Lambda uses the invocation timestamp when it writes a data point to the Errors metric, the alarm misses errors from invocations that run longer than 5 minutes.
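To make the blind spot concrete, here is a small worked timeline. It is a simplified model, not a description of CloudWatch internals: the symbols $t_0$ (invocation start), $d$ (run time until the failure), and $t$ (the moment the alarm is evaluated) are introduced here for illustration, and the 14-minute run time is an assumed value.

```latex
\begin{align*}
t_0 &= 0\,\text{s}, \quad d = 840\,\text{s} && \text{(invocation starts, fails after 14 min)} \\
\text{datapoint timestamp} &= t_0 = 0\,\text{s} && \text{(Errors is recorded at the invocation start)} \\
t &\ge t_0 + d = 840\,\text{s} && \text{(earliest moment the alarm can see the error)} \\
t - 300\,\text{s} &\ge 540\,\text{s} > t_0 && \text{(start of the 5-minute window lies after the datapoint)}
\end{align*}
```

So by the time the error exists, the period that contains it has already dropped out of the window the alarm evaluates.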

Solution

To avoid blind spots when monitoring Lambda functions with CloudWatch alarms, stick to the following rule.

CloudWatch Evaluation Period > Lambda Function Timeout
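Spelled out with the alarm properties, the rule can be read as follows. The symbols are shorthand introduced here: $P$ for Period, $N$ for EvaluationPeriods, and $T$ for the function Timeout, all in seconds. The numbers are the ones from the templates in this post.

```latex
\begin{align*}
N \cdot P &> T \\
\text{before:}\quad 1 \cdot 300 = 300 &< 900 && \text{(blind spot)} \\
\text{after:}\quad 4 \cdot 300 = 1200 &> 900 && \text{(window covers the timeout)}
\end{align*}
```

With a 5-minute period and a 15-minute timeout, 4 is the smallest number of evaluation periods that satisfies the rule.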

Back to our case: we extended the evaluation window of the ErrorsAlarm to 20 minutes by raising EvaluationPeriods from 1 to 4.

LambdaFunction:
  Type: 'AWS::Lambda::Function'
  Properties:
    Architectures: ['arm64']
    Handler: 'index.handler'
    Runtime: 'nodejs18.x'
    MemorySize: 1536
    Timeout: 900 # 15 min
    # ...
ErrorsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'An error occurred while executing the Lambda function.'
    Namespace: 'AWS/Lambda'
    MetricName: Errors
    Dimensions:
    - Name: FunctionName
      Value: !Ref LambdaFunction
    Statistic: Sum
    Period: 300 # 5 min
    EvaluationPeriods: 4 # 4 x 5 min = 20 minutes
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
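One detail worth double-checking, which goes beyond the template above: as far as I understand the CloudWatch alarm semantics, an alarm without DatapointsToAlarm expects every datapoint in the evaluation window to breach the threshold. If the function also runs successfully during those 20 minutes, a single failed invocation might therefore not be enough to trigger the alarm. The following sketch shows a possible "1 out of 4" variant. DatapointsToAlarm is a real property of AWS::CloudWatch::Alarm, but the value is an illustrative suggestion and not taken from the template above.

```yaml
ErrorsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    # ... other properties as in the template above ...
    Period: 300 # 5 min
    EvaluationPeriods: 4 # 4 x 5 min = 20 minutes
    DatapointsToAlarm: 1 # assumption: one breaching datapoint out of 4 is enough to alarm
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
```

The trade-off is that the alarm stays in state ALARM until the failing datapoint has left the 20-minute window.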

So, check the configuration of your CloudWatch alarms monitoring Lambda functions!

Andreas Wittig

I’ve been building on AWS since 2012 together with my brother Michael. We share our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, HyperEnv for GitHub Actions, and marbot.
