Monitoring a critical part of your infrastructure: Amazon ElastiCache memcached cluster

Michael Wittig – 22 Jan 2018

In most of my projects where end-user latency is important, I add a caching layer to the architecture. The goal of a caching layer is to reduce load on the database and to speed up the most popular data retrievals. In one project, I was asked to investigate an increase in latency over the previous months. Just by looking at the CloudWatch metrics, I discovered that the memcached cluster had a high number of evictions. An eviction occurs when a new item is added but the cluster does not have enough memory to store it. The cluster then removes old items to make room for the new one. So, the technical reason was that the cluster did not have enough memory to hold all the cached data. Old items were evicted and therefore no longer cached, which increased latency for requests that tried to access those no-longer-cached items. The business reason was that the number of users was growing. Three months earlier, the cache was large enough. But now, with twice the number of users, the cache cluster was too small. The bottom line: always monitor your critical components so that you are notified of such problems before they occur.

Amazon ElastiCache provides Redis and memcached as a Service. The fully managed service covers a lot of the challenges of operating an in-memory cache (e.g., cluster management, patching the operating system and the caching system, …). But you are still responsible for some operational aspects: sizing and performance optimizations. Therefore, you need to monitor every ElastiCache cluster that serves production workloads.

This blog post covers memcached. Redis is not in scope.

Monitoring your whole cloud infrastructure is a complex task, as Andreas pointed out in his AWS Monitoring Primer. In this blog post, I will focus on the relevant parts for monitoring your ElastiCache memcached cluster:

I guide you to the relevant AWS monitoring services and features offered by AWS.
I present best practices based on real-world client projects.
I provide a CloudFormation template that implements all ideas in the post.
You can use the template to monitor any ElastiCache memcached cluster in a minute.

Let’s get started!

Identifying important CloudWatch metrics

Each ElastiCache memcached cluster sends metrics to CloudWatch.

CloudWatch metrics expose internals of the ElastiCache memcached cluster

The most important metrics are:

area	metric	description	relevance
CPU	CPUUtilization	The percentage of CPU utilization.	If the CPU is highly utilized, latency is added because computing tasks have to wait until they are scheduled.
Memory	Evictions	The number of non-expired items the cache evicted to allow space for new writes.	Items are evicted if you are running out of memory.
Memory	SwapUsage	The amount of swap used on the host in bytes.	If memory is moved to disk, performance usually suffers.
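To look at these metrics outside the AWS Management Console, you can query them through the CloudWatch API. The following is a minimal sketch that only builds the request parameters for `GetMetricStatistics`. The cluster name, the three-hour window, the one-minute period, and the statistic chosen per metric are assumptions for illustration, not values prescribed by this post.

```python
from datetime import datetime, timedelta, timezone

# Namespace under which ElastiCache publishes its CloudWatch metrics.
NAMESPACE = "AWS/ElastiCache"

# Metric name -> statistic to request. The statistic per metric is an assumption.
METRICS = {
    "CPUUtilization": "Average",
    "Evictions": "Sum",
    "SwapUsage": "Maximum",
}


def metric_query(cluster_id, metric_name, end_time, hours=3, period_seconds=60):
    """Build keyword arguments for a CloudWatch GetMetricStatistics call."""
    return {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period_seconds,
        "Statistics": [METRICS[metric_name]],
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # "my-memcached-cluster" is a hypothetical cluster ID.
    for name in METRICS:
        print(metric_query("my-memcached-cluster", name, now))
```

With boto3 installed and credentials configured, each dictionary could be passed on as `boto3.client("cloudwatch").get_metric_statistics(**params)`. Keeping the parameters in a plain function makes them easy to inspect before any API call is made.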

Once important metrics are identified, you can use them to understand how a healthy system differs from an impacted system.

Defining thresholds

One of the hardest parts of monitoring is defining what healthy means. For each metric, you have to define a threshold between healthy and impacted. E.g., you regard CPU utilization under 80% as healthy because the application was never impacted while utilization stayed below that level. Thresholds are based on observations from the past. They might need adjustment in the future.

We don’t know about the whole application here. We can only reason about one component: the cache. Application monitoring (e.g., HTTP 5XX responses, latency, sign-ups) is a different topic.

From our experience and the AWS documentation, we usually start with the following thresholds to identify unhealthy behavior and adjust them over time.

area	metric	comparison operator	threshold	rationale
CPU	CPUUtilization	>	80 %	Queuing theory tells us that latency rises sharply as utilization approaches 100%. In practice, we see higher latency when utilization exceeds 80% and unacceptably high latency with utilization above 90%.
Memory	Evictions	>	1000	This number is derived from our experience with ElastiCache workloads. 1000 evictions per second with an item size of 10 KB imply the cluster is releasing 10 MB of memory per second due to evictions.
Memory	SwapUsage	>	256 MB	Sometimes you cannot entirely avoid swapping. But once the cache accesses paged memory, it will slow down.
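The arithmetic behind the CPU and eviction rationales can be made concrete with a short sketch. It assumes a simple M/M/1 queue, in which the mean time in the system is the service time divided by (1 - utilization). The post does not name a queuing model, so treat this as an illustration rather than a description of how memcached behaves. The 10 KB item size is read as 10,000 bytes.

```python
def latency_factor(utilization):
    """Mean time in system relative to the bare service time (M/M/1 assumption).

    For an M/M/1 queue this factor is 1 / (1 - utilization).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


def evicted_bytes_per_second(evictions_per_second, item_size_bytes):
    """Memory released per second because items are evicted."""
    return evictions_per_second * item_size_bytes


if __name__ == "__main__":
    for utilization in (0.5, 0.8, 0.9, 0.95):
        factor = latency_factor(utilization)
        print(f"{utilization:.0%} busy -> {factor:.1f}x service time")

    released = evicted_bytes_per_second(1000, 10_000)  # 10 KB = 10,000 bytes
    print(f"{released / 1_000_000:.0f} MB released per second")
```

Under this model, going from 80% to 90% utilization roughly doubles the time a request spends in the system, which matches the observation that latency becomes noticeable above 80% and unacceptable above 90%.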

Now you know what healthy/unhealthy means. It’s time to define CloudWatch Alarms to send you an alert if a metric exceeds its threshold.
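As a rough sketch, the thresholds from the table can be turned into CloudWatch Alarm definitions like this. The code only builds the parameters for `PutMetricAlarm`. The statistic, period, number of evaluation periods, alarm naming scheme, and the reading of 256 MB as 256 * 1024 * 1024 bytes are assumptions. This is not the CloudFormation template mentioned in this post.

```python
# Thresholds from the table above. The statistic per metric is an assumption.
THRESHOLDS = [
    {"metric": "CPUUtilization", "threshold": 80.0, "statistic": "Average"},
    {"metric": "Evictions", "threshold": 1000.0, "statistic": "Sum"},
    # SwapUsage is reported in bytes; 256 MB is read as 256 * 1024 * 1024 here.
    {"metric": "SwapUsage", "threshold": 256 * 1024 * 1024, "statistic": "Maximum"},
]


def alarm_definitions(cluster_id, topic_arn, period_seconds=60, evaluation_periods=1):
    """Build one set of PutMetricAlarm keyword arguments per threshold.

    Note that the period decides what a count threshold means: with a
    Sum statistic, 1000 evictions are counted per period, not per second.
    """
    alarms = []
    for entry in THRESHOLDS:
        alarms.append({
            "AlarmName": f"{cluster_id}-{entry['metric']}",  # hypothetical naming scheme
            "Namespace": "AWS/ElastiCache",
            "MetricName": entry["metric"],
            "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
            "Statistic": entry["statistic"],
            "Period": period_seconds,
            "EvaluationPeriods": evaluation_periods,
            "ComparisonOperator": "GreaterThanThreshold",
            "Threshold": entry["threshold"],
            # The alarm action publishes to an SNS topic.
            "AlarmActions": [topic_arn],
        })
    return alarms


if __name__ == "__main__":
    # Cluster ID and topic ARN are placeholders.
    topic = "arn:aws:sns:us-east-1:123456789012:alerts"
    for alarm in alarm_definitions("my-memcached-cluster", topic):
        print(alarm["AlarmName"], ">", alarm["Threshold"])
```

With boto3 installed and credentials configured, each dictionary could be passed on as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Adjust the period and evaluation periods to how quickly you want to be alerted.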

Observing metrics with CloudWatch Alarms and marbot

A CloudWatch Alarm continuously watches a metric. Once the threshold is reached, an action is performed that sends a message to an SNS topic. From this topic, you can then send yourself an email. We found that emails are not a good way to handle alerts. In a team, multiple people are responsible. If you send an email to a group email address:

Your team has no idea if someone already started to work on solving the issue.
You disturb the whole team for each alert.
It’s easy to ignore an email.
You have no statistics about how many alerts are generated. Too many alerts are an indication that your team is no longer able to handle them.
No help to investigate the issue is available, like links to the AWS Management Console.

To solve the problem, we built marbot: a Slack chatbot that manages and escalates AWS alerts for you.

marbot forwards alerts to Slack

marbot sends alerts to a single user from the Slack channel via a direct message. If the user doesn’t acknowledge the alert within 5 minutes, marbot will escalate to the next level. Escalations minimize distraction while keeping response time low. Try marbot for free now.

Other sources

Besides metrics, ElastiCache sends out notifications if the state of the cache cluster has changed, e.g., because of a node failure. Unfortunately, you cannot filter the types of notifications in ElastiCache. You have to filter them on your side, as marbot does.

A sample alert follows:

marbot delivers ElastiCache notifications to Slack
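If you filter ElastiCache notifications yourself, the logic could look roughly like the sketch below. It assumes that the SNS message body is a small JSON object that maps an event type to the affected resource, and it uses a hand-picked set of event types. Both are assumptions to verify against the ElastiCache documentation. This is not how marbot is implemented.

```python
import json

# Event types that should raise an alert. The selection is an assumption;
# check the names against the current ElastiCache event list.
ALERT_EVENTS = {
    "ElastiCache:AddCacheNodeFailed",
    "ElastiCache:CacheClusterProvisioningFailed",
    "ElastiCache:CacheNodeReplaceStarted",
    "ElastiCache:CacheNodeReplaceComplete",
    "ElastiCache:CacheNodesRebooted",
}


def alerts_in_message(sns_message):
    """Return (event_type, resource) pairs from one message that warrant an alert.

    The assumed message shape is '{"ElastiCache:<EventType>": "<resource>"}'.
    Anything that does not parse into a JSON object is ignored.
    """
    try:
        payload = json.loads(sns_message)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    return [(event, resource) for event, resource in payload.items()
            if event in ALERT_EVENTS]


if __name__ == "__main__":
    # "my-cluster" is a hypothetical cluster ID.
    messages = [
        '{"ElastiCache:CacheNodeReplaceComplete": "my-cluster"}',
        '{"ElastiCache:CacheClusterParametersChanged": "my-cluster"}',
    ]
    for message in messages:
        print(alerts_in_message(message))
```

A function like this could run in whatever subscribes to the SNS topic, for example a Lambda function, and forward only the matching events.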

CloudFormation template

We developed a CloudFormation template to monitor an ElastiCache memcached cluster in any region. The template integrates with marbot, but you can modify it to send out emails. The template is available on GitHub for free.

If you have already installed marbot, you can also ask marbot to monitor your ElastiCache memcached cluster or read more detailed setup instructions. Otherwise: Try marbot for free now.

Summary

There are multiple options available to monitor an ElastiCache memcached cluster, most importantly CloudWatch Metrics and Alarms. But you should not forget about ElastiCache notifications. Otherwise, you miss events like node failures.

CloudWatch Alarms can trigger actions. The obvious choice is to send out an email if a metric exceeds a threshold. But we recommend against using email. Instead, use a tool like marbot. marbot comes with alert escalation, deduplication, and context-aware links to the AWS Management Console.

Michael Wittig

I’ve been building on AWS since 2012 together with my brother Andreas. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, HyperEnv for GitHub Actions, and marbot.
