I'm losing trust in AWS. SNS is broken for 24 days.

Andreas Wittig – 24 Sep 2020 (updated 29 Sep 2020)

I’m frustrated. A major AWS service has been broken for 24 days. The Simple Notification Service (SNS) delivers messages to HTTPS subscriptions with a delay of more than 30 minutes. That issue impacts our SaaS business. But AWS has not fixed the problem yet and has not even revealed an ETA for resolving the issue.


The Problem

Our SaaS business runs on a Serverless architecture. Our customers send all kinds of alarms from their AWS infrastructure to our chatbot. To do so, our solution configures CloudWatch alarms and an SNS topic within our customer’s AWS accounts. The SNS topic forwards all incoming alarms to our API Gateway by using an HTTP subscription.

Our Serverless architecture: SNS sends messages to API Gateway
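To make the setup concrete, here is a minimal sketch of what such a topic, HTTPS subscription, and CloudWatch alarm could look like with boto3. This is not our production code; the topic name, endpoint URL, and alarm parameters are placeholders.

# Sketch only: SNS topic + HTTPS subscription to our API Gateway + a sample alarm.
# Topic name, endpoint URL, and alarm parameters are placeholders.
import boto3

sns = boto3.client('sns', region_name='us-east-1')
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# SNS topic collecting all CloudWatch alarms in the customer's account
topic_arn = sns.create_topic(Name='marbot-alarms')['TopicArn']

# HTTPS subscription forwarding every alarm to our API Gateway
# (the endpoint has to confirm the subscription before deliveries start)
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='https',
    Endpoint='https://api.marbot.io/v1/endpoint/xxx',  # placeholder
    ReturnSubscriptionArn=True,
)

# Example CloudWatch alarm that publishes to the topic
cloudwatch.put_metric_alarm(
    AlarmName='high-cpu',
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Statistic='Average',
    Period=300,
    EvaluationPeriods=1,
    Threshold=80,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=[topic_arn],
)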

On September 1st, a customer wrote in: they observed that CloudWatch alarms showed up in Slack with a delay of more than 30 minutes. A monitoring and incident management solution is kind of worthless with delayed alarms. Therefore, I started investigating immediately.

First of all, I had a look at the AWS Service Health Dashboard. All systems operating normally.

All systems operating normally?

Next, I analyzed our log messages to track down the issue. But I could not find any delayed messages—incoming alarms were delivered to Slack within milliseconds after arriving at our API Gateway.

I was wondering how to find out whether SNS caused the delay. But how do you investigate an issue like that? Luckily, I stumbled upon delivery status logging: an SNS topic is capable of writing delivery logs to CloudWatch Logs - the perfect way to debug a problem like that.
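For reference, here is a sketch of how delivery status logging can be enabled for HTTP/S deliveries via topic attributes. The topic ARN and the IAM role ARN (which needs permission to write to CloudWatch Logs) are placeholders.

# Sketch only: enable delivery status logging for HTTP/S deliveries.
# SNS then writes success/failure log entries to CloudWatch Logs
# (log groups named sns/<region>/<account-id>/<topic-name>).
import boto3

sns = boto3.client('sns', region_name='us-east-1')
topic_arn = 'arn:aws:sns:us-east-1:111111111111:marbot-alarms'  # placeholder
logging_role = 'arn:aws:iam::111111111111:role/SNSFeedback'     # placeholder

for attribute in ('HTTPSuccessFeedbackRoleArn', 'HTTPFailureFeedbackRoleArn'):
    sns.set_topic_attributes(
        TopicArn=topic_arn,
        AttributeName=attribute,
        AttributeValue=logging_role,
    )

# Log 100% of successful deliveries instead of a sample
sns.set_topic_attributes(
    TopicArn=topic_arn,
    AttributeName='HTTPSuccessFeedbackSampleRate',
    AttributeValue='100',
)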

I found log messages similar to this one. SNS sent a message to api.marbot.io, and our API Gateway answered with status code 204. SNS tried to deliver the message once. The important information is dwellTimeMs = 2748244. It took SNS about 45 minutes to send the alarm to our backend.

{
  "notification": {
    "messageMD5Sum": "cce294ff201662ea5c91bd5f7391e086",
    "messageId": "60926e45-a356-5d80-8e26-39a3cddfdab5",
    "topicArn": "arn:aws:sns:us-east-1:xxx",
    "timestamp": "2020-09-24 09:52:09.633"
  },
  "delivery": {
    "deliveryId": "bbd19bef-563e-5d9a-8e7a-cc092f7b993f",
    "destination": "https://api.marbot.io/v1/endpoint/xxx",
    "providerResponse": "No Content",
    "dwellTimeMs": 2748244,
    "attempts": 1,
    "statusCode": 204
  },
  "status": "SUCCESS"
}
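To see how widespread the delays were, a query along these lines lists the slowest deliveries of the past 24 hours. This is a sketch using CloudWatch Logs Insights via boto3; the log group name and the 15-minute threshold are illustrative.

# Sketch only: find deliveries with a dwell time above 15 minutes.
import time
import boto3

logs = boto3.client('logs', region_name='us-east-1')

query = """
fields notification.messageId, delivery.dwellTimeMs, delivery.statusCode
| filter delivery.dwellTimeMs > 900000
| sort delivery.dwellTimeMs desc
| limit 20
"""

start = logs.start_query(
    logGroupName='sns/us-east-1/111111111111/marbot-alarms',  # placeholder
    startTime=int(time.time()) - 24 * 3600,
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print the slowest deliveries
while True:
    result = logs.get_query_results(queryId=start['queryId'])
    if result['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in result.get('results', []):
    print({field['field']: field['value'] for field in row})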

And this is not an outlier; the same is true for all other messages as well. I immediately contacted AWS Support. Unfortunately, the story does not end here.


A Workaround

After the typical back and forth between the support engineer and the service team, I was told to remove the rate limit from our HTTP subscription. AWS announced that feature in December 2011, so I would expect it to be pretty stable, but hey. I removed the throttling policy - the parameter is named maxReceivesPerSecond - and indeed, doing so fixed the problem.

Let me clarify. We are using a rate limit of 1 message per second to avoid flooding our API Gateway in case of misconfigured alarms. Typically, only a few messages pass through the SNS topic per hour! We are far away from reaching the rate limit. Also, we are not talking about the rate limit enforced by our API Gateway, but about the rate limit configured on our customers' SNS topics and subscriptions. However, enabling a rate limit on your SNS topics or subscriptions might cause serious delays.

Fine, there is a workaround for the issue.

All we have to do is remove the throttlePolicy from all SNS topics and subscriptions. Our default delivery retry policy looks like this:

{
  "healthyRetryPolicy": {
    "minDelayTarget": 1,
    "maxDelayTarget": 60,
    "numRetries": 100,
    "numMaxDelayRetries": null,
    "numNoDelayRetries": 0,
    "numMinDelayRetries": null,
    "backoffFunction": "exponential"
  },
  "sicklyRetryPolicy": null,
  "throttlePolicy": {
    "maxReceivesPerSecond": 1
  },
  "guaranteed": false
}
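Applying the workaround to a single subscription is straightforward. Here is a sketch with boto3; the subscription ARN is a placeholder, and credentials in the customer's account are assumed.

# Sketch only: rewrite the subscription's DeliveryPolicy without the throttlePolicy.
import json
import boto3

sns = boto3.client('sns', region_name='us-east-1')
subscription_arn = 'arn:aws:sns:us-east-1:111111111111:marbot-alarms:xxx'  # placeholder

attributes = sns.get_subscription_attributes(SubscriptionArn=subscription_arn)['Attributes']
policy = json.loads(attributes.get('DeliveryPolicy', '{}'))

# The policy may look like the JSON above or be wrapped in an "http" key; handle both.
target = policy.get('http', policy)
if target.pop('throttlePolicy', None) is not None:
    sns.set_subscription_attributes(
        SubscriptionArn=subscription_arn,
        AttributeName='DeliveryPolicy',
        AttributeValue=json.dumps(policy),
    )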

However, implementing the workaround is a challenge for us. We need to update about 1,000 SNS topics. And unfortunately, those SNS topics are not part of our AWS accounts but are managed by our customers. Therefore it is quite expensive to roll out the recommended workaround.

Losing Trust

We love our Serverless architecture. However, our business depends on AWS to operate all the involved services (SNS, API Gateway, Lambda, DynamoDB, Kinesis, Step Functions, etc.) professionally and fix any issues within hours.

Unfortunately, more than 24 days have passed, and AWS has not fixed the problem. SNS messages are still delayed. Neither AWS Support nor any AWS employee I have contacted could help with the issue. To this day, we do not even know when AWS is planning to fix the problem - “Unfortunately, I cannot provide any ETA.” The AWS Service Health Dashboard still states All systems operating normally. They are not!

I’ve been building on AWS for more than eight years, and I’ve never experienced anything like this before. My trust in AWS has strongly decreased over the past month.

Update

This blog post gained traction. It did not take long until AWS reached out. I had some background conversations about our issue. I’m not sure what I am allowed to share from these conversations. Luckily, Jeff Barr - Chief Evangelist for AWS - publicly posted a response to our blog post on Reddit. Let me summarize the reaction for you.

  1. The SNS team acknowledges an issue with HTTP subscriptions when using a throttling policy (maxReceivesPerSecond).
  2. The SNS team points out that only a few customers/messages are affected by the issue.
  3. The SNS team promises to fix the problem by October 29, 2020.

I’ve collected some data to quantify the number of affected customers on our side.

20% of our customers receive at least 50% of their alarms with a delay of more than 15 minutes.

Also, based on the data we collected, it seems to me that some regions are more affected than others. The following table shows the 90th percentile of the delivery delay over a period of 24 hours for different regions.

Region Delay (ms)
us-east-1 7011903
us-west-2 1020785
ap-southeast-1 178609
eu-central-1 93179
sa-east-1 89614
eu-west-3 65389
ap-southeast-2 36381
eu-west-2 22211
eu-west-1 1893
ap-south-1 831
ap-northeast-1 699
us-east-2 367
us-west-1 350
ca-central-1 331
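The per-region percentiles can be computed from collected dwell times; here is a sketch of such a calculation with purely illustrative sample data.

# Sketch only: 90th percentile of dwell times (ms) per region; sample data is illustrative.
from statistics import quantiles

delays_by_region = {
    'us-east-1': [120, 540000, 2500000, 7011903],  # illustrative samples
    'eu-west-1': [200, 350, 900, 1893],
}

for region, delays in sorted(delays_by_region.items()):
    p90 = quantiles(delays, n=10)[-1]  # deciles; the last cut point is the p90
    print(f'{region}: p90 delay = {p90:.0f} ms')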

I’m happy to hear that AWS is working on fixing the issue that is causing us so much trouble. I’m a little bit disappointed that we might have to wait another month for a fix. Let’s hope that AWS has learned from this issue and will improve its processes to escalate issues like this much faster.

Tags: aws sns support
Andreas Wittig

I’m an independent consultant, technical writer, and founder. All of these activities have to do with AWS. I’m writing this blog and working on all our other projects together with my brother Michael.

In 2009, we joined the same company as software developers. Three years later, we were looking for a way to deploy our software—an online banking platform—in an agile way. We got excited about the possibilities in the cloud and the DevOps movement. It’s no wonder we ended up migrating the whole infrastructure of Tullius Walden Bank to AWS. This was a first in the finance industry, at least in Germany! Since 2015, we have accelerated the cloud journeys of startups, mid-sized companies, and enterprises. We have penned books like Amazon Web Services in Action and Rapid Docker on AWS, we regularly update our blog, and we are contributing to the Open Source community. Besides running a 2-headed consultancy, we are entrepreneurs building Software-as-a-Service products.

We are available for projects.

You can contact me via Email, Twitter, and LinkedIn.
