Resilient event-driven Serverless architectures: Isolate your dependencies

Michael Wittig – 10 Jul 2020

Most systems grow over time. Dependencies are added, and availability suffers. How can we design an event-driven serverless architecture that stays resilient if we add dependencies? In this blog post, I walk you through a real-world application that isolates dependencies using Kinesis data streams.

Early this year, Andreas and I decided to research how we can integrate Microsoft Teams with our ChatOps product marbot to increase our customer base. At the end of April, we launched a private beta with a couple of early adopters. On June 19, we launched marbot for Microsoft Teams.

I shared the first version of marbot’s architecture three years ago, followed by my Evolutionary Serverless Architecture talk in 2019. Today, I provide an update on the marbot architecture and show you how we added Microsoft Teams support without disrupting our Slack customers.

The following figure shows the current marbot architecture. I explain the details below, starting from the bottom left.

marbot architecture v1

Most of our customers use marbot to connect to over 50 AWS sources to get notified if things go wrong. Depending on the AWS service, an EventBridge Rule, CloudWatch Alarm, or more service-specific features are used to subscribe to the events. All events are published to an SNS topic and forwarded to marbot via HTTPS.

If marbot receives an event, we:

  1. Parse the event.
  2. Extract important information (AWS account, region, instance id, etc.).
  3. Classify the event (alert or notification).
  4. Check if similar events are available to aggregate them (deduplication).
  5. Generate links to the AWS Management Console.
  6. Start an escalation chain and send out the first message to Slack or Microsoft Teams.

We learned the hard way that external dependencies (e.g., Slack API) are unreliable. Therefore, our processing happens asynchronously and in an idempotent way. This allows us to receive all events from our customers while we retry until the external dependency becomes available again to deliver each alert eventually. marbot’s architecture is designed never to lose alerts from our customers.
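To illustrate why idempotency matters when retrying, here is a minimal Python sketch. It is not marbot's actual code: the names `process_batch`, `TemporaryFailure`, and the `delivered` set are hypothetical, and in a real system the record of delivered events would live in a durable store rather than in memory. The sketch only assumes that a failed batch is handed to the function again, which is how retries typically work for stream consumers.

```python
from typing import Callable, Iterable


class TemporaryFailure(Exception):
    """Hypothetical error raised when the external API is unavailable."""


def process_batch(
    events: Iterable[dict],
    delivered: set,
    send: Callable[[dict], None],
) -> None:
    """Deliver each event at most once, even if the batch is retried.

    `delivered` stands in for a durable store of event ids that were
    already sent (an assumption for this sketch).
    """
    for event in events:
        if event["id"] in delivered:
            # Already sent during an earlier attempt: skip to stay idempotent.
            continue
        # May raise TemporaryFailure; the caller then retries the whole batch.
        send(event)
        delivered.add(event["id"])
```

If `send` fails halfway through a batch, the retry skips everything that was already delivered and continues with the event that failed, so no alert is lost and none is sent twice.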

With the introduction of a new external dependency, we asked one question: Will this lower our availability? If our processing logic relies on Slack and Microsoft Teams, the availability of marbot will decrease. Let’s look at one example:

Availability = Kinesis Data Stream Availability x Lambda Availability x DynamoDB Availability x Slack Availability x Microsoft Teams Availability

We can simplify and concentrate on the non-AWS dependencies. Before:

Availability = Slack Availability
0.99 = 0.99

After adding Microsoft Teams:

Availability = Slack Availability x Microsoft Teams Availability
0.98 ≈ 0.99 x 0.99
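The same arithmetic as a small Python sketch. The 0.99 figures are the illustrative values from above, not measured availabilities, and multiplying them assumes that the dependencies fail independently of each other. The function name `serial_availability` is made up for this example.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every dependency must be up."""
    return prod(availabilities)


slack = 0.99  # illustrative value
teams = 0.99  # illustrative value

# One processing path that needs both Slack and Microsoft Teams.
coupled = serial_availability(slack, teams)

# Isolated paths: each customer group depends on its own platform only.
slack_path = serial_availability(slack)
teams_path = serial_availability(teams)

print(f"coupled:    {coupled:.4f}")     # prints 0.9801
print(f"slack path: {slack_path:.4f}")  # prints 0.9900
print(f"teams path: {teams_path:.4f}")  # prints 0.9900
```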

As you can see, the availability is reduced if we add a dependency. To avoid this problem, we need to isolate the processing of Slack and Microsoft Teams customers. If Slack is down, we don’t want to interrupt our Microsoft Teams customers and vice versa.

We already had one Kinesis stream. We decided to use the existing stream only for Slack customers and added a second Kinesis stream for Microsoft Teams customers. We changed the existing code to route messages to the appropriate stream. With this design, our Microsoft Teams customers will not notice if Slack is down! And if our new Microsoft Teams code has bugs, they will not affect our happy Slack customers.
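A minimal Python sketch of such a routing step could look like this. It is an illustration, not marbot's implementation: the stream names, the `platform` and `customer_id` fields, and the functions `stream_for` and `route` are all assumptions. The `put_record` parameter is any callable that accepts `StreamName`, `Data`, and `PartitionKey`, for example the `put_record` method of a boto3 Kinesis client.

```python
import json
from typing import Callable

# Hypothetical stream names, one stream per chat platform.
STREAMS = {
    "slack": "marbot-slack",
    "teams": "marbot-teams",
}


def stream_for(event: dict) -> str:
    """Pick the stream based on the customer's chat platform."""
    platform = event["platform"]
    if platform not in STREAMS:
        raise ValueError(f"unknown platform: {platform!r}")
    return STREAMS[platform]


def route(event: dict, put_record: Callable[..., object]) -> None:
    """Write the event to the stream that belongs to its platform."""
    put_record(
        StreamName=stream_for(event),
        Data=json.dumps(event).encode("utf-8"),
        # Assumption: partitioning by customer keeps a customer's events together.
        PartitionKey=event["customer_id"],
    )
```

Because the routing decision happens before the event enters a stream, a backlog or a bug in one consumer stays within its own stream and does not block the other one.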

Besides our logic that runs behind Kinesis streams, we also have logic that runs behind a DynamoDB stream. This logic does not impact the availability of marbot, so we have not introduced an isolation layer there.

We also have a bunch of Step Functions state machines to implement:

  • Escalation chains.
  • Onboarding functionality to welcome new users.
  • Cleanup tasks to delete data if marbot is uninstalled.

The cool thing about a Step Functions execution is that each execution is isolated by default. Therefore, we use some of the state machines for both Slack and Microsoft Teams customers. But we also have state machines that are needed only in the context of Slack or Microsoft Teams.

Summary

If you add dependencies to your architecture, you likely decrease your system’s availability. Where possible, isolate the dependencies from each other so that an outage of one does not lower the availability of the rest of your system.
