Resilient event-driven Serverless architectures: Isolate your dependencies

Michael Wittig – 10 Jul 2020

Most systems grow over time. Dependencies are added, and availability suffers. How can we design an event-driven serverless architecture that stays resilient if we add dependencies? In this blog post, I walk you through a real-world application that isolates dependencies using Kinesis data streams.

Isolate your dependencies

Early this year, Andreas and I decided to research how we can integrate Microsoft Teams with our ChatOps product marbot to increase our customer base. At the end of April, we launched a private beta with a couple of early adopters. On June 19, we launched marbot for Microsoft Teams.

I shared the first version of marbot’s architecture three years ago, followed by my Evolutionary Serverless Architecture talk in 2019. Today, I provide an update on the marbot architecture and show you how we added Microsoft Teams support without disrupting our Slack customers.

The following figure shows the current marbot architecture. I explain the details below, starting from the bottom left.

marbot architecture v1

Most of our customers use marbot to connect to over 50 AWS sources to get notified if things go wrong. Depending on the AWS service, an EventBridge Rule, CloudWatch Alarm, or more service-specific features are used to subscribe to the events. All events are published to an SNS topic and forwarded to marbot via HTTPS.

If marbot receives an event, we:

  1. Parse the event.
  2. Extract important information (AWS account, region, instance id, etc.).
  3. Classify the event (alert or notification).
  4. Check if similar events are available to aggregate them (deduplication).
  5. Generate links to the AWS Management Console.
  6. Start an escalation chain and send out the first message to Slack or Microsoft Teams.
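Steps 1 to 5 above can be sketched as follows. This is a minimal Python sketch, not marbot's actual code. It assumes the incoming event is a CloudWatch alarm state change wrapped in an SNS notification. The function name `process_sns_notification`, the rule that only the `ALARM` state counts as an alert, the deduplication key scheme, and the console link format are all illustrative assumptions.

```python
import hashlib
import json
from urllib.parse import quote


def process_sns_notification(body: str) -> dict:
    """Parse, extract, classify, and enrich one SNS-delivered CloudWatch alarm."""
    # 1. Parse: the SNS envelope carries the alarm as a JSON string in "Message".
    envelope = json.loads(body)
    alarm = json.loads(envelope["Message"])

    # 2. Extract: arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    _, _, _, region, account, _, name = alarm["AlarmArn"].split(":", 6)

    # 3. Classify (assumption: only the ALARM state is treated as an alert).
    kind = "alert" if alarm["NewStateValue"] == "ALARM" else "notification"

    # 4. Deduplicate: events with the same key can be aggregated (assumed scheme).
    dedup_key = hashlib.sha256(f"{account}/{region}/{name}".encode()).hexdigest()

    # 5. Link to the AWS Management Console (assumed URL format).
    link = (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#alarmsV2:alarm/{quote(name)}"
    )
    return {
        "account": account,
        "region": region,
        "name": name,
        "kind": kind,
        "dedup_key": dedup_key,
        "link": link,
    }
```

Step 6, starting the escalation chain, is deliberately left out because it depends on the chat platform.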

We learned the hard way that external dependencies (e.g., the Slack API) are unreliable. Therefore, our processing happens asynchronously and in an idempotent way. We can keep receiving all events from our customers while we retry until the external dependency becomes available again, so each alert is delivered eventually. marbot’s architecture is designed never to lose alerts from our customers.
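To illustrate what retrying in an idempotent way can look like, here is a minimal Python sketch. It is not marbot's implementation: `DeliveryLog` is a hypothetical in-memory stand-in for a durable store of delivered event IDs, and `send` is any callable that raises `ConnectionError` while the external API is unavailable.

```python
import time
from typing import Callable


class DeliveryLog:
    """Hypothetical in-memory stand-in for a durable store of delivered event IDs."""

    def __init__(self) -> None:
        self._delivered = set()

    def already_delivered(self, event_id: str) -> bool:
        return event_id in self._delivered

    def mark_delivered(self, event_id: str) -> None:
        self._delivered.add(event_id)


def deliver_with_retry(
    event_id: str,
    send: Callable[[], None],
    log: DeliveryLog,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deliver one event; processing the same event again is harmless."""
    if log.already_delivered(event_id):
        return True  # duplicate: skip instead of sending a second message
    for attempt in range(max_attempts):
        try:
            send()
        except ConnectionError:
            sleep(base_delay * 2 ** attempt)  # exponential backoff before the next try
            continue
        log.mark_delivered(event_id)
        return True
    return False  # still failing: the caller keeps the event and retries later
```

Note that a crash between `send()` and `mark_delivered()` could still produce a duplicate message, so this sketch gives at-least-once delivery with best-effort deduplication.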

With the introduction of a new external dependency, we asked one question: Will this lower our availability? If our processing logic relies on Slack and Microsoft Teams, the availability of marbot will decrease. Let’s look at one example:

Availability = Kinesis Data Stream Availability x Lambda Availability x DynamoDB Availability x Slack Availability x Microsoft Teams Availability

We can simplify and concentrate on the non-AWS dependencies. Before:

Availability = Slack Availability
0.99 = 0.99

After adding Microsoft Teams:

Availability = Slack Availability x Microsoft Teams Availability
0.9801 = 0.99 x 0.99
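The arithmetic can be reproduced with a few lines of Python. This is a small sketch: the helper name `serial_availability` is made up for illustration, and 0.99 is the example value from above, not a measured availability of Slack or Microsoft Teams.

```python
from math import prod

HOURS_PER_YEAR = 8760


def serial_availability(*availabilities: float) -> float:
    """Availability of a system that needs ALL of its dependencies to be up."""
    return prod(availabilities)


before = serial_availability(0.99)        # Slack only
after = serial_availability(0.99, 0.99)   # Slack and Microsoft Teams

print(f"{before:.4f}")  # 0.9900
print(f"{after:.4f}")   # 0.9801

# Expected downtime per year under these example numbers.
print(f"{(1 - before) * HOURS_PER_YEAR:.1f} h")  # 87.6 h
print(f"{(1 - after) * HOURS_PER_YEAR:.1f} h")   # 174.3 h
```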

As you can see, the availability is reduced if we add a dependency. To avoid this problem, we need to isolate the processing of Slack and Microsoft Teams customers. If Slack is down, we don’t want to interrupt our Microsoft Teams customers and vice versa.

We already had one Kinesis stream. We decided to use the existing stream only for Slack customers and added a second Kinesis stream for Microsoft Teams customers. We changed the existing code to route messages to the appropriate stream. With this design, our Microsoft Teams customers will not notice if Slack is down! And if our new Microsoft Teams code has bugs, we will not impact our happy Slack customers.
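The routing step could look roughly like the following Python sketch. The stream names, the `platform` and `endpoint_id` event fields, and the function `route_event` are hypothetical and chosen only for illustration. The `put_record` callable is injected so the sketch runs without AWS credentials.

```python
import json
from typing import Callable

# Hypothetical stream names: one stream per chat platform.
STREAMS = {
    "slack": "marbot-slack",
    "teams": "marbot-teams",
}


def route_event(event: dict, put_record: Callable[..., object]) -> str:
    """Write the event to the stream that belongs to the customer's platform."""
    stream = STREAMS[event["platform"]]  # KeyError for an unknown platform
    put_record(
        StreamName=stream,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["endpoint_id"],  # assumed field, keeps one customer's events ordered
    )
    return stream
```

With boto3, the call could be `route_event(event, boto3.client("kinesis").put_record)`. Because each stream has its own consumer, a backlog caused by one platform's API outage stays in that platform's stream.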

Besides our logic that runs behind Kinesis streams, we also have logic that runs behind a DynamoDB stream. This logic does not impact the availability of marbot. Therefore, we have not introduced an isolation layer there.

We also have a bunch of Step Functions state machines to implement:

  • Escalation chains.
  • Onboarding functionality to welcome new users.
  • Cleanup tasks to delete data if marbot is uninstalled.

The cool thing about Step Functions is that each execution is isolated by default. Therefore, we use some of the state machines for both Slack and Microsoft Teams customers. But we also have state machines that are needed only in the context of Slack or Microsoft Teams.
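As a rough illustration of one execution per alert, here is a minimal Python sketch. The function `start_escalation`, the `id` field of the alert, and the execution naming scheme are assumptions, not marbot's code. The `start_execution` callable is injected so the sketch runs without AWS access.

```python
import json
from typing import Callable


def start_escalation(
    alert: dict,
    state_machine_arn: str,
    start_execution: Callable[..., dict],
) -> str:
    """Start one state machine execution for a single alert."""
    response = start_execution(
        stateMachineArn=state_machine_arn,
        name=f"escalation-{alert['id']}",  # assumed naming scheme: one execution per alert
        input=json.dumps(alert),
    )
    return response["executionArn"]
```

With boto3, `start_execution` could be `boto3.client("stepfunctions").start_execution`. A failing or slow execution for one alert does not block the executions of other alerts, regardless of the platform they target.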

Summary

If you add dependencies to your architecture, you likely decrease your system’s availability. If possible, isolate the dependencies so that they do not lower the availability of your system.

Michael Wittig

I’m an independent consultant, technical writer, and programming founder. All these activities have to do with AWS. I’m writing this blog and all other projects together with my brother Andreas.

In 2009, we joined the same company as software developers. Three years later, we were looking for a way to deploy our software—an online banking platform—in an agile way. We got excited about the possibilities in the cloud and the DevOps movement. It’s no wonder we ended up migrating the whole infrastructure of Tullius Walden Bank to AWS. This was a first in the finance industry, at least in Germany! Since 2015, we have accelerated the cloud journeys of startups, mid-sized companies, and enterprises. We have penned books like Amazon Web Services in Action and Rapid Docker on AWS, we regularly update our blog, and we are contributing to the Open Source community. Besides running a 2-headed consultancy, we are entrepreneurs building Software-as-a-Service products.

We are available for projects.

You can contact me via Email, Twitter, and LinkedIn.
