Resilient event-driven Serverless architectures: Isolate your dependencies
Most systems grow over time. Dependencies are added, and availability suffers. How can we design an event-driven serverless architecture that stays resilient if we add dependencies? In this blog post, I walk you through a real-world application that isolates dependencies using Kinesis data streams.
Early this year, Andreas and I decided to research how we can integrate Microsoft Teams with our ChatOps product marbot to increase our customer base. At the end of April, we launched a private beta with a couple of early adopters. On June 19, we launched marbot for Microsoft Teams.
I shared the first version of marbot’s architecture three years ago, followed by my Evolutionary Serverless Architecture talk in 2019. Today, I provide an update on the marbot architecture and show you how we added Microsoft Teams support without disrupting our Slack customers.
The following figure shows the current marbot architecture. I explain the details below, starting from the bottom left.
Most of our customers use marbot to connect to over 50 AWS sources to get notified if things go wrong. Depending on the AWS service, an EventBridge Rule, CloudWatch Alarm, or more service-specific features are used to subscribe to the events. All events are published to an SNS topic and forwarded to marbot via HTTPS.
If marbot receives an event, we:
- Parse the event.
- Extract important information (AWS account, region, instance id, etc.).
- Classify the event (alert or notification).
- Check if similar events are available to aggregate them (deduplication).
- Generate links to the AWS Management Console.
- Start an escalation chain and send out the first message to Slack or Microsoft Teams.
We learned the hard way that external dependencies (e.g., the Slack API) are unreliable. Therefore, our processing happens asynchronously and in an idempotent way. This allows us to keep receiving all events from our customers while we retry delivery until the external dependency is available again, so each alert is delivered eventually. marbot’s architecture is designed never to lose alerts from our customers.
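To make the idea of idempotent processing with retries more concrete, here is a minimal sketch in Python. It is not marbot's actual code: the `process_batch` function, the `DeliveryError` exception, the `id` field, and the in-memory `delivered` set are all assumptions made for illustration (a real system would likely keep the delivery state in a durable store). The point is that a batch can be replayed as often as needed without sending an alert twice.

```python
from typing import Callable, List, Set


class DeliveryError(Exception):
    """Hypothetical error raised when the external chat API is unavailable."""


def process_batch(
    events: List[dict],
    delivered: Set[str],
    send: Callable[[dict], None],
) -> List[dict]:
    """Try to deliver every event once; return the events that need a retry."""
    pending = []
    for event in events:
        event_id = event["id"]  # assumed: a unique id per alert
        if event_id in delivered:
            continue  # delivered on an earlier attempt, so replaying is harmless
        try:
            send(event)
        except DeliveryError:
            pending.append(event)  # keep the event and retry it later
            continue
        delivered.add(event_id)  # remember the delivery only after it succeeded
    return pending
```

Calling `process_batch` again with the same events and the same `delivered` set only sends what is still missing, which is what makes blind retries safe.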
With the introduction of a new external dependency, we asked one question: Will this lower our availability? If our processing logic relies on Slack and Microsoft Teams, the availability of marbot will decrease. Let’s look at one example:
Availability = Kinesis Data Stream Availability x Lambda Availability x DynamoDB Availability x Slack Availability x Microsoft Teams Availability
We can simplify and concentrate on the non-AWS dependencies. Before:
Availability = Slack Availability
After adding Microsoft Teams:
Availability = Slack Availability x Microsoft Teams Availability
As you can see, the availability is reduced if we add a dependency. To avoid this problem, we need to isolate the processing of Slack and Microsoft Teams customers. If Slack is down, we don’t want to interrupt our Microsoft Teams customers and vice versa.
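A quick calculation illustrates the effect. The availability figures below are made-up assumptions, not measured values for Slack or Microsoft Teams, and `serial_availability` is a hypothetical helper that simply multiplies the availabilities of dependencies that must all be up at the same time.

```python
def serial_availability(*dependencies: float) -> float:
    """Availability of a system that needs ALL given dependencies to be up."""
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result


slack = 0.999  # assumed value, for illustration only
teams = 0.999  # assumed value, for illustration only

# Coupled: every customer depends on both APIs.
coupled = serial_availability(slack, teams)  # about 0.998

# Isolated: each customer group depends only on its own API.
slack_customers = serial_availability(slack)
teams_customers = serial_availability(teams)

print(f"coupled:  {coupled:.6f}")
print(f"isolated: {slack_customers:.6f} (Slack), {teams_customers:.6f} (Teams)")
```

With these assumed numbers, coupling the two dependencies roughly doubles the expected downtime for every customer, while isolation leaves each group exactly where it was.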
We already had one Kinesis stream. We decided to dedicate the existing stream to Slack customers and added a second Kinesis stream for Microsoft Teams customers. We changed the existing code to route each message to the appropriate stream. With this design, our Microsoft Teams customers will not notice if Slack is down! And if our new Microsoft Teams code has bugs, it will not impact our happy Slack customers.
Besides our logic that runs behind Kinesis streams, we also have logic that runs behind a DynamoDB stream. This logic does not impact the availability of marbot. Therefore, we have not introduced an isolation layer there.
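The routing step could look roughly like the following sketch. The stream names, the `platform` and `team_id` fields, and the `route` function are hypothetical; the post does not show marbot's real code. The Kinesis call is injected as `put_record` so the routing logic stays independent of any AWS SDK.

```python
import json
from typing import Callable

# Hypothetical stream names, one stream per external dependency.
STREAMS = {
    "slack": "marbot-slack",
    "teams": "marbot-teams",
}


def stream_for(platform: str) -> str:
    """Pick the stream that isolates the given chat platform."""
    try:
        return STREAMS[platform]
    except KeyError:
        raise ValueError(f"unknown platform: {platform!r}") from None


def route(event: dict, put_record: Callable[[str, bytes, str], None]) -> str:
    """Send the event to its platform's stream and return the stream name."""
    stream = stream_for(event["platform"])  # assumed field: "slack" or "teams"
    data = json.dumps(event).encode("utf-8")
    partition_key = event["team_id"]  # assumed field: keeps a team's events together
    put_record(stream, data, partition_key)
    return stream
```

In a real deployment, `put_record` could be a thin wrapper around the Kinesis `put_record` call of the AWS SDK (for example `StreamName`, `Data`, and `PartitionKey` in boto3). Because each stream has its own consumer, a stalled Slack consumer only blocks the Slack stream.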
We also have a bunch of Step Functions state machines to implement:
- Escalation chains.
- Onboarding functionality to welcome new users.
- Cleanup tasks to delete data if marbot is uninstalled.
The cool thing about Step Functions is that each execution is isolated by default. Therefore, we share some of the state machines between Slack and Microsoft Teams customers. But we also have state machines that are needed only in the context of Slack or only in the context of Microsoft Teams.
Summary
If you add dependencies to your architecture, you likely decrease your system’s availability. Where possible, isolate the dependencies from each other so that an outage of one does not lower the availability of the rest of your system.