Lessons learned: Serverless Chatbot architecture for marbot

Michael Wittig – 30 Jun 2017

marbot forwards alerts from AWS to your DevOps team via Slack. marbot was one of the winners of the AWS Serverless Chatbot Competition in 2016. Today I want to show you how marbot works and what we learned so far.

Let’s start with the architecture diagram.

The diagram was created with Cloudcraft - Visualize your cloud architecture like a pro.

Architecture

The marbot API is provided by an API Gateway. We get most of your requests from:

Amazon SNS: Our users use SNS topics with an HTTPS subscription to transport alerts, notifications, and incidents from within their AWS accounts to marbot.
Non-AWS sources that make HTTPS calls like New Relic, Uptime Robot, or just curl clients.
Slack: marbot uses the Slack Events API to drive conversations with his users and Slack Buttons to allow users to acknowledge, close, or pass alerts.

The API Gateway forwards HTTP requests to one of our Lambda functions. All of them are implemented in Node.js and store their state in DynamoDB tables.

One special case is the Slack Button API. When you press a button in a Slack message, marbot has 3 seconds to respond to this message. To respond to a button press, marbot may need to make a bunch of calls to the Slack API.

Learnings

Decoupling the process

We learned that we miss the 2-second timeout very often by looking at our CloudWatch data. To not miss the 2-second timeout, we now only put a record into a Kinesis stream that contains all relevant data before we respond to the API request. Writing to Kinesis is a quick operation, and we haven’t seen 2-second timeouts since we switched to Kinesis streams.

As soon as possible we read the Kinesis stream and process the records within a Lambda function. Kinesis comes with its challenges. If you fail to process a record, the Lambda Kinesis integration will retry this record as long as the record is deleted from the stream. All the newer records will not be processed until the failed record is deleted or you fix the bug!

We also thought about using SQS, but:

there is no native SQS Lambda integration
we can not build one on our own that is serverless and responds within a second

So we decided to use Kinesis knowing that an error can stop our whole processing pipeline.

Resilient remote calls

HTTP requests are hard. A lot of things can go wrong. Two things that we learned early when talking to the Slack API:

Set timeouts: We use 3 seconds at the moment and think about reducing this to 2 seconds
Retry on failures like timeouts or 5XX responses.

Our Node.js implementation of Slack API calls relies on the requestretry package:

const requestretry = require('requestretry');
const AWSXRay = require('aws-xray-sdk');

function invokeSlack(method, qs, cb) {
  requestretry({
    method: 'GET',
    url: `https://slack.com/api/${method}`,
    qs: qs,
    json: true,
    maxAttempts: 3, // retry only 3 times
    retryDelay: 100, // wait 0.1 seconds between two retries
    timeout: 3000, // timeout after 3 seconds
    httpModules: {
      'http:': AWSXRay.captureHTTPs(require('http')), // enable X-Ray tracing for http calls
      'https:': AWSXRay.captureHTTPs(require('https')) // enable X-Ray tracing for https calls
    }
  }, function(err, res, body) { /* ... */ });
}

The following screenshot shows a X-Ray trace where the code retried Slack API calls because of the 3 seconds timeout.

X-Ray trace

Implementing timers on AWS

For every alert that arrives in marbot, we keep a timer. 5 minutes after the alert is received we check if someone acknowledged the alert. If not, we escalate the alert to another engineer or the whole team. We have decided to use SQS queues for that. If you send a message to an SQS queue, you can set a delay. Only after the delay, the message becomes visible in the queue. Exactly what we need! The only downside to this solution is that there is no native way to connect Lambda and SQS. But with a few lines of code, you can implement this on your own.

Keeping secrets secure

We use git to version our source code. To communicate with the Slack API, we need to store a secret that we use to authenticate with Slack. We keep those secrets in a JSON file that is added to git as well. But we encrypt the whole file with KMS before we put it into git with the AWS CLI:

aws kms encrypt --key-id XXX --plaintext fileb://config_plain.json --output text --query CiphertextBlob | base64 --decode > config.json

Make sure to put config_plain.json into your .gitignore file!

Outside of the Lambda handler code, we use this code snippet to decrypt the configuration:

const fs = require('fs');
const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const kms = new AWS.KMS({apiVersion: '2014-11-01'});

const config = new Promise(function(resolve, reject) {
  fs.readFile(`config.json`, function(err, data) {
    if (err) {
      reject(err);
    } else {
      kms.decrypt({CiphertextBlob: data}, function(err, data) {
        if (err) {
          reject(err);
        } else {
          try {
            resolve(JSON.parse(new Buffer(data.Plaintext, 'base64')));
          } catch (err) {
            reject(err);
          }
        }
      });
    } 
  });
});

Inside the Lambda handler code, you can access the config like this:

config
  .then(function(c) {
    // do something
  })
  .catch(function(err) {
    // handle error
  });

Using this approach, you will only make one API call to KMS (for every Lambda runtime).

Getting insights

We use custom CloudWatch metrics to get insights into:

How many Slack teams installed marbot
Number of alerts and escalations created

We use a CloudWatch Dashboard to display those business metrics together with some technical metrics.

marbot dashboard

Deploying the infrastructure

Our pipeline for deploying marbot works like this:

Download dependencies (npm install)
Lint code
Run unit tests (we mock all external HTTP calls with nock
cloudformation package
cloudformation deploy to an integration stack
Run integration tests with newman
cloudformation deploy to a prod stack

Jenkins runs the pipeline. Since our code is hosted on BitBucket, we can not easily use CodePipeline at the moment.

Michael Wittig

I’ve been building on AWS since 2012 together with my brother Andreas. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, attachmentAV, HyperEnv, and marbot.

Here are the contact options for feedback and questions.