Serverless Big Data pipeline on AWS

Andreas Wittig – 14 Jul 2016

Lambda is a powerful tool when integrating different services on AWS. During the last months, I’ve successfully used serverless architectures to build Big Data pipelines. And I’d like to share my learnings with you.

The benefits of a serverless pipeline are:

  • No need to manage a fleet of EC2 instances.
  • Highly scalable.
  • You only pay per execution.

A Big Data pipeline moves data between data sources and data targets, which is often called an ETL (extract, transform, load) process as well. The following figure describes a typical serverless Big Data pipeline:

Big Data Pipeline with AWS Lambda

Use Cases

Using Lambda to implement your Big Data pipeline is especially useful if you need to transform or filter data while moving it from the data source to the data target.

Typical use cases:

  • Load CloudFront and ELB logs from S3, transform and filter the data, and insert it into an Elasticsearch cluster.
  • Load business reports from S3, transform and filter the data, and insert it into Redshift.
  • Load event data from a Kinesis stream, transform and filter the data, and store it on S3 for further processing.

Other use cases are possible as well. A Lambda function can be triggered by changed data (S3 and DynamoDB), by external events, or on a schedule (CloudWatch Event Rule). It can access data sources and targets that are reachable via the Internet or from within a VPC.
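
The following sketch shows what such a trigger looks like from inside a Lambda function. It assumes the Node.js 4.3 runtime, reads bucket and key from the standard S3 event structure, and leaves the actual transformation as a placeholder:

// Minimal sketch of a Lambda handler triggered by an S3 event.
exports.handler = function(event, context, callback) {
  // an S3 event can contain multiple records
  event.Records.forEach(function(record) {
    var bucket = record.s3.bucket.name;
    var key = record.s3.object.key;
    console.log("object changed: s3://" + bucket + "/" + key);
    // TODO load the object, transform and filter, write to the data target
  });
  callback(null, "done");
};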


Seems like there are almost no limits?

Limitations

Lambda is a powerful tool, but compared to an EC2 instance it comes with limitations as well. The relevant limits when building a serverless Big Data pipeline are:

  • Maximum execution duration: 300 seconds
  • Maximum memory: 1536 MB
  • Ephemeral disk capacity: 512 MB
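
The execution duration is typically the limit you hit first. From within the handler, the remaining time is available via context.getRemainingTimeInMillis(), so a function can stop cleanly before Lambda terminates it. A minimal sketch, with an arbitrary 10 second threshold:

// Sketch: keeping an eye on the 300 second limit from within the handler.
// context.getRemainingTimeInMillis() returns the milliseconds left before
// Lambda terminates the invocation.
exports.handler = function(event, context, callback) {
  if (context.getRemainingTimeInMillis() < 10000) { // less than 10 seconds left
    // stop early instead of being terminated in the middle of a chunk
    return callback(new Error("not enough time left to process the next chunk"));
  }
  // ... process the next chunk of data ...
  callback(null, "done");
};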

Real world example:

  1. Load CSV file from S3.
  2. Unzip data.
  3. Transform data.
  4. Zip data.
  5. Upload to S3.

The CSV file contains about 800 MB of unzipped data. Implementing the Lambda function as a sequence of these steps is not possible, as there is neither enough memory nor disk capacity to hold the unzipped data and the transformed data at once.

Solution: Data Streaming

Using streaming instead of linear execution allows you to extract, transform, and load data in chunks from the beginning to the end of the pipeline.

The following Node.js source code shows an example implementation of the described scenario as a streaming pipeline:

  1. Load csv.tgz file from S3.
  2. Unzip data.
  3. Split at the end of the line.
  4. Transform data.
  5. Zip data.
  6. Upload file to S3.

var AWS = require("aws-sdk");
var zlib = require("zlib");
var split = require("split");
var transform = require("stream-transform");

var sourceBucket = "BUCKET_NAME";
var sourceKey = "KEY";
var targetBucket = "BUCKET_NAME";
var targetKey = "KEY";

var s3 = new AWS.S3();

var transformer = transform(function(record, callback) {
  // TODO transform
  callback(null, record);
});

var pipeline = s3.getObject({ // (1)
  Bucket: sourceBucket,
  Key: sourceKey
})
  .createReadStream()
  .pipe(zlib.createGunzip()) // (2)
  .pipe(split()) // (3)
  .pipe(transformer) // (4)
  .pipe(zlib.createGzip()); // (5)

// (6)
s3.upload({"Bucket": targetBucket, "Key": targetKey, "Body": pipeline}, function(err) {
  if (err) {
    console.error(err);
  }
});
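
The snippet above runs at module load time. To package it as a Lambda function, the pipeline can be wrapped into a handler so the invocation only finishes once the upload is done. A minimal sketch, assuming the Node.js 4.3 runtime and reusing the requires and bucket variables from above:

// Sketch: wrapping the streaming pipeline in a Lambda handler.
// The transformer is created per invocation because a stream can only be used once.
exports.handler = function(event, context, callback) {
  var transformer = transform(function(record, done) {
    // TODO transform the record
    done(null, record);
  });

  var pipeline = s3.getObject({Bucket: sourceBucket, Key: sourceKey})
    .createReadStream()
    .pipe(zlib.createGunzip())
    .pipe(split())
    .pipe(transformer)
    .pipe(zlib.createGzip());

  s3.upload({Bucket: targetBucket, Key: targetKey, Body: pipeline}, function(err) {
    callback(err); // report success or failure back to Lambda
  });
};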

This approach allows you to process data without hitting the memory or disk space limitations. Of course, the maximum execution duration of 300 seconds still limits the maximum throughput of your serverless data pipeline. If you are hitting this limit, you need to split your data into smaller chunks, as sketched below.
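
One possible way to split the work, sketched here with placeholder bucket, prefix, and function names: list the source objects and trigger one asynchronous invocation per object, so every invocation only has to process a single, smaller file.

// Sketch: fan out the work across multiple invocations, one per S3 object.
var AWS = require("aws-sdk");
var s3 = new AWS.S3();
var lambda = new AWS.Lambda();

s3.listObjects({Bucket: "BUCKET_NAME", Prefix: "input/"}, function(err, data) {
  if (err) return console.error(err);
  data.Contents.forEach(function(object) {
    lambda.invoke({
      FunctionName: "transform-csv", // hypothetical function name
      InvocationType: "Event", // asynchronous invocation
      Payload: JSON.stringify({bucket: "BUCKET_NAME", key: object.Key})
    }, function(err) {
      if (err) console.error(err);
    });
  });
});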
