Introducing the real-time data store: Kinesis Streams
Kinesis is all about real-time data:
- Kinesis Streams are a temporary store for real-time data. Append new events or read all events in order.
- Kinesis Analytics helps you analyze data in real time.
- Kinesis Firehose ingests real-time data into data stores like S3, Elasticsearch or Redshift for batch analytics.
This blog post will introduce Kinesis Streams as the foundation of real-time data processing on AWS.
Imagine a timeline where you can add events at the current time, and you will have a good mental model of a Kinesis stream. The following animation will demonstrate the idea.
Arriving data is symbolized as yellow circles, and each item is called a Record. The current time is generated by the blue Kinesis Stream System. You can only insert records at the current time. You cannot add records in the past or the future. This implies that records are ordered by time (strictly speaking, within each shard of a stream, as explained further down).
If you want to read records from a stream you have three options:
- read from the beginning of the stream to the present
- read from a certain record (identified by a sequence number) to the present
- read from a certain time to the present
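The three options above correspond to the shard iterator types TRIM_HORIZON, AT_SEQUENCE_NUMBER and AT_TIMESTAMP of the GetShardIterator API. The following is only a minimal sketch of how the request parameters could be built; the helper name `iteratorParams` and the shape of its `start` argument are made up for this illustration.

```javascript
// Hypothetical helper: build GetShardIterator parameters for one of the
// three starting positions. Only the returned field names are part of the
// Kinesis API; the function and its arguments are illustrative.
function iteratorParams(streamName, shardId, start) {
  const base = { StreamName: streamName, ShardId: shardId };
  if (start && start.sequenceNumber !== undefined) {
    // Read from a certain record to the present.
    return Object.assign(base, {
      ShardIteratorType: 'AT_SEQUENCE_NUMBER',
      StartingSequenceNumber: start.sequenceNumber,
    });
  }
  if (start && start.timestamp !== undefined) {
    // Read from a certain time to the present.
    return Object.assign(base, {
      ShardIteratorType: 'AT_TIMESTAMP',
      Timestamp: start.timestamp,
    });
  }
  // Read from the oldest record that is still retained.
  return Object.assign(base, { ShardIteratorType: 'TRIM_HORIZON' });
}
```

Calling `iteratorParams('my-stream', 'shardId-000000000000', { timestamp: new Date() })` would produce parameters for reading everything that arrived from that moment on. Stream name and shard ID are placeholders.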
Depending on your use case, you may or may not be interested in historical records. A Kinesis stream supports both cases. But there is one limitation: Kinesis Streams can keep records for only 1 to 7 days (24 hours by default). After that, records are deleted. Kinesis is not long-term storage.
A single Kinesis Stream is composed of Shards. A single shard supports:
- 5 read operations per second (you read in batches)
- maximum total data read rate of 2 MB per second
- insertion of up to 1,000 records per second
- maximum total data write rate of 1 MB per second
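To get a feeling for these limits, here is a rough back-of-the-envelope sketch that estimates how many shards a workload needs. It only covers the three throughput limits, treats 1 MB as 1024 * 1024 bytes (an assumption), and the function and field names are invented for this example.

```javascript
// Per-shard limits taken from the list above.
const SHARD_LIMITS = {
  writeRecordsPerSecond: 1000,
  writeBytesPerSecond: 1024 * 1024,    // 1 MB (assumed to mean 1024 * 1024 bytes)
  readBytesPerSecond: 2 * 1024 * 1024, // 2 MB
};

// Hypothetical helper: the tightest of the three limits decides the shard count.
function requiredShards(load) {
  return Math.max(
    1,
    Math.ceil(load.recordsPerSecond / SHARD_LIMITS.writeRecordsPerSecond),
    Math.ceil(load.writeBytesPerSecond / SHARD_LIMITS.writeBytesPerSecond),
    Math.ceil(load.readBytesPerSecond / SHARD_LIMITS.readBytesPerSecond)
  );
}

// Example: 2,500 records per second of 1 KB each, read by two consumers.
const exampleLoad = {
  recordsPerSecond: 2500,
  writeBytesPerSecond: 2500 * 1024,
  readBytesPerSecond: 2 * 2500 * 1024,
};
console.log(requiredShards(exampleLoad)); // 3
```

The limit of 5 read operations per second is not part of this estimate, because it depends on how often each consumer polls rather than on the data volume.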
Records are assigned to shards based on a Partition Key. Records with the same partition key are always assigned to the same shard. The following animation will demonstrate the idea using two shards and representing partition keys with colors.
It’s important to know that there is no total order of records in the stream. The order is only guaranteed for records of the same shard. So if you want to track actions by many users, it makes sense to select the user id as the partition key. Otherwise, you will not be able to retrieve the records (actions) of a user in order.
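Kinesis maps a partition key to a 128-bit integer using an MD5 hash, and every shard owns a range of those hash values. The sketch below imitates that mapping to show why the same key always lands on the same shard. It assumes the hash key space is split evenly across the shards, which is not necessarily true for a real stream, and the function names are hypothetical.

```javascript
const crypto = require('crypto');

// MD5 of the partition key, interpreted as a 128-bit unsigned integer.
function hashKey(partitionKey) {
  const hex = crypto.createHash('md5').update(partitionKey, 'utf8').digest('hex');
  return BigInt('0x' + hex);
}

// Assumption: shardCount shards with equally sized hash key ranges.
// A real stream reports the actual ranges of its shards.
function shardIndexFor(partitionKey, shardCount) {
  const rangeSize = (2n ** 128n) / BigInt(shardCount);
  const index = hashKey(partitionKey) / rangeSize;
  // Clamp the upper edge when 2^128 is not evenly divisible by shardCount.
  const last = BigInt(shardCount - 1);
  return Number(index > last ? last : index);
}

// All actions of one user end up in the same shard.
console.log(shardIndexFor('user-42', 2) === shardIndexFor('user-42', 2)); // true
```

Because the mapping depends only on the partition key, using the user id as the key keeps all actions of a user in one shard and therefore in order.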
Kinesis Streams make it very simple to insert records. The application that inserts records is also called the Producer.
When you insert records, you don’t care about shards. You just provide the name of the stream, a partition key, and the payload. A simple Node.js example follows:
var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis();
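A fuller version of such a producer could look like the following sketch. It assumes the AWS SDK for JavaScript v2 (`aws-sdk`), whose `putRecord` call takes `StreamName`, `PartitionKey` and `Data`. The helper names, the stream name and the payload format are made up for illustration.

```javascript
// Hypothetical helper: the three things a producer has to provide.
function buildPutRecordParams(streamName, userId, action) {
  return {
    StreamName: streamName, // name of the stream
    PartitionKey: userId,   // records of the same user go to the same shard
    Data: JSON.stringify({  // the payload (format chosen for this example)
      userId: userId,
      action: action,
      createdAt: new Date().toISOString(),
    }),
  };
}

// `kinesis` is expected to be an AWS.Kinesis client (AWS SDK for JavaScript v2).
function putAction(kinesis, streamName, userId, action) {
  return kinesis.putRecord(buildPutRecordParams(streamName, userId, action))
    .promise()
    .then(function (result) {
      // Kinesis answers with the shard and sequence number of the new record.
      return { shardId: result.ShardId, sequenceNumber: result.SequenceNumber };
    });
}

// Usage (needs AWS credentials and an existing stream, both assumed here):
// const AWS = require('aws-sdk');
// putAction(new AWS.Kinesis(), 'my-stream', 'user-42', 'login').then(console.log);
```

Passing the client in as an argument is just a design choice for this sketch. It keeps the example testable without touching AWS.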
The application that reads records is also called the Consumer.
Reading records is trickier, because you can only read from a shard, not from a stream.
You need to create a Shard Iterator to read from a shard. The shard iterator points to the next record to read. So when you read records, you will get the records and a new shard iterator that you can use to read the next records from the shard. A simple Node.js example follows:
var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis();
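The iterator loop described above could be sketched roughly like this, again assuming the AWS SDK for JavaScript v2 with its `getShardIterator` and `getRecords` calls. The function name `readShard`, the batch limit and the pause between reads are choices made for this example only.

```javascript
// Read up to maxBatches batches from one shard, starting at the oldest record.
// `kinesis` is expected to be an AWS.Kinesis client (AWS SDK for JavaScript v2).
async function readShard(kinesis, streamName, shardId, maxBatches, pauseMs = 250) {
  const first = await kinesis.getShardIterator({
    StreamName: streamName,
    ShardId: shardId,
    ShardIteratorType: 'TRIM_HORIZON',
  }).promise();

  const payloads = [];
  let iterator = first.ShardIterator;
  for (let i = 0; i < maxBatches && iterator; i++) {
    const batch = await kinesis.getRecords({ ShardIterator: iterator, Limit: 100 }).promise();
    for (const record of batch.Records) {
      payloads.push(Buffer.from(record.Data).toString('utf8'));
    }
    // Every response carries the iterator for the next read.
    iterator = batch.NextShardIterator;
    // Pause between reads to stay below 5 read operations per second per shard.
    await new Promise(function (resolve) { setTimeout(resolve, pauseMs); });
  }
  return payloads;
}

// Usage (stream name and shard ID are placeholders):
// const AWS = require('aws-sdk');
// readShard(new AWS.Kinesis(), 'my-stream', 'shardId-000000000000', 10).then(console.log);
```

Note that this reads a single shard only. A real consumer has to do the same for every shard of the stream and remember how far it got.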
That sounds pretty complicated, right? And that’s why most people use either the Kinesis Client Library or a Lambda function to read from Kinesis Streams. If possible, I recommend using a Lambda function and connecting it to the Kinesis stream.
Kinesis Streams store real-time data for up to 7 days. A stream is composed of shards. Records are ordered only within a shard. When you insert records, you provide a partition key to group records together. When you read records, you read from a shard, not the whole stream. Shard iterators keep track of the next record to read.
Kinesis Streams is a managed service that is fault tolerant. It scales based on the number of shards you provision. You can change the number of shards at any time.
The pricing model is based on shard hours and insert record operations.