Amazon Kinesis Streams Key Concepts
As you get started with Amazon Kinesis Streams, you'll benefit from understanding its architecture and terminology.
Streams High-level Architecture
The following diagram illustrates the high-level architecture of Streams. The producers continually push data to Streams and the consumers process the data in real time. Consumers can store their results using an AWS service such as Amazon DynamoDB, Amazon Redshift, or Amazon S3.
Amazon Kinesis Streams
An Amazon Kinesis stream is an ordered sequence of data records. Each record in the stream has a sequence number that is assigned by Streams. The data records in the stream are distributed into shards.
A data record is the unit of data stored in an Amazon Kinesis stream. Data records are composed of a sequence number, partition key, and data blob, which is an immutable sequence of bytes. Streams does not inspect, interpret, or change the data in the blob in any way. A data blob can be up to 1 MB.
The length of time data records are accessible after they are added to the stream. A stream’s retention period is set to a default of 24 hours after creation. You can increase the retention period up to 168 hours (7 days) using the IncreaseStreamRetentionPeriod operation, and decrease the retention period down to a minimum of 24 hours using the DecreaseStreamRetentionPeriod operation. Additional charges apply for streams with a retention period set above 24 hours. For more information, see Amazon Kinesis Streams Pricing.
Producers put records into Amazon Kinesis Streams. For example, a web server sending log data to a stream is a producer.
Consumers get records from Amazon Kinesis Streams and process them. These consumers are known as Amazon Kinesis Streams Applications.
Amazon Kinesis Streams Applications
An Amazon Kinesis Streams application is a consumer of a stream that commonly runs on a fleet of EC2 instances.
You can develop an Amazon Kinesis Streams application using the Amazon Kinesis Client Library or using the Streams API.
The output of an Amazon Kinesis Streams application may be input for another stream, enabling you to create complex topologies that process data in real time. An application can also send data to a variety of other AWS services. There can be multiple applications for one stream, and each application can consume data from the stream independently and concurrently.
A shard is a uniquely identified group of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards.
If your data rate increases, then you just add more shards to increase the size of your stream. Similarly, you can remove shards if the data rate decreases.
A partition key is used to group data by shard within a stream. The Streams service segregates the data records belonging to a stream into multiple shards, using the partition key associated with each data record to determine which shard a given data record belongs to. Partition keys are Unicode strings with a maximum length limit of 256 bytes. An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards. A partition key is specified by the applications putting the data into a stream.
Each data record has a unique sequence number. The sequence number is assigned
by Streams after you write to the stream with
client.putRecord. Sequence numbers for the same partition key
generally increase over time; the longer the time period between write requests,
the larger the sequence numbers become.
Sequence numbers cannot be used as indexes to sets of data within the same stream. To logically separate sets of data, use partition keys or create a separate stream for each data set.
Amazon Kinesis Client Library
The Amazon Kinesis Client Library is compiled into your application to enable fault-tolerant consumption of data from the stream. The Amazon Kinesis Client Library ensures that for every shard there is a record processor running and processing that shard. The library also simplifies reading data from the stream. The Amazon Kinesis Client Library uses an Amazon DynamoDB table to store control data. It creates one table per application that is processing data.
The name of an Amazon Kinesis Streams application identifies the application. Each of your applications must have a unique name that is scoped to the AWS account and region used by the application. This name is used as a name for the control table in Amazon DynamoDB and the namespace for Amazon CloudWatch metrics.