
Tutorial: Visualizing Web Traffic Using Amazon Kinesis Streams

This tutorial helps you get started using Amazon Kinesis Streams by providing an introduction to key Kinesis Streams constructs: streams, data producers, and data consumers. The tutorial uses a sample application based on a common use case of real-time data analytics, as introduced in What Is Amazon Kinesis Streams?.

The web application for this sample uses JavaScript to poll the DynamoDB table that stores the results of the Top-N analysis over a sliding window. The application takes this data and creates a visualization of the results.

Kinesis Streams Data Visualization Sample Application

The data visualization sample application for this tutorial demonstrates how to use Kinesis Streams for real-time data ingestion and analysis. The sample application creates a data producer that puts simulated visitor counts from various URLs into a Kinesis stream. The stream durably stores these data records in the order they are received. The data consumer gets these records from the stream, and then calculates how many visitors originated from a particular URL. Finally, a simple web application polls the results in real time to provide a visualization of the calculations.

The sample application demonstrates the common stream processing use case of performing a sliding window analysis over a 10-second period. The data displayed in the visualization reflects the results of the sliding window analysis of the stream as a continuously updated graph. In addition, the data consumer performs Top-N analysis over the data stream to compute the top three referrers by count, which is displayed in the table immediately below the graph and updated every two seconds.

To get you started quickly, the sample application uses AWS CloudFormation. AWS CloudFormation allows you to create templates that describe the AWS resources, and any associated dependencies or runtime parameters, required to run your application. The sample application uses a template to create all the necessary resources, including producer and consumer applications running on an Amazon EC2 instance and a table in Amazon DynamoDB to store the aggregate record counts.

Note

After the sample application starts, it incurs nominal charges for Kinesis Streams usage. Where possible, the sample application uses resources that are eligible for the AWS Free Tier. When you are finished with this tutorial, delete your AWS resources to stop incurring charges. For more information, see Step 3: Delete Sample Application.

Prerequisites

This tutorial helps you set up, run, and view the results of the Kinesis Streams data visualization sample application. To get started with the sample application, you first need to do the following:

  • Set up a computer with a working Internet connection.

  • Sign up for an AWS account.

  • Additionally, read through the introductory sections to gain a high-level understanding of streams, data producers, and data consumers.

Step 1: Start the Sample Application

Start the sample application using an AWS CloudFormation template provided by AWS. The sample application has a stream writer that randomly generates records and sends them to a Kinesis stream, a data consumer that counts the number of HTTPS requests to a resource, and a web application that displays the outputs of the stream processing data as a continuously updated graph.

To start the application

  1. Open the AWS CloudFormation template for this tutorial.

  2. On the Select Template page, the URL for the template is provided. Choose Next.

  3. On the Specify Details page, note that the default instance type is t2.micro. However, T2 instances require a VPC. If your AWS account does not have a default VPC in the us-east-1 region, you must change InstanceType to another instance type, such as m3.medium. Choose Next.

  4. On the Options page, you can optionally type a tag key and tag value. This tag will be added to the resources created by the template, such as the EC2 instance. Choose Next.

  5. On the Review page, select I acknowledge that this template might cause AWS CloudFormation to create IAM resources, and then choose Create.

Initially, you should see a stack named KinesisDataVisSample with the status CREATE_IN_PROGRESS. The stack can take several minutes to create. When the status is CREATE_COMPLETE, continue to the next step. If the status does not update, refresh the page.

Step 2: View the Components of the Sample Application

Kinesis Stream

A stream has the ability to ingest data in real time from a large number of producers, durably store the data, and provide the data to multiple consumers. A stream represents an ordered sequence of data records. When you create a stream, you must specify a stream name and a shard count. A stream consists of one or more shards; each shard is a group of data records.

AWS CloudFormation automatically creates the stream for the sample application. This section of the AWS CloudFormation template shows the parameters used in the CreateStream operation.
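
The template snippet is not reproduced here, but CreateStream takes the same two required parameters the template supplies: a stream name and a shard count. As a point of reference, here is a minimal sketch of the equivalent call using the AWS SDK for Java; the stream name is illustrative, since the template generates a name of the form KinesisDataVisSampleApp-KinesisStream-[randomString].

    import com.amazonaws.services.kinesis.AmazonKinesisClient;

    public class CreateStreamSketch {
        public static void main(String[] args) {
            AmazonKinesisClient kinesis = new AmazonKinesisClient();

            // CreateStream requires a stream name and a shard count.
            // The sample's stream has two shards.
            kinesis.createStream("KinesisDataVisSampleStream", 2); // illustrative name
        }
    }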

To view the stack details

  1. Select the KinesisDataVisSample stack.

  2. On the Outputs tab, choose the link in URL. The form of the URL should be similar to the following: http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com.

  3. It takes about 10 minutes for the application stack to be created and for meaningful data to show up in the data analysis graph. The real-time data analysis graph appears on a separate page, titled Kinesis Streams Data Visualization Sample. It displays the number of requests sent by the referring URL over a 10-second span, and the chart is updated every second. The span of the graph is the last 2 minutes.

    
(Figure: Data visualization for the sample application)

To view the stream details

  1. Open the Kinesis console at https://console.aws.amazon.com/kinesis.

  2. Select the stream whose name has the following form: KinesisDataVisSampleApp-KinesisStream-[randomString].

  3. Choose the name of the stream to view the stream details.

  4. The graphs display the activity of the data producer putting records into the stream and the data consumer getting data from the stream.

    
(Figure: Stream monitoring graphs for the sample application)

Data Producer

A data producer submits data records to the Kinesis stream. To put data into the stream, producers call the PutRecord operation on a stream.

Each PutRecord call requires the stream name, partition key, and the data record that the producer is adding to the stream. The stream name determines the stream in which the record will reside. The partition key is used to determine the shard in the stream that the data record will be added to.

Which partition key you use depends on your application logic. In most cases, the number of partition keys should be much greater than the number of shards. A high number of partition keys relative to shards allows the stream to distribute data records evenly across the shards in your stream.

The data producer uses six popular URLs as the partition key for each record put into the two-shard stream. These URLs represent simulated page visits. Rows 99-136 of the HttpReferrerKinesisPutter code send the data to Kinesis Streams. The three required parameters are set before calling PutRecord. The partition key is set using pair.getResource, which randomly selects one of the six URLs created in rows 85-92 of the HttpReferrerStreamWriter code.
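
In outline, a PutRecord call made with the AWS SDK for Java looks like the following minimal sketch. The stream name, referrer URL, and payload here are illustrative values, not the sample's actual data.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import com.amazonaws.services.kinesis.AmazonKinesisClient;
    import com.amazonaws.services.kinesis.model.PutRecordRequest;

    public class PutRecordSketch {
        public static void main(String[] args) {
            AmazonKinesisClient kinesis = new AmazonKinesisClient();

            // The three required parameters: stream name, partition key, and data.
            // As in the sample, the referring URL serves as the partition key.
            PutRecordRequest request = new PutRecordRequest();
            request.setStreamName("KinesisDataVisSampleStream");   // illustrative name
            request.setPartitionKey("http://www.example.com");     // simulated referrer
            request.setData(ByteBuffer.wrap(
                    "{\"referrer\":\"http://www.example.com\"}"
                            .getBytes(StandardCharsets.UTF_8)));

            kinesis.putRecord(request);
        }
    }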

A data producer can be anything that puts data into Kinesis Streams, such as an EC2 instance, client browser, or mobile device. The sample application uses a single EC2 instance for both its data producer and its data consumer, whereas most real-life scenarios would use separate EC2 instances for each component of the application. You can view EC2 instance data from the sample application by following the instructions below.

To view the instance data in the console

  1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

  2. On the navigation pane, choose Instances.

  3. Select the instance created for the sample application. If you aren't sure which instance this is, it has a security group with a name that starts with KinesisDataVisSample.

  4. On the Monitoring tab, you'll see the resource usage of the sample application's data producer and data consumer.

Data Consumer

Data consumers retrieve and process data records from shards in a Kinesis stream. Each consumer reads data from a particular shard. Consumers retrieve data from a shard using the GetShardIterator and GetRecords operations.

A shard iterator represents the position in the stream and shard from which the consumer will read. A consumer gets a shard iterator when it starts reading from a stream or changes the position from which it reads records from a stream. To get a shard iterator, you must provide a stream name, shard ID, and shard iterator type. The shard iterator type allows the consumer to specify where in the stream to start reading, such as the tip of the stream, where data is arriving in real time. The stream returns the records in a batch, whose size you can control using the optional limit parameter.
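
The following minimal sketch, again using the AWS SDK for Java, shows these two operations together: getting an iterator for one shard and then fetching a batch of records with an optional limit. The stream name and shard ID are illustrative.

    import java.util.List;

    import com.amazonaws.services.kinesis.AmazonKinesisClient;
    import com.amazonaws.services.kinesis.model.GetRecordsRequest;
    import com.amazonaws.services.kinesis.model.GetRecordsResult;
    import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
    import com.amazonaws.services.kinesis.model.Record;

    public class GetRecordsSketch {
        public static void main(String[] args) {
            AmazonKinesisClient kinesis = new AmazonKinesisClient();

            // Stream name, shard ID, and iterator type are required.
            // LATEST starts at the tip of the shard, where new data arrives.
            GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest();
            iteratorRequest.setStreamName("KinesisDataVisSampleStream"); // illustrative
            iteratorRequest.setShardId("shardId-000000000000");          // illustrative
            iteratorRequest.setShardIteratorType("LATEST");
            String shardIterator =
                    kinesis.getShardIterator(iteratorRequest).getShardIterator();

            // Records come back in a batch; the optional limit caps its size.
            GetRecordsRequest recordsRequest = new GetRecordsRequest();
            recordsRequest.setShardIterator(shardIterator);
            recordsRequest.setLimit(25);
            GetRecordsResult result = kinesis.getRecords(recordsRequest);

            List<Record> records = result.getRecords();
            // Continue reading the shard with result.getNextShardIterator().
        }
    }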

The data consumer creates a table in DynamoDB to maintain state information (such as checkpoints and worker-shard mapping) for the application. Each application has its own DynamoDB table.

The data consumer counts visitor requests from each particular URL over the last two seconds. This type of real-time application employs Top-N analysis over a sliding window. In this case, the Top-N are the top three pages by visitor requests and the sliding window is two seconds. This is a common processing pattern that is demonstrative of real-world data analyses using Kinesis Streams. The results of this calculation are persisted to a DynamoDB table.
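
Stripped of the sample's specifics, the core of such a Top-N computation can be sketched in a few lines of Java: tally the requests per referrer seen within the current window, then keep the three largest counts. The referrer values below are placeholders.

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;

    public class TopNSketch {
        public static void main(String[] args) {
            // Placeholder referrers observed within one sliding window.
            String[] window = {
                "http://a.example.com", "http://b.example.com",
                "http://a.example.com", "http://c.example.com"
            };

            // Tally requests per referrer for this window.
            Map<String, Long> counts = new HashMap<>();
            for (String referrer : window) {
                counts.merge(referrer, 1L, Long::sum);
            }

            // Keep the three largest counts: the "N" in Top-N.
            counts.entrySet().stream()
                  .sorted(Map.Entry.<String, Long>comparingByValue(
                          Comparator.reverseOrder()))
                  .limit(3)
                  .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
        }
    }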

To view the Amazon DynamoDB tables

  1. Open the DynamoDB console at https://console.aws.amazon.com/dynamodb/.

  2. On the navigation pane, select Tables.

  3. There are two tables created by the sample application:

    • KinesisDataVisSampleApp-KCLDynamoDBTable-[randomString]—Manages state information.

    • KinesisDataVisSampleApp-CountsDynamoDBTable-[randomString]—Persists the results of the Top-N analysis over a sliding window.

  4. Select the KinesisDataVisSampleApp-KCLDynamoDBTable-[randomString] table. There are two entries in the table (one for each of the stream's two shards), each indicating the particular shard (leaseKey), position in the stream (checkpoint), and the application reading the data (leaseOwner).

  5. Select the KinesisDataVisSampleApp-CountsDynamoDBTable-[randomString] table. You can see the aggregated visitor counts (referrerCounts) that the data consumer calculates as part of the sliding window analysis.

Kinesis Client Library (KCL)

Consumer applications can use the Kinesis Client Library (KCL) to simplify parallel processing of the stream. The KCL takes care of many of the complex tasks associated with distributed computing, such as load-balancing across multiple instances, responding to instance failures, checkpointing processed records, and reacting to resharding. The KCL enables you to focus on writing record processing logic.

The data consumer provides the KCL with the position in the stream from which it wants to read; in this case, the latest data arriving in the stream. The KCL uses this position to call GetShardIterator on behalf of the consumer. The consumer also tells the KCL what to do with the records it retrieves, using an important KCL interface called IRecordProcessor. The KCL calls GetRecords on behalf of the consumer and then processes those records as specified by IRecordProcessor. A minimal setup sketch follows the list below.

  • Rows 92-98 of the HttpReferrerCounterApplication sample code configure the KCL. This sets up the library with its initial configuration, such as setting the position of the stream from which to read data.

  • Rows 104-108 of the HttpReferrerCounterApplication sample code inform the KCL of what logic to use when processing records using an important client library component, IRecordProcessor.

  • Rows 186-203 of the CountingRecordProcessor sample code include the counting logic for the Top-N analysis using IRecordProcessor.
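
Putting these pieces together, the following is a minimal sketch of a KCL 1.x setup, assuming a hypothetical MyRecordProcessorFactory that supplies the application's IRecordProcessor instances; it is not the sample's actual configuration. The application name also names the KCL's DynamoDB state table, and the stream name is illustrative.

    import java.net.InetAddress;
    import java.util.UUID;

    import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorFactory;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

    public class KclSetupSketch {
        public static void main(String[] args) throws Exception {
            // Each worker needs a unique ID; host name plus a UUID is a common choice.
            String workerId = InetAddress.getLocalHost().getCanonicalHostName()
                    + ":" + UUID.randomUUID();

            KinesisClientLibConfiguration config = new KinesisClientLibConfiguration(
                    "KinesisDataVisSample",        // application (and state table) name
                    "KinesisDataVisSampleStream",  // illustrative stream name
                    new DefaultAWSCredentialsProviderChain(),
                    workerId)
                .withInitialPositionInStream(InitialPositionInStream.LATEST);

            // Hypothetical factory that creates the application's IRecordProcessor
            // instances, which hold the record-processing logic.
            IRecordProcessorFactory factory = new MyRecordProcessorFactory();

            // The worker calls GetShardIterator and GetRecords on the consumer's
            // behalf and hands the records to the processors.
            new Worker(factory, config).run();
        }
    }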

Step 3: Delete Sample Application

The sample application creates two shards, which incur shard usage charges while the application runs. To ensure that your AWS account does not continue to be billed, delete your AWS CloudFormation stack after you finish with the sample application.

To delete application resources

  1. Open the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation/.

  2. Select the stack.

  3. Choose Actions, Delete Stack.

  4. When prompted for confirmation, choose Yes, Delete.

The status changes to DELETE_IN_PROGRESS while AWS CloudFormation cleans up the resources associated with the sample application. When AWS CloudFormation is finished cleaning up the resources, it removes the stack from the list.

Step 4: Next Steps