Architecture Overview - Real-Time Analytics with Spark Streaming

Architecture Overview

Deploying this solution with the default parameters builds the following environment in the AWS Cloud.



Figure 1: Real-Time Analytics with Spark Streaming default architecture

The AWS CloudFormation template deploys Amazon Kinesis Data Streams, which includes Amazon DynamoDB for checkpointing; an Amazon Virtual Private Cloud (Amazon VPC) network with one public and one private subnet; a NAT gateway; a bastion host; an Amazon EMR cluster; and a VPC endpoint to an Amazon S3 bucket.

Amazon Kinesis Data Streams collects data from data sources and sends it through a NAT gateway to the Amazon EMR cluster. Amazon Kinesis Data Streams also includes the Kinesis Client Library (KCL), a pre-built library that helps you build Kinesis applications for reading and processing data from a Kinesis stream. The KCL uses a unique Amazon DynamoDB table for each application to keep track of that application's state. Because the KCL uses the name of the Amazon Kinesis Data Streams application to create the name of the table, each application name must be unique.
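
Because the checkpoint table takes its name from the application, two applications that share a name would also share state. As a minimal sketch, the following Python function checks a list of candidate application names for duplicates and for names that do not meet the DynamoDB table naming rules (3 to 255 characters; letters, digits, underscore, hyphen, and period). The function name and the sample names are hypothetical and are not part of the solution. The sketch assumes that all applications run in the same AWS account and Region.

```python
import re
from collections import Counter

# DynamoDB table names: 3-255 characters from a-z, A-Z, 0-9, '_', '-', '.'
_TABLE_NAME = re.compile(r"[A-Za-z0-9_.\-]{3,255}")


def check_application_names(names):
    """Report problems in candidate KCL application names.

    Assumption: every application runs in the same AWS account and Region,
    so two identical names would map to the same checkpoint table.
    """
    counts = Counter(names)
    # Names that appear more than once would collide on one table.
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    # Names that DynamoDB would reject as table names.
    invalid = sorted(n for n in counts if not _TABLE_NAME.fullmatch(n))
    return {"duplicates": duplicates, "invalid": invalid}


if __name__ == "__main__":
    # Hypothetical application names, for illustration only.
    candidates = ["clickstream-app", "orders-app", "clickstream-app", "a!"]
    print(check_application_names(candidates))
```

You could run a check like this before you launch a new application against the same account, so that a name clash is caught before the applications start sharing checkpoint state.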

The private subnet contains an Amazon EMR cluster with Apache Zeppelin. The public subnet contains a NAT gateway to connect Amazon Kinesis Data Streams to the Amazon EMR cluster, and a bastion host that provides SSH access to the Amazon EMR cluster.
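
The subnet layout described above can be sketched with the Python standard library. The example below carves one public and one private subnet out of a VPC address range and records which components sit in each. The CIDR blocks, the /24 subnet size, and the component labels are assumptions for illustration; the ranges that the template actually uses may differ.

```python
import ipaddress
from itertools import islice

# Hypothetical address range; the solution's actual CIDR blocks may differ.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")


def plan_subnets(vpc_cidr):
    """Carve one public and one private /24 subnet out of the VPC range."""
    # Take the first two /24 blocks of the VPC range.
    public, private = islice(vpc_cidr.subnets(new_prefix=24), 2)
    return {
        # Internet-facing components live in the public subnet.
        "public": {"cidr": public, "hosts": ["nat-gateway", "bastion-host"]},
        # The Amazon EMR cluster is reachable only through the bastion host.
        "private": {"cidr": private, "hosts": ["emr-cluster"]},
    }


if __name__ == "__main__":
    for name, subnet in plan_subnets(VPC_CIDR).items():
        print(name, subnet["cidr"], subnet["hosts"])
```

Keeping the Amazon EMR cluster in the private subnet means that SSH sessions must pass through the bastion host in the public subnet.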

The Real-Time Analytics solution is designed to allow you to use your own application, but it also includes a demo application that you can deploy for testing purposes. For more information, see Appendix A.

After the Spark Streaming application processes the data, it stores the data in an Amazon S3 bucket.
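
The solution does not prescribe how your application lays out its output in the bucket. As one possible approach, the sketch below builds a date-partitioned Amazon S3 prefix for each batch of processed data. The function name, the bucket name, and the `processed` base folder are hypothetical.

```python
from datetime import datetime, timezone


def output_prefix(bucket, batch_time, base="processed"):
    """Build a date-partitioned S3 URI prefix for one batch of results.

    batch_time must be a timezone-aware datetime; it is converted to UTC
    so that the layout does not depend on the cluster's local time zone.
    """
    if batch_time.tzinfo is None:
        raise ValueError("batch_time must be timezone-aware")
    utc_time = batch_time.astimezone(timezone.utc)
    # Year/month/day folders, then a per-batch folder named by time of day.
    partition = utc_time.strftime("%Y/%m/%d/%H%M%S")
    return f"s3://{bucket}/{base}/{partition}"


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    batch = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    print(output_prefix("example-bucket", batch))
```

A prefix like this could be passed to whatever write call your Spark Streaming application uses, so that each batch lands in its own folder and can later be queried by date.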