Overview - Real-Time Analytics with Spark Streaming

Overview

Many Amazon Web Services (AWS) customers use batch data reports to gain strategic insight into long-term business trends, and a growing number of customers also require streaming data to obtain actionable insights from their data in real time. Batch data is collected over a period of time and processed in batches, and this data can provide snapshots of trends that shape decision-making. Streaming data is generated continuously from thousands of data sources and it can help companies proactively respond to changing conditions.

A lambda architecture (not to be confused with the AWS Lambda service) is one way to implement real-time and batch data processing in a single framework. The lambda architecture divides processing into three layers: the batch layer, in which new data is appended to the master data set and stored as batch views; the serving layer, in which batch views are indexed; and the speed layer, in which real-time data views are produced, continuously updated, and stored for read/write operations.
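The three layers can be illustrated with a minimal in-memory sketch. It is a toy model only: the class and method names (LambdaArchitecture, ingest, run_batch, query) are hypothetical and are not part of this solution, which implements the layers with AWS services and Spark rather than Python objects. The sketch counts events per key and answers queries by merging the batch view with the real-time view.

```python
from collections import Counter


class LambdaArchitecture:
    """Toy model of the batch, serving, and speed layers (hypothetical names)."""

    def __init__(self):
        self.master_dataset = []         # batch layer: append-only raw events
        self.batch_view = Counter()      # serving layer: batch view indexed by key
        self.realtime_view = Counter()   # speed layer: counts since the last batch run

    def ingest(self, event_key):
        # New data is appended to the master data set and also
        # reflected immediately in the real-time view.
        self.master_dataset.append(event_key)
        self.realtime_view[event_key] += 1

    def run_batch(self):
        # Recompute the batch view from the full master data set.
        # The real-time view is reset because the new batch view
        # now covers every event ingested so far.
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, key):
        # A query merges the indexed batch view with the real-time view.
        return self.batch_view[key] + self.realtime_view[key]


arch = LambdaArchitecture()
for event in ["click", "view", "click"]:
    arch.ingest(event)
print(arch.query("click"))  # 2, served entirely from the real-time view

arch.run_batch()
arch.ingest("click")
print(arch.query("click"))  # 3 = 2 from the batch view + 1 from the real-time view
```

In this sketch the batch run recomputes its view from scratch, which keeps the example short; a production batch layer would typically process data at a scale where that recomputation runs on a cluster.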

AWS provides many of the building blocks required to build a secure, flexible, cost-effective lambda architecture in the cloud. These include Amazon Kinesis Data Streams, a platform for processing terabytes of streaming data; Amazon EMR, a service that distributes and processes data across dynamically scalable Amazon Elastic Compute Cloud (Amazon EC2) instances; and Amazon Simple Storage Service (Amazon S3), a secure and durable object store. Customers can combine these AWS services with Apache Spark Streaming, for fault-tolerant processing of live data streams, and Spark SQL, which allows Spark code to execute relational queries, to build a single architecture that processes both real-time and batch data. For more information on implementing a lambda architecture on AWS, see Lambda Architecture for Batch and Real-Time Processing on AWS.

The Real-Time Analytics with Spark Streaming solution is an AWS-provided reference implementation that automatically provisions and configures the AWS services necessary to start processing real-time and batch data in minutes. The solution is designed to work with customers’ Spark Streaming applications, and also includes a demo application and data producer to create an example environment. The solution also leverages Apache Zeppelin, a web-based notebook for interactive data analytics, to enable customers to visualize both their real-time and batch data.

Cost

You are responsible for the cost of the AWS services used while running this solution. As of the date of publication, the cost for running this solution with default settings in the US East (N. Virginia) Region is listed in Table 1. This includes charges for Amazon Kinesis Data Streams, Amazon EC2, and Amazon EMR. Prices are subject to change. For full details, see the pricing webpage for each AWS service you will be using in this solution.

AWS Service                   Resource Count and Type                               Total Cost/Hour
Amazon Kinesis Data Streams   2 shards                                              $0.03
Amazon EC2                    1 t2.medium instance (2 if using the demo template)   $0.05 ($0.10 if using the demo template)
NAT Gateways                  2 (one per Availability Zone)                         $0.09
Amazon EMR                    3 r3.xlarge instances                                 $0.76

Note

The solution creates a customer master key (CMK) in AWS Key Management Service (AWS KMS) that is used to encrypt data at rest. Automatic key rotation is enabled and occurs once a year; each new key version raises the cost of the CMK by $1/month.

This pricing does not reflect variable charges for Amazon S3, Amazon CloudWatch, AWS KMS, Amazon DynamoDB, or data transfer. For full details, see the pricing webpage for each AWS service you will be using in this solution.
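As a rough check on the fixed hourly charges, the rows of Table 1 can be summed. The sketch below copies the rates from Table 1 as published and uses hypothetical names (HOURLY_COSTS, hourly_total); it covers only the fixed hourly charges and excludes the variable charges and the AWS KMS key cost described above, so treat the result as an estimate under those assumptions rather than a bill.

```python
from decimal import Decimal

# Hourly rates copied from Table 1 (US East (N. Virginia), default settings).
# Decimal avoids binary floating-point rounding when adding currency amounts.
HOURLY_COSTS = {
    "Amazon Kinesis Data Streams": Decimal("0.03"),  # 2 shards
    "Amazon EC2": Decimal("0.05"),                   # 1 t2.medium instance
    "NAT Gateways": Decimal("0.09"),                 # 2, one per Availability Zone
    "Amazon EMR": Decimal("0.76"),                   # 3 r3.xlarge instances
}

# Table 1 lists $0.10/hour for Amazon EC2 when the demo template is used
# (2 t2.medium instances instead of 1).
EC2_DEMO_HOURLY = Decimal("0.10")


def hourly_total(use_demo_template=False):
    """Sum the fixed hourly charges listed in Table 1."""
    costs = dict(HOURLY_COSTS)
    if use_demo_template:
        costs["Amazon EC2"] = EC2_DEMO_HOURLY
    return sum(costs.values(), Decimal("0"))


print(hourly_total())                        # 0.93
print(hourly_total(use_demo_template=True))  # 0.98
```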

Apache Spark Streaming, Apache Spark SQL, and Apache Zeppelin are open source. There is no additional cost to use these tools.