Reference Architecture

Figure 4: Streaming Data Analytics Reference Architecture

  1. Data Producers: Multiple producers continuously generate data that can amount to terabytes per day. Producers can use Kinesis Agent, a standalone Java software application, to collect and send data to Amazon Kinesis Data Streams or Amazon Kinesis Data Firehose. The agent continuously monitors a set of files and sends new data to your delivery stream; it handles file rotation, checkpointing, and retry upon failure, and delivers all of your data in a reliable, timely, and simple manner. If your operating system is not compatible with Kinesis Agent, you can instead use the Kinesis Producer Library (KPL) to achieve high write throughput to a Kinesis data stream. The KPL is an easy-to-use, highly configurable library that acts as an intermediary between your producer application code and the Kinesis Data Streams API actions. Producers can also use Kafka producers to send messages to a Kafka cluster.
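
    As a sketch of the KPL write path, the following Java snippet puts one record on a stream; the stream name, Region, and payload are placeholders rather than values prescribed by this architecture.

      import com.amazonaws.services.kinesis.producer.KinesisProducer;
      import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;
      import java.nio.ByteBuffer;
      import java.nio.charset.StandardCharsets;

      public class KplProducerExample {
          public static void main(String[] args) {
              // Placeholder Region; credentials come from the default provider chain.
              KinesisProducerConfiguration config =
                      new KinesisProducerConfiguration().setRegion("us-east-1");
              KinesisProducer producer = new KinesisProducer(config);

              // The KPL batches, aggregates, and retries records behind this call.
              ByteBuffer data = ByteBuffer.wrap(
                      "{\"event\":\"click\"}".getBytes(StandardCharsets.UTF_8));
              producer.addUserRecord("example-stream", "partition-key-1", data);

              producer.flushSync();  // block until buffered records are sent
              producer.destroy();
          }
      }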

  2. Streaming Ingest: Kinesis Data Streams and Kinesis Data Firehose can ingest and process large streams of data records. Choose Kinesis Data Streams for rapid and continuous data intake and aggregation when ingested data must be available for processing in real time. Use Kinesis Data Firehose, a fully managed streaming service, to transform and deliver real-time streaming data to destinations such as Amazon S3, Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), and Splunk.
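
    For illustration, a producer can also write directly to a Firehose delivery stream with the AWS SDK for Java; the delivery stream name and payload below are placeholders.

      import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
      import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
      import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
      import com.amazonaws.services.kinesisfirehose.model.Record;
      import java.nio.ByteBuffer;

      public class FirehoseIngestExample {
          public static void main(String[] args) {
              AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();
              // Firehose buffers records and delivers them to the configured
              // destination (S3, Redshift, Amazon ES, or Splunk).
              firehose.putRecord(new PutRecordRequest()
                      .withDeliveryStreamName("example-delivery-stream")  // placeholder
                      .withRecord(new Record().withData(
                              ByteBuffer.wrap("{\"temp\":72}\n".getBytes()))));
          }
      }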

    If you use Apache Kafka, you can deploy the cluster on Amazon EC2 to provide a high-performance, scalable solution for ingesting streaming data. AWS offers many different instance type and storage option combinations for Kafka deployments. Alternatively, you can use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to build and run production applications on Apache Kafka without needing Apache Kafka infrastructure management expertise.
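
    A minimal Kafka producer looks the same whether the brokers run on Amazon EC2 or Amazon MSK; only the bootstrap broker string differs. The broker, topic, and message below are placeholders.

      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;
      import java.util.Properties;

      public class KafkaProducerExample {
          public static void main(String[] args) {
              Properties props = new Properties();
              // For Amazon MSK, use the cluster's bootstrap broker string.
              props.put("bootstrap.servers", "b-1.example.kafka.us-east-1.amazonaws.com:9092");
              props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
              props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

              try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                  producer.send(new ProducerRecord<>("example-topic", "sensor-1", "{\"temp\":72}"));
              }
          }
      }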

  3. Stream Processing: Real-time data streams can be processed sequentially and incrementally on a record-by-record basis over sliding time windows using a variety of services.

    For example, with Kinesis Data Analytics, you can process and analyze streaming data using standard SQL in a serverless way. The service enables you to quickly author and run SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics. You can then configure the SQL query results to fan out to external destinations such as Kinesis Data Firehose or Kinesis Data Streams.
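
    As a sketch, the application code below is ordinary Kinesis Data Analytics streaming SQL (a pump feeding a destination in-application stream with a one-minute tumbling-window count), created here through the AWS SDK for Java. The application name, column names, and window are illustrative, and the input and output mappings would be configured separately.

      import com.amazonaws.services.kinesisanalytics.AmazonKinesisAnalytics;
      import com.amazonaws.services.kinesisanalytics.AmazonKinesisAnalyticsClientBuilder;
      import com.amazonaws.services.kinesisanalytics.model.CreateApplicationRequest;

      public class AnalyticsAppExample {
          public static void main(String[] args) {
              String sql =
                  "CREATE OR REPLACE STREAM \"DESTINATION_SQL_STREAM\" "
                + "(ticker VARCHAR(8), trade_count INTEGER);\n"
                + "CREATE OR REPLACE PUMP \"STREAM_PUMP\" AS\n"
                + "  INSERT INTO \"DESTINATION_SQL_STREAM\"\n"
                + "  SELECT STREAM ticker, COUNT(*)\n"
                + "  FROM \"SOURCE_SQL_STREAM_001\"\n"
                + "  GROUP BY ticker,\n"
                + "    STEP(\"SOURCE_SQL_STREAM_001\".ROWTIME BY INTERVAL '1' MINUTE);";

              AmazonKinesisAnalytics client = AmazonKinesisAnalyticsClientBuilder.defaultClient();
              client.createApplication(new CreateApplicationRequest()
                      .withApplicationName("example-analytics-app")  // placeholder
                      .withApplicationCode(sql));
          }
      }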

    For data ingested using Kinesis Data Streams, you can develop a consumer application using the Kinesis Client Library (KCL). Although you can use the Kinesis Data Streams API to get data from a Kinesis data stream, we recommend using the design patterns and code for consumer applications provided by the KCL. You can use Kinesis Data Streams as a fully managed event source for Lambda functions.
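
    A KCL consumer centers on a record processor. The sketch below assumes KCL 1.x; a complete application also supplies a record processor factory, a KinesisClientLibConfiguration, and a Worker that drives the processors.

      import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
      import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
      import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
      import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;
      import com.amazonaws.services.kinesis.model.Record;
      import java.nio.charset.StandardCharsets;

      public class ExampleRecordProcessor implements IRecordProcessor {

          @Override
          public void initialize(InitializationInput input) {
              System.out.println("Starting shard " + input.getShardId());
          }

          @Override
          public void processRecords(ProcessRecordsInput input) {
              for (Record record : input.getRecords()) {
                  String payload = StandardCharsets.UTF_8.decode(record.getData()).toString();
                  System.out.println(record.getPartitionKey() + ": " + payload);
              }
              try {
                  input.getCheckpointer().checkpoint();  // record progress in the lease table
              } catch (Exception e) {
                  // A production processor distinguishes retryable from fatal errors.
              }
          }

          @Override
          public void shutdown(ShutdownInput input) { }
      }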

    Another popular way to process streaming data is with Kinesis Data Firehose. Kinesis Data Firehose can invoke your Lambda function to transform incoming source data and deliver the transformed data to destinations. You can enable Kinesis Data Firehose data transformation when you create your delivery stream.
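
    A transformation function receives a batch of records and must return, for each one, the original recordId, a result status (Ok, Dropped, or ProcessingFailed), and base64-encoded output data. The handler below is a sketch with a placeholder transform.

      import com.amazonaws.services.lambda.runtime.Context;
      import com.amazonaws.services.lambda.runtime.RequestHandler;
      import com.amazonaws.services.lambda.runtime.events.KinesisFirehoseEvent;
      import java.nio.charset.StandardCharsets;
      import java.util.ArrayList;
      import java.util.Base64;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      public class FirehoseTransform
              implements RequestHandler<KinesisFirehoseEvent, Map<String, Object>> {

          @Override
          public Map<String, Object> handleRequest(KinesisFirehoseEvent event, Context context) {
              List<Map<String, Object>> records = new ArrayList<>();
              for (KinesisFirehoseEvent.Record rec : event.getRecords()) {
                  String raw = StandardCharsets.UTF_8.decode(rec.getData()).toString();
                  String transformed = raw.toUpperCase() + "\n";  // placeholder transform

                  Map<String, Object> out = new HashMap<>();
                  out.put("recordId", rec.getRecordId());
                  out.put("result", "Ok");
                  out.put("data", Base64.getEncoder().encodeToString(
                          transformed.getBytes(StandardCharsets.UTF_8)));
                  records.add(out);
              }
              Map<String, Object> response = new HashMap<>();
              response.put("records", records);
              return response;
          }
      }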

    If you work in a Hadoop environment, you can process streaming data using multiple options: Spark Streaming, Apache Flink, or Structured Streaming. You can pair a streaming ingest solution, such as Kinesis Data Streams, Amazon MSK, or Apache Kafka running on Amazon EC2, with Apache Spark Streaming for fault-tolerant processing of live data streams, and use Spark SQL, which allows Spark code to execute relational queries, to build a single architecture that processes both real-time and batch data.
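
    For example, a Spark Streaming job can read a Kinesis data stream through the spark-streaming-kinesis-asl connector; the application name, stream, endpoint, Region, and batch interval below are placeholders.

      import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
      import org.apache.spark.SparkConf;
      import org.apache.spark.storage.StorageLevel;
      import org.apache.spark.streaming.Duration;
      import org.apache.spark.streaming.api.java.JavaDStream;
      import org.apache.spark.streaming.api.java.JavaStreamingContext;
      import org.apache.spark.streaming.kinesis.KinesisUtils;
      import java.nio.charset.StandardCharsets;

      public class SparkKinesisExample {
          public static void main(String[] args) throws InterruptedException {
              SparkConf conf = new SparkConf().setAppName("spark-kinesis-example");
              JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(10000));

              JavaDStream<byte[]> stream = KinesisUtils.createStream(
                      jssc, "spark-kinesis-example", "example-stream",
                      "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
                      InitialPositionInStream.LATEST, new Duration(10000),
                      StorageLevel.MEMORY_AND_DISK_2());

              // Decode each record and print the micro-batch to the driver log.
              stream.map(bytes -> new String(bytes, StandardCharsets.UTF_8)).print();

              jssc.start();
              jssc.awaitTermination();
          }
      }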

    Apache Flink is a streaming dataflow engine that you can use to run real-time stream processing on high-throughput data sources. Flink supports event time semantics for out-of-order events, exactly-once semantics, backpressure control, and APIs optimized for writing both streaming and batch applications. Additionally, Flink has connectors for third-party data sources, such as Amazon Kinesis, Apache Kafka, Amazon ES, Twitter Streaming API, and Cassandra. Amazon EMR supports Flink as a YARN application so that you can manage resources along with other applications within a cluster. Flink-on-YARN allows you to submit transient Flink jobs, or you can create a long-running cluster that accepts multiple jobs and allocates resources according to the overall YARN reservation.
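
    A minimal Flink job using the Kinesis connector might look like the following; the stream name and Region are placeholders.

      import org.apache.flink.api.common.serialization.SimpleStringSchema;
      import org.apache.flink.streaming.api.datastream.DataStream;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
      import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
      import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;
      import java.util.Properties;

      public class FlinkKinesisExample {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env =
                      StreamExecutionEnvironment.getExecutionEnvironment();

              Properties props = new Properties();
              props.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");  // placeholder
              props.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

              DataStream<String> stream = env.addSource(
                      new FlinkKinesisConsumer<>("example-stream",
                              new SimpleStringSchema(), props));

              stream.print();  // replace with real transformations and sinks
              env.execute("flink-kinesis-example");
          }
      }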

    With Apache Kafka as the streaming source, you can also use Spark Structured Streaming on Amazon EMR. Structured Streaming is a fault-tolerant stream processing engine built on the Spark SQL engine.
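
    A Structured Streaming job subscribes to a Kafka topic as an unbounded DataFrame; the broker and topic below are placeholders.

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;
      import org.apache.spark.sql.streaming.StreamingQuery;

      public class StructuredStreamingKafkaExample {
          public static void main(String[] args) throws Exception {
              SparkSession spark = SparkSession.builder()
                      .appName("structured-streaming-kafka")
                      .getOrCreate();

              Dataset<Row> messages = spark.readStream()
                      .format("kafka")
                      .option("kafka.bootstrap.servers", "b-1.example:9092")  // placeholder
                      .option("subscribe", "example-topic")                   // placeholder
                      .load()
                      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

              // Write each micro-batch to the console; swap in a real sink as needed.
              StreamingQuery query = messages.writeStream()
                      .format("console")
                      .outputMode("append")
                      .start();
              query.awaitTermination();
          }
      }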

  4. Alerts, messaging fan-out, and storage: Processed streaming data can be fed to a real-time predictive analytics system to derive inferences, which in turn can trigger alerts through Amazon SNS. Processed messages can also be fanned out to Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, or AWS Lambda to generate new streams for further processing. Further down the pipeline, applications can perform simple aggregations on the streaming data and emit the processed data to Amazon S3. Other patterns include storing the real-time data in Amazon Redshift for complex analytics, in DynamoDB for querying events, or in Amazon ES for full-text search.
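
    For the alerting leg, a processing application can publish an inference result to an SNS topic with the AWS SDK for Java; the topic ARN and message are placeholders, and subscribers (email, SMS, SQS, Lambda) receive the alert through their chosen protocol.

      import com.amazonaws.services.sns.AmazonSNS;
      import com.amazonaws.services.sns.AmazonSNSClientBuilder;

      public class AlertPublisher {
          public static void main(String[] args) {
              AmazonSNS sns = AmazonSNSClientBuilder.defaultClient();
              // Publishing fans the message out to every topic subscriber.
              sns.publish(
                      "arn:aws:sns:us-east-1:123456789012:example-alerts",  // placeholder ARN
                      "Anomaly detected: stream metric exceeded threshold");
          }
      }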

  5. Downstream analytics: Data processed on the fly using streaming technologies can be persisted to serve real-time analytics, machine learning, alerts, and additional custom actions.