Solution overview - Streaming Data Solution for Amazon MSK

Solution overview

Publication date: November 2020 (last update: February 2024)

The Streaming Data Solution for Amazon MSK allows you to capture, store, process, and deliver real-time streaming data. By automatically configuring the included AWS services, this solution helps you address real-time streaming use cases, for example:

  • Capture high-volume application log files

  • Analyze website clickstreams

  • Process database event streams

  • Track financial transactions

  • Aggregate social media feeds

  • Collect IT log files

  • Continuously deliver to a data lake

This solution helps accelerate your development lifecycle by minimizing or eliminating the need to model and provision resources with AWS CloudFormation, to set up Amazon CloudWatch alarms (preconfigured at recommended thresholds), dashboards, and logging, and to manually implement streaming data best practices. The solution is data and logic agnostic: you can start with boilerplate code and then customize it to your needs.

The solution uses templates where data flows through producers, streaming storage, consumers, and destinations. Producers continuously generate data and send it to streaming storage where it is durably captured and made available for processing by a data consumer. Data consumers process the data and then send it to a destination.
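The flow described above can be sketched with a small in-memory model. This is only an illustration of the producer, streaming storage, consumer, and destination roles, not code shipped with the solution. `StreamingStorage`, `produce`, and `consume` are hypothetical names, and a plain Python list stands in for an Apache Kafka topic.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class StreamingStorage:
    """Hypothetical append-only log standing in for a Kafka topic."""
    records: List[str] = field(default_factory=list)

    def append(self, record: str) -> int:
        # Store the record and return its offset in the log.
        self.records.append(record)
        return len(self.records) - 1

    def read_from(self, offset: int) -> List[str]:
        # Reading does not delete records, so they can be replayed later.
        return self.records[offset:]


def produce(storage: StreamingStorage, events: Iterable[str]) -> None:
    # Producer: sends each generated event to streaming storage.
    for event in events:
        storage.append(event)


def consume(
    storage: StreamingStorage,
    offset: int,
    transform: Callable[[str], str],
    destination: List[str],
) -> int:
    # Consumer: processes every record from `offset` onward, delivers the
    # result to the destination, and returns the next offset to read.
    batch = storage.read_from(offset)
    for record in batch:
        destination.append(transform(record))
    return offset + len(batch)


storage = StreamingStorage()
destination: List[str] = []

produce(storage, ["click:/home", "click:/cart"])
next_offset = consume(storage, 0, str.upper, destination)

produce(storage, ["click:/pay"])
next_offset = consume(storage, next_offset, str.upper, destination)
```

Because the consumer tracks its own offset, the storage layer keeps every record and a second consumer could read the same data independently from offset 0.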

To support multiple use cases and business needs, this solution offers four AWS CloudFormation templates. You can use this solution to test new service combinations as the basis for your production environment, and to improve existing applications.

  1. Option 1 creates a standalone Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster following best practices, such as sending broker logs to Amazon CloudWatch Logs; encryption at rest; encryption in transit among the broker nodes; and open monitoring with Prometheus activated.

  2. Option 2 adds an AWS Lambda function that processes records in an existing Apache Kafka topic as a starting example that you can modify and customize. The Lambda service internally polls for new records or messages from the event source, and then synchronously invokes the target Lambda function.

  3. Option 3 is intended for use cases where you must back up messages from a topic in Amazon MSK (for instance, to replay or analyze them). It uses Amazon Data Firehose (which compresses and encrypts the data, minimizing the amount of storage used at the destination and increasing security) and Amazon Simple Storage Service (Amazon S3).

  4. Option 4 showcases how to read data from an existing topic in Amazon MSK using Apache Flink, which provides exactly-once processing. It uses Amazon Managed Service for Apache Flink (a fully managed service that handles core capabilities like provisioning compute resources, parallel computation, automatic scaling, and application backups) and Amazon Simple Storage Service (Amazon S3).
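As a rough illustration of the kind of function Option 2 deploys, the following is a minimal sketch of a Python Lambda handler for an Amazon MSK event source. It is not the function included in the template. It assumes the batch arrives as a `records` map keyed by topic and partition, with each record's `value` base64-encoded; the topic name `example-topic` and the returned `processed` count are made up for the example.

```python
import base64
from typing import Any, Dict, List


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    decoded: List[str] = []
    # Records are grouped under keys of the form "<topic>-<partition>".
    for topic_partition, records in event.get("records", {}).items():
        for record in records:
            # Record values arrive base64-encoded; decode them to text.
            value = base64.b64decode(record["value"]).decode("utf-8")
            print(
                f"{record['topic']}[{record['partition']}]"
                f"@{record['offset']}: {value}"
            )
            decoded.append(value)
    # Hypothetical return shape, used here only to make the sketch testable.
    return {"processed": len(decoded)}


# Local invocation with a hand-built event (assumed shape, see above).
sample_event = {
    "eventSource": "aws:kafka",
    "records": {
        "example-topic-0": [
            {
                "topic": "example-topic",
                "partition": 0,
                "offset": 15,
                "value": base64.b64encode(b"hello").decode("ascii"),
            }
        ]
    },
}
result = lambda_handler(sample_event, None)
```

Since Lambda polls the topic and invokes the function synchronously with each batch, the handler only needs to iterate over the records it is given; custom processing logic would replace the `print` call.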

All templates are configured to apply best practices to monitor functionality using dashboards and alarms, and to secure data.

This implementation guide describes architectural considerations and configuration steps for deploying the Streaming Data Solution for Amazon MSK in the Amazon Web Services (AWS) Cloud. It includes links to AWS CloudFormation templates that launch and configure the AWS services required to deploy this solution using AWS best practices for security and availability.

The guide is intended for IT architects, developers, and DevOps professionals who want to get started quickly with the core streaming services available in the AWS Cloud.

This solution is a demo. We do not recommend using it in production to handle regulated data, such as personally identifiable information (PII) or data subject to HIPAA or GDPR.