Stream Processing - Serverless Applications Lens

This whitepaper is in the process of being updated.

Stream Processing

Ingesting and processing real-time streaming data requires scalability and low latency to support a variety of applications such as activity tracking, transaction order processing, click-stream analysis, data cleansing, metrics generation, log filtering, indexing, social media analysis, and IoT device data telemetry and metering. These applications are often spiky and process thousands of events per second.

Using AWS Lambda and Amazon Kinesis, you can build a serverless stream processing application that scales automatically without provisioning or managing servers. Data processed by AWS Lambda can be stored in DynamoDB and analyzed later.

Characteristics:

  • You want to create a complete serverless architecture without managing any instance or server for processing streaming data.

  • You want to use the Amazon Kinesis Producer Library (KPL) to take care of data ingestion from the data producer's perspective.

Reference Architecture

The following reference architecture shows a common stream processing scenario: analyzing social media data.

Figure 5: Reference architecture for stream processing

  1. Data producers use the Amazon Kinesis Producer Library (KPL) to send social media streaming data to a Kinesis stream. Amazon Kinesis Agent and custom data producers that leverage the Kinesis API can also be used.

  2. An Amazon Kinesis stream collects, processes, and analyzes real-time streaming data produced by data producers. Data ingested into the stream can be processed by a consumer, which, in this case, is Lambda.

  3. AWS Lambda acts as a consumer of the stream and receives a batch of ingested records as a single event per invocation. Further processing is carried out by the Lambda function. The transformed data is then stored in persistent storage, which, in this case, is DynamoDB.

  4. Amazon DynamoDB provides a fast and flexible NoSQL database service including triggers that can integrate with AWS Lambda to make such data available elsewhere.

  5. Business users use a reporting interface on top of DynamoDB to gather insights from social media trend data.
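Step 3 above can be sketched as a small Lambda handler. This is a minimal illustration rather than the whitepaper's implementation: it assumes each record carries a JSON payload with `user` and `hashtag` fields, and the table name `SocialMediaTrends`, the `transform` function, and the optional `table` parameter are hypothetical names introduced here. The event shape (`Records`, `kinesis.data` base64-encoded, `eventID`) is the one Lambda uses for Kinesis event sources.

```python
import base64
import json


def decode_records(event):
    """Yield (event_id, payload) pairs from a Kinesis-triggered Lambda event."""
    for record in event["Records"]:
        # Lambda delivers each Kinesis record's data base64-encoded.
        raw = base64.b64decode(record["kinesis"]["data"])
        yield record["eventID"], json.loads(raw)


def transform(event_id, payload):
    """Hypothetical transformation: keep two fields and normalize the hashtag."""
    return {
        # Assumption: event_id is the table's key, so a retried record
        # overwrites the same item instead of creating a duplicate.
        "event_id": event_id,
        "user": payload["user"],
        "hashtag": payload["hashtag"].lower(),
    }


def handler(event, context, table=None):
    """Lambda entry point; `table` is injectable so the logic can be tested locally."""
    if table is None:
        import boto3  # imported lazily so local tests do not need AWS access

        table = boto3.resource("dynamodb").Table("SocialMediaTrends")  # assumed name
    items = [transform(event_id, payload) for event_id, payload in decode_records(event)]
    for item in items:
        table.put_item(Item=item)
    return {"processed": len(items)}
```

Keying each item on the record's event ID is one way to make the write idempotent, which matters because Lambda may redeliver a batch after a failure.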

Configuration notes:

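  • Follow best practices when re-sharding Kinesis streams to accommodate a higher ingestion rate. Concurrency for stream processing is dictated by the number of shards and by the parallelization factor: the maximum number of concurrent Lambda invocations is the shard count multiplied by the parallelization factor. For example, a stream with 10 shards and a parallelization factor of 5 allows up to 50 concurrent invocations. Adjust both according to your throughput requirements.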

  • Consider reviewing the Streaming Data Solutions whitepaper for batch processing, analytics on streams, and other useful patterns.

  • When not using KPL, account for partial failures in non-atomic operations such as PutRecords, because the Kinesis API returns both successfully and unsuccessfully processed records at ingestion time.

  • Duplicate records can occur, so build both retries and idempotency into your application for both consumers and producers.

  • Consider using Kinesis Data Firehose over Lambda when ingested data needs to be continuously loaded into Amazon S3, Amazon Redshift, or Amazon ES.

  • Consider using Kinesis Data Analytics over Lambda when standard SQL could be used to query streaming data, and load only its results into Amazon S3, Amazon Redshift, Amazon ES, or Kinesis Streams.

  • Follow best practices for AWS Lambda stream-based invocation, which cover the effects of batch size, concurrency per shard, and monitoring of stream processing in more detail.

  • Use Lambda maximum retry attempts, maximum record age, bisect batch on function error, and on-failure destination error controls to build more resilient stream processing applications.
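The note above on partial failures in PutRecords can be illustrated with a small retry helper. This is a sketch under stated assumptions: `put_records_with_retry` and its parameters are hypothetical names, `client` is expected to behave like a boto3 Kinesis client, and the caller is assumed to keep each batch within the PutRecords per-request limits. It relies on the documented response shape, where `FailedRecordCount` gives the number of failures and each entry in `Records` corresponds to the request entry at the same position, with an `ErrorCode` when that entry failed.

```python
import time


def put_records_with_retry(client, stream_name, records, max_attempts=5, base_delay=0.1):
    """Send records with PutRecords, resending only the entries that failed.

    Returns the records that are still unsent after max_attempts (empty on success).
    """
    pending = list(records)
    for attempt in range(max_attempts):
        response = client.put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return []
        # Response entries are in request order; failed ones carry an ErrorCode.
        pending = [
            record
            for record, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        if attempt < max_attempts - 1:
            # Exponential backoff before the next attempt.
            time.sleep(base_delay * (2 ** attempt))
    return pending
```

Because a retry can resend a record that was in fact stored, this kind of producer-side retry is one source of the duplicates mentioned above, which is why consumers should be idempotent.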
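The error controls and the parallelization factor mentioned in the notes above are set on the event source mapping between the stream and the function. The sketch below only builds the keyword arguments for Lambda's CreateEventSourceMapping API; the helper name `build_event_source_mapping` and every value shown (batch size, retry count, record age, and so on) are illustrative assumptions, not recommendations from this whitepaper.

```python
def build_event_source_mapping(stream_arn, function_name, failure_destination_arn):
    """Return keyword arguments for Lambda's CreateEventSourceMapping API."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": 100,                   # records per invocation (illustrative)
        "ParallelizationFactor": 2,         # concurrent batches per shard
        "MaximumRetryAttempts": 3,          # stop retrying a failed batch after 3 retries
        "MaximumRecordAgeInSeconds": 3600,  # discard records older than one hour
        "BisectBatchOnFunctionError": True, # split a failing batch to isolate bad records
        # On-failure destination (for example an SQS queue or SNS topic) that
        # receives details about batches that could not be processed.
        "DestinationConfig": {"OnFailure": {"Destination": failure_destination_arn}},
    }
```

With AWS credentials configured, the result could be passed along the lines of `boto3.client("lambda").create_event_source_mapping(**kwargs)`; the same settings can also be expressed in infrastructure-as-code templates.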