Reference Architecture - Analytics Lens

Reference Architecture

Figure 5: Lambda Architecture

Note that this architecture does not use any Amazon EC2 instances to run clusters or distributed streaming or processing frameworks. In the batch layer, AWS Glue provides serverless ETL and batch processing with Spark jobs. Glue extracts data from the Live Database and performs the batch layer processing that combines the Live Database data with the streaming data into complex pre-processed views. In the speed layer, Amazon Kinesis Data Analytics reads streaming data in real time from Amazon Kinesis Data Firehose and can join it with a reference dataset in Amazon S3 (extracted from the Live Database). Kinesis Data Analytics is used here to join the data in real time, filter it, or run machine learning anomaly detection algorithms, and to persist the results in Amazon S3. Amazon S3 acts as the serving layer, and Amazon Athena and Amazon QuickSight are used to query the processed data.
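The speed layer logic described above (join each streaming record with reference data, then flag abnormal values) can be illustrated with a minimal plain-Python sketch. This is not Kinesis Data Analytics code: the field names (`device_id`, `value`, `site`), the `REFERENCE` table, and the simple z-score check are all assumptions made for illustration, and the z-score stands in for whichever anomaly detection algorithm you actually run.

```python
from statistics import mean, pstdev

# Hypothetical reference rows, standing in for the dataset extracted from
# the Live Database and stored in Amazon S3.
REFERENCE = {
    "sensor-1": {"site": "plant-a"},
    "sensor-2": {"site": "plant-b"},
}


def enrich(event, reference):
    """Inner-join one streaming event with its reference row (None if no match)."""
    ref = reference.get(event["device_id"])
    return None if ref is None else {**event, **ref}


def is_anomalous(value, history, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    sigma = pstdev(history)
    if sigma == 0:
        return False  # no spread observed yet, so no z-score can be computed
    return abs(value - mean(history)) / sigma > threshold


def speed_layer(events, reference, threshold=3.0):
    """Yield enriched events, each tagged with an `anomaly` flag."""
    history = {}  # per-device list of values seen so far
    for event in events:
        record = enrich(event, reference)
        if record is None:
            continue  # drop events that have no reference row
        seen = history.setdefault(record["device_id"], [])
        record["anomaly"] = is_anomalous(record["value"], seen, threshold)
        seen.append(record["value"])
        yield record


if __name__ == "__main__":
    readings = [10, 10, 11, 10, 10, 50]
    events = [{"device_id": "sensor-1", "value": v} for v in readings]
    events.append({"device_id": "unknown", "value": 1})
    for record in speed_layer(events, REFERENCE):
        if record["anomaly"]:
            print(record)
```

In the architecture itself, the enriched and filtered records would be persisted to Amazon S3 rather than printed.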

  1. Streaming data producers: Continuously generated data, which can amount to terabytes per day, is collected and sent to Kinesis Data Streams or Kinesis Data Firehose.

  2. Batch layer: Using AWS Glue, you analyze the raw data in Amazon S3 in a batch-oriented fashion, examining the datasets over time against the historical data, and store the results back in Amazon S3.

  3. Speed layer: Using Amazon Kinesis Data Analytics, you analyze and filter the data to detect abnormalities in real time.

  4. Serving layer: Raw data, preprocessed views, and real-time filtered data are available in Amazon S3 for direct querying.

  5. Amazon Athena provides serverless SQL queries on top of Amazon S3 to power visualizations, dashboards, and ad hoc queries.

  6. Finally, use Amazon Athena and Amazon QuickSight together to query and visualize the data and build a dashboard that can be shared with other users of Amazon QuickSight in your organization.
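As a rough sketch of the querying step, the following Python shows how an ad hoc query against the serving layer might be submitted to Amazon Athena with boto3's `start_query_execution`. The database name, table name, and S3 output location are hypothetical placeholders, not values from this architecture, and the sketch assumes that a table has already been defined over the data in Amazon S3.

```python
def build_athena_request(sql, database, output_location):
    """Build the keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location):
    """Submit the query and return its execution ID (requires AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_athena_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical names: replace with your own database, table, and bucket.
    request = build_athena_request(
        sql="SELECT site, COUNT(*) AS anomalies FROM speed_results GROUP BY site",
        database="lambda_architecture",
        output_location="s3://example-bucket/athena-results/",
    )
    print(request)
```

Athena runs queries asynchronously, so a caller would typically poll the returned execution ID before reading the results.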