Architecture overview - Discovering Hot Topics using Machine Learning

Architecture overview

Deploying this solution with the default parameters builds the following environment in the AWS Cloud.

Figure 1: Discovering Hot Topics using Machine Learning solution architecture

The AWS CloudFormation template automatically deploys AWS Lambda functions, Amazon Simple Storage Service (Amazon S3) buckets, Amazon Kinesis Data Streams, an Amazon Simple Queue Service (Amazon SQS) dead-letter queue (DLQ), Amazon Kinesis Data Firehose, AWS Step Functions workflows, AWS Glue tables, and Amazon QuickSight resources in your account.

The solution architecture includes the following key components and workflows:

  1. Ingestion – Social media and RSS feed ingestion and management using Lambda functions, Amazon DynamoDB, and Amazon CloudWatch Event Scheduler.

  2. Data Stream – The data is buffered through Amazon Kinesis Data Streams to provide resiliency and throttle incoming requests. The Data Streams have a configured DLQ to catch any errors in processing feeds.

  3. Workflow – A Lambda function that consumes the Data Streams initiates a Step Functions workflow, which orchestrates AWS machine learning capabilities including Amazon Translate, Amazon Comprehend, and Amazon Rekognition.

  4. Integration – The inference data integrates with the storage components through an event-driven architecture using Amazon EventBridge. You can customize the solution further by configuring EventBridge rules that route events to additional targets.

  5. Inference – AWS machine learning capabilities through Amazon Translate, Amazon Comprehend, and Amazon Rekognition.

  6. Storage and Visualization – A combination of Amazon Kinesis Data Firehose, Amazon S3 buckets, AWS Glue tables, Amazon Athena, and Amazon QuickSight.
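The hand-off from the Data Stream to the Workflow component can be sketched as follows. This is a minimal illustration, not the solution's actual code: the function names `decode_kinesis_records` and `handle` are hypothetical, and the `start_execution` callable is assumed to be the `start_execution` method of a boto3 Step Functions client (or a stand-in for it during testing). It relies only on the documented shape of a Kinesis event delivered to Lambda, where each record's payload is base64-encoded.

```python
import base64
import json
from typing import Any, Callable, Dict, List


def decode_kinesis_records(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the base64-encoded JSON payloads in a Kinesis Lambda event."""
    feeds = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        feeds.append(json.loads(payload))
    return feeds


def handle(
    event: Dict[str, Any],
    start_execution: Callable[..., Any],
    state_machine_arn: str,
) -> int:
    """Start one workflow execution per decoded feed item; return the count.

    `start_execution` is injected so the function can be exercised without
    AWS access. In a deployed function it would typically be
    boto3.client("stepfunctions").start_execution.
    """
    feeds = decode_kinesis_records(event)
    for feed in feeds:
        start_execution(stateMachineArn=state_machine_arn, input=json.dumps(feed))
    return len(feeds)
```

Injecting the client call keeps the decoding logic testable in isolation. Whether the solution starts one execution per record or one per batch is not stated here, so the per-record choice above is an assumption.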

These components are designed according to the AWS Well-Architected Framework and its pillars of Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization, with the goal of a secure, high-performing, resilient, and efficient infrastructure.

Operational Excellence

This solution's AWS CloudFormation template was built with the AWS Cloud Development Kit (AWS CDK). The template does not hard-code resource names or Regions, so it can be deployed in any Region where the services required by the solution are available.

Amazon CloudWatch Logs for Lambda functions and monitoring features provided by services such as Step Functions, Kinesis Data Streams, and Kinesis Data Firehose provide observability into the infrastructure.

Security

This solution implements encryption at rest and encryption in transit. For encryption at rest, Amazon S3 buckets have SSE-S3 encryption activated, and DynamoDB tables have AWS managed encryption activated. For encryption in transit, all AWS service endpoints use HTTPS, and Kinesis Data Streams have AWS managed encryption activated.

The ingestion component of this solution requires Twitter credentials. The bearer token required to authenticate with Twitter APIs is read into memory from AWS Systems Manager Parameter Store.
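Reading the token into memory might look like the following sketch. The helper name `read_bearer_token` and the parameter path in the usage note are hypothetical. The `ssm_client` argument is assumed to be a boto3 Systems Manager client, whose `get_parameter` call accepts `Name` and `WithDecryption`.

```python
from typing import Any


def read_bearer_token(ssm_client: Any, parameter_name: str) -> str:
    """Fetch a parameter from Parameter Store and return its decrypted value.

    The value is only returned to the caller; it is not logged or written
    to disk by this helper.
    """
    response = ssm_client.get_parameter(Name=parameter_name, WithDecryption=True)
    return response["Parameter"]["Value"]
```

In a Lambda function this could be called as `read_bearer_token(boto3.client("ssm"), "/example/twitter/bearer-token")`, where the path is a placeholder rather than the name the solution uses.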

Additionally, the interactions between the services within this solution are controlled by IAM role policies. The policies are configured on the principle of least privilege access.

Reliability

This solution is based on AWS serverless artificial intelligence (AI), compute, and storage services (Lambda, Amazon Rekognition, Amazon Translate, Amazon Comprehend, and DynamoDB) to support high availability and reliability. The workflow tasks use an Amazon SQS-based asynchronous callback service integration pattern to mitigate throttling errors from burst workloads. The solution also uses a dead-letter queue (DLQ) to capture failed events so that you can troubleshoot and resolve the underlying issues.
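The callback pattern mentioned above can be illustrated with a short sketch. It is not taken from the solution's source: the function names are hypothetical, and the message body layout (`input` and `taskToken` keys) is an assumption. The `.waitForTaskToken` resource suffix and the `$$.Task.Token` context path are standard Step Functions features, and `send_task_success` and `send_task_failure` are boto3 Step Functions client methods.

```python
import json
from typing import Any, Callable, Dict


def callback_task_state(queue_url: str, next_state: str) -> Dict[str, Any]:
    """Build an Amazon States Language task that sends a message to SQS
    and pauses until a worker returns the task token."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
        "Parameters": {
            "QueueUrl": queue_url,
            "MessageBody": {
                "input.$": "$",                  # forward the state input
                "taskToken.$": "$$.Task.Token",  # token the worker must return
            },
        },
        "Next": next_state,
    }


def process_message(
    body: str,
    sfn_client: Any,
    infer: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> None:
    """Run inference for one queued message and report back to the workflow."""
    message = json.loads(body)
    token = message["taskToken"]
    try:
        result = infer(message["input"])
    except Exception as error:
        # Tell the workflow that this task failed so it can retry or stop.
        sfn_client.send_task_failure(
            taskToken=token, error=type(error).__name__, cause=str(error)
        )
        return
    sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
```

Because the queue absorbs bursts, the worker can call the inference service at its own pace while the workflow waits, which is how this pattern helps with throttling.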

Performance Efficiency

This solution uses Lambda functions to provide concurrency and scaling, which ensures efficient use of compute resources.

It uses DynamoDB to achieve high throughput with single-digit millisecond latency, and resiliency through automatic scaling and on-demand scaling. Kinesis Data Streams provides data buffering that makes the architecture resilient to data bursts and spikes. Data is stored in a columnar format and partitioned to optimize query performance for reporting.

Cost Optimization

The choice of serverless components in compute, storage, and AI services ensures that you are only charged for the services you use.

Using DynamoDB on-demand capacity mode gives customers the option to first understand their workloads and then set Read Capacity Units (RCUs) and Write Capacity Units (WCUs) to suit each individual workload.

A Lambda function is invoked nightly to create partitions for the AWS Glue tables, which eliminates the need for AWS Glue Crawler to scan the entire dataset, and saves on cost.
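One way to register a daily partition without a crawler is to issue an Athena `ALTER TABLE ... ADD PARTITION` statement, sketched below. This is an illustration under stated assumptions rather than the solution's implementation, which may instead call the AWS Glue API directly. The helper name, the partition key `created_at`, and the S3 layout are all hypothetical.

```python
from datetime import date


def add_partition_statement(table: str, bucket: str, prefix: str, day: date) -> str:
    """Build an Athena DDL statement that registers one daily partition.

    Assumes a hypothetical partition key named created_at and a
    Hive-style S3 layout of <prefix>/created_at=YYYY-MM-DD/.
    """
    partition_value = day.isoformat()
    location = f"s3://{bucket}/{prefix}/created_at={partition_value}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (created_at = '{partition_value}') "
        f"LOCATION '{location}'"
    )
```

The `IF NOT EXISTS` clause makes the nightly run safe to repeat, since registering the same partition twice does not raise an error.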