Aggregate data in Amazon DynamoDB for ML forecasting in Athena
Created by Sachin Doshi (AWS) and Peter Molnar (AWS)
Code repository: Use ML predictions over Amazon DynamoDB data with Amazon Athena ML | Environment: Production | Technologies: Machine learning & AI; Databases; Serverless |
Workload: Open-source | AWS services: Amazon Athena; Amazon DynamoDB; AWS Lambda; Amazon SageMaker; Amazon QuickSight |
Summary
This pattern shows you how to build complex aggregations of Internet of Things (IoT) data in an Amazon DynamoDB table by using Amazon Athena. You also learn how to enrich the data with machine learning (ML) inference by using Amazon SageMaker and how to query geospatial data by using Athena. You can use this pattern as the basis for creating an ML forecasting solution that meets your organization's requirements.
For demonstration purposes, this pattern uses an example scenario of a business that's operating a scooter rideshare and wants to predict the optimal number of scooters that must be deployed for customers in different urban neighborhoods. The business uses a pre-trained ML model that predicts customer demand for the next hour based on the past four hours. The scenario uses a public dataset from the Office of Civic Innovation & Technology
Prerequisites and limitations
An active AWS account
Permissions to create an AWS CloudFormation stack with AWS Identity and Access Management (IAM) roles for the following:
Amazon Simple Storage Service (Amazon S3) bucket
Athena
DynamoDB
SageMaker
AWS Lambda
Architecture
Technology stack
Amazon QuickSight
Amazon S3
Athena
DynamoDB
Lambda
SageMaker
Target architecture
The following diagram shows an architecture for building complex aggregations of data in DynamoDB by using the querying capabilities of Athena, a Lambda function, Amazon S3 storage, a SageMaker endpoint, and a QuickSight dashboard.
The diagram shows the following workflow:
A DynamoDB table ingests IoT data that's transmitted from a fleet of scooters.
A Lambda function loads the DynamoDB table with the ingested data.
An Athena query creates a new DynamoDB table for the geospatial data that represents the urban neighborhoods.
The query location is saved in an S3 bucket.
An Athena function queries the ML inference from the SageMaker endpoint that hosts the pre-trained ML model.
Athena queries data directly from the DynamoDB tables and aggregates the data for analysis.
A user views the output of the analyzed data in a QuickSight dashboard.
Tools
AWS services
Amazon Athena is an interactive query service that helps you analyze data directly in Amazon S3 by using standard SQL.
Amazon DynamoDB is a fully managed NoSQL database service that provides fast, predictable, and scalable performance.
Amazon SageMaker is a managed ML service that helps you build and train ML models and then deploy them into a production-ready hosted environment.
Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
Amazon QuickSight is a cloud-scale business intelligence (BI) service that helps you visualize, analyze, and report your data in a single dashboard.
AWS Lambda is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.
Code repository
The code for this pattern is available in the GitHub Use ML predictions over Amazon DynamoDB data with Amazon Athena ML
A DynamoDB table
A Lambda function to load the table with relevant data
A SageMaker endpoint for inference requests, with the pre-trained XGBoost model that's stored in Amazon S3
An Athena workgroup named
V2EngineWorkGroup
Named Athena queries to look up the geospatial shapefiles and predict scooter demand
A prebuilt Amazon Athena DynamoDB connector that enables Athena to communicate with DynamoDB and uses AWS Serverless Application Model (AWS SAM)
to build the application in reference to the DynamoDB connector
Epics
Task | Description | Skills required |
---|---|---|
Download the dataset and resources. |
| App developer, Data scientist |
Task | Description | Skills required |
---|---|---|
Create a CloudFormation stack. |
Note: It can take 15–20 minutes for the CloudFormation stack to create these resources. | AWS DevOps |
Verify the CloudFormation deployment. | To verify that the sample data from the CloudFormation template is loaded into DynamoDB, do the following:
| App developer |
Task | Description | Skills required |
---|---|---|
Create an Athena table with geospatial data. | To load the geolocation files into Athena, do the following:
The query creates a new table for the geospatial data that represents the urban neighborhoods. The data table is created from GIS shapefiles. The For the Python code to process shapefiles and produce this table, see Geo-Spatial processing of GIS shapefiles with Amazon Athena | Data engineer |
Task | Description | Skills required |
---|---|---|
Declare a function in Athena to query SageMaker. |
| Data scientist, Data engineer |
Predict demand for scooters by neighborhood from the aggregated DynamoDB data. | Now you can use Athena to query transactional data directly from DynamoDB, and then aggregate the data for analysis and forecasting. This isn’t easily achieved by directly querying a DynamoDB NoSQL database.
The SQL statement does the following:
For information about using SQL to aggregate DynamoDB data and SageMaker inference data in Athena, see athena_long.sql | App developer, Data scientist |
Verify the output. | The output table includes the neighborhood, longitude, and latitude of the centroid of the neighborhood. It also includes the number of vehicles that are predicted for the next hour. The query produces the predictions for a selected point in time. You can make predictions for any other time by changing the expression If you have a real-time data feed in your DynamoDB table, change the timestamp to | App developer, Data scientist |
Task | Description | Skills required |
---|---|---|
Delete resources. |
| App developer, AWS DevOps |
Related resources
Amazon Athena Query Federation SDK
(GitHub) Querying geospatial data (AWS documentation)
Use ML predictions over Amazon DynamoDB data with Amazon Athena ML
(AWS Big Data Blog) Amazon ElastiCache (Redis OSS)
(AWS documentation) Amazon Neptune
(AWS documentation)