Derive insights with outside-in data movement

You can also move data in the other direction: from the outside-in. For example, you can copy query results for sales of products in a given Region from your data warehouse into your data lake, to run product recommendation algorithms against a larger data set using machine learning. Think of this concept as outside-in data movement.
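
To make the warehouse-to-lake copy concrete, the following is a minimal Python sketch that submits a Redshift UNLOAD statement through the Amazon Redshift Data API, landing per-product sales for one Region in the data lake as Parquet. This is an illustration only: the cluster identifier, database, user, sales table, S3 bucket, and IAM role are hypothetical placeholders, not names from this guide.

    import boto3

    # Hedged sketch: copy warehouse query results into the data lake by
    # running UNLOAD through the Redshift Data API. All identifiers below
    # (cluster, database, user, table, bucket, role) are hypothetical.
    redshift_data = boto3.client("redshift-data")

    # Inside UNLOAD's quoted query, embedded string literals are escaped
    # by doubling the single quotes.
    unload_sql = """
    UNLOAD ('SELECT product_id, SUM(amount) AS total_sales
             FROM sales
             WHERE region = ''us-west-2''
             GROUP BY product_id')
    TO 's3://example-data-lake/sales-by-region/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftUnloadRole'
    FORMAT AS PARQUET;
    """

    response = redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=unload_sql,
    )
    print(response["Id"])  # statement ID; poll with describe_statement

Because UNLOAD writes the result set directly to Amazon S3, the warehouse query results become files in the lake that downstream machine learning jobs can read at scale.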


[Diagram: Outside-in data movement]

Derive insights from Amazon DynamoDB data for real-time prediction with Amazon SageMaker

Amazon DynamoDB is a fast NoSQL database used by applications that need consistent, single-digit millisecond latency. Customers want to move valuable data from DynamoDB into Amazon S3 to derive insights. This data in S3 can be the primary source for understanding customers’ past behavior, predicting future behavior, and generating downstream business value.
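
One way to land DynamoDB data in S3 is the native table export feature, shown in the minimal boto3 sketch below. It assumes point-in-time recovery is enabled on the table; the table ARN, bucket, and prefix are hypothetical placeholders.

    import boto3

    # Hedged sketch: export a DynamoDB table to Amazon S3 with the native
    # export feature (requires point-in-time recovery on the table).
    dynamodb = boto3.client("dynamodb")

    response = dynamodb.export_table_to_point_in_time(
        TableArn="arn:aws:dynamodb:us-east-1:111122223333:table/CustomerEvents",
        S3Bucket="example-data-lake",
        S3Prefix="dynamodb-exports/customer-events/",
        ExportFormat="DYNAMODB_JSON",  # lands JSON files, as in step 1 below
    )
    print(response["ExportDescription"]["ExportArn"])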

The following diagram illustrates outside-in data movement of DynamoDB data to derive personalized recommendations.


[Diagram: Derive insights from Amazon DynamoDB data for real-time prediction with Amazon SageMaker]

The steps that data follows through the architecture are as follows:

  1. Export the DynamoDB tables as JSON into Amazon S3 (the export sketch above shows one way to do this).

  2. AWS Glue converts the exported JSON files to comma-separated values (.csv) format for use as a data source for Amazon SageMaker (see the job sketch after this list).

  3. Amazon SageMaker refreshes the model artifact by retraining on the new data and updates the endpoint.

  4. The converted .csv file is also available for ad hoc queries with Amazon Athena.
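
As a rough illustration of step 2, the following is a minimal AWS Glue (PySpark) job script, not the exact job from this guide: it reads the exported JSON from Amazon S3 and writes the records back as .csv. The S3 paths are hypothetical placeholders, and a real job would typically also select and reshape columns for SageMaker.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the exported DynamoDB JSON files from Amazon S3.
    exported = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-data-lake/dynamodb-exports/"]},
        format="json",
    )

    # Write the same records back to S3 as .csv for SageMaker and Athena.
    glue_context.write_dynamic_frame.from_options(
        frame=exported,
        connection_type="s3",
        connection_options={"path": "s3://example-data-lake/csv/customer-events/"},
        format="csv",
    )

    job.commit()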

Derive insights from Amazon Aurora data with Apache Hudi, AWS Glue, AWS DMS, and Amazon Redshift

AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon S3. When the data is in Amazon S3, customers process it based on their analytics requirements. A typical requirement is to keep the data in S3 in sync with updates on the source systems. Although it’s easy to apply updates to a relational database management system (RDBMS) that backs an online source application, it’s difficult to apply this change data capture (CDC) process to your data lake. Apache Hudi is a good way to solve this problem. Currently, you can use Hudi on Amazon EMR to create Hudi tables.
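
As a minimal sketch of how Hudi on Amazon EMR applies such updates, the following PySpark snippet upserts change records that AWS DMS landed in a raw S3 location into a Hudi table in a curated location. The S3 paths and the id and updated_at columns are hypothetical placeholders.

    from pyspark.sql import SparkSession

    # Hedged sketch: apply CDC records from the raw bucket to a Hudi table
    # in the curated bucket. Paths and column names are hypothetical.
    spark = (
        SparkSession.builder.appName("hudi-cdc-apply")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )

    changes = spark.read.parquet("s3://example-raw/dms-output/sales/")

    hudi_options = {
        "hoodie.table.name": "sales",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",  # inserts and updates
    }

    (
        changes.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://example-curated/hudi/sales/")
    )

The precombine field (here, updated_at) tells Hudi which record wins when several changes arrive for the same record key, which is what lets an upsert replay source updates in the lake.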

The following diagram illustrates outside-in data movement of change data from an Amazon Aurora PostgreSQL database to derive analytics.


[Diagram: Derive insights from Amazon Aurora data with Apache Hudi, AWS Glue, AWS DMS, and Amazon Redshift]

The steps that data follows through the architecture are as follows:

  1. AWS DMS replicates the data from the Aurora cluster to the raw S3 bucket.

  2. AWS Glue jobs use Apache Hudi to create tables in the AWS Glue Data Catalog. An AWS Glue job (HudiJob) is scheduled to run at the frequency set in the ScheduleToRunGlueJob parameter.

  3. This job reads the data from the raw S3 bucket, writes to the curated S3 bucket, and creates a Hudi table in the Data Catalog.

  4. The job also creates an Amazon Redshift external schema in the Amazon Redshift cluster (see the sketch after this list).

  5. You can now query the Hudi table in Amazon Athena or Amazon Redshift.
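
As a minimal sketch of steps 4 and 5, the following Python snippet uses the Amazon Redshift Data API to create an external schema over the Data Catalog database that holds the Hudi table, and then to query it. The cluster, database, IAM role, and the schema and table names are hypothetical placeholders.

    import boto3

    # Hedged sketch: expose the Data Catalog database holding the Hudi table
    # to Amazon Redshift as an external schema, then query it. All
    # identifiers below are hypothetical.
    redshift_data = boto3.client("redshift-data")

    def run(sql):
        # Submits a statement; a real pipeline would poll describe_statement
        # until each statement finishes before submitting the next one.
        return redshift_data.execute_statement(
            ClusterIdentifier="example-cluster",
            Database="dev",
            DbUser="awsuser",
            Sql=sql,
        )["Id"]

    run("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS hudi_lake
    FROM DATA CATALOG
    DATABASE 'curated_db'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';
    """)

    print(run("SELECT COUNT(*) FROM hudi_lake.sales;"))

Amazon Redshift Spectrum can query Apache Hudi copy-on-write tables registered in the Data Catalog, which is what makes step 5 possible from the Redshift side.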

Refer to the blog post Creating a source to Lakehouse data replication pipe using Apache Hudi, AWS Glue, AWS DMS, and Amazon Redshift for additional details.