This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.
2.1 Big Data, analytics, and machine learning use cases
Requirements addressed:

- REQ1 (data residency)
- REQ3 (data access controls)
- REQ4 (availability and durability)

AWS services – Amazon Redshift

Figure: Amazon analytics and machine learning at the data lake
Requirements REQ2 (data protection) and Customer-REQ1 (reliable connectivity) can be met through complementary use of architecture 1.1: Hybrid network connectivity from a data center to the AWS Cloud.
The scalability and durability of its services, as well as the pay-as-you-go model, lead many customers to choose AWS for their analytics and ML workloads. When data residency is required, hybrid architectures can be used for analytics and ML use cases as well. In this example:
- The data source layer consists of various on-premises data sources, such as relational databases (Oracle, MySQL, PostgreSQL), enterprise applications (SAP, CRM), Amazon S3-compatible object stores (Ceph, MinIO, Eucalyptus), and the Hadoop Distributed File System (HDFS).

- Export from relational databases is done by AWS Database Migration Service (AWS DMS) with the full load (one-time export) or change data capture (continuous data export) options, which transfer data into an Amazon S3 data lake to build a highly available and scalable data lake solution. AWS DMS can migrate data from the most widely used commercial and open-source databases. A minimal sketch of this step appears after this list.

- Amazon S3-compatible object stores and enterprise applications export data as files and store it in shared folders on Network File System (NFS) or Server Message Block (SMB) file servers.

- AWS DataSync agents transfer data files from the shared folders to AWS DataSync. DataSync optionally performs data integrity verification and puts the files into the Amazon S3 data lake. See the DataSync sketch after this list.

- Customers with local HDFS can use a DataSync agent to replicate data into the Amazon S3 data lake as described in the previous step, or use the S3DistCp command to copy large amounts of data from HDFS to S3 directly. The command for S3DistCp in Amazon EMR version 4.0 and later is s3-dist-cp, which you can add as a step in a cluster or run at the command line. S3DistCp can also be run in a local Hadoop cluster. Additional information can be found in the DistCp Guide. An example EMR step is sketched after this list.

- Another data flow, from devices and applications located anywhere, for near real-time data ingestion (such as streaming data) can be performed with Amazon Data Firehose or Amazon MSK (through various Kafka Connect connectors) and stored in the Amazon S3 data lake, or processed in near real time with Amazon Managed Service for Apache Flink. A Firehose ingestion sketch appears after this list.

- From any location (on-premises or remote), customers can use Amazon ML services such as Amazon SageMaker AI Studio to build and train ML models for any use case with fully managed cloud infrastructure, tools, and workflows. Amazon SageMaker AI lets you use data stored in the Amazon S3 data lake to train, tune, and validate ML models. Compiled models can be stored in Amazon S3 object storage and deployed at AWS endpoints, or downloaded and deployed locally. A training-job sketch appears after this list.

- AWS Glue is used to catalog the data and store it in an Apache Hive metastore-compatible format.

- Query data with Amazon Athena, Amazon EMR Spark jobs, and/or Amazon Redshift Spectrum, which lets you analyze data in Amazon S3 using standard SQL. A Glue and Athena sketch appears after this list.
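The following is a minimal boto3 sketch of the AWS DMS step described above. The identifiers, hostnames, ARNs, and table mappings are placeholders chosen for illustration, and a DMS replication instance is assumed to already exist.

```python
import json
import boto3

# Minimal sketch: full-load-and-cdc replication from an on-premises PostgreSQL
# database into an Amazon S3 data lake. All names and ARNs are placeholders.
dms = boto3.client("dms", region_name="eu-central-1")

source = dms.create_endpoint(
    EndpointIdentifier="onprem-postgres",
    EndpointType="source",
    EngineName="postgres",
    ServerName="db.example.internal",   # assumed on-premises host
    Port=5432,
    Username="dms_user",
    Password="CHANGE_ME",
    DatabaseName="sales",
)

target = dms.create_endpoint(
    EndpointIdentifier="s3-data-lake",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "example-data-lake",   # assumed data lake bucket
        "BucketFolder": "raw/sales",
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-access",
    },
)

task = dms.create_replication_task(
    ReplicationTaskIdentifier="sales-to-data-lake",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:eu-central-1:111122223333:rep:EXAMPLE",
    MigrationType="full-load-and-cdc",  # one-time export plus change data capture
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```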
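Similarly, the DataSync transfer from an on-premises NFS share into the S3 data lake can be defined with boto3, as sketched below. The agent ARN, hostnames, bucket, and IAM role are assumed placeholders; the DataSync agent itself is deployed and activated on premises beforehand.

```python
import boto3

# Minimal sketch: copy files from an on-premises NFS share into the S3 data lake
# with AWS DataSync. Agent ARN, hostnames, and role are placeholders.
datasync = boto3.client("datasync", region_name="eu-central-1")

nfs_location = datasync.create_location_nfs(
    ServerHostname="fileserver.example.internal",   # assumed on-premises NFS server
    Subdirectory="/exports/app-data",
    OnPremConfig={
        "AgentArns": ["arn:aws:datasync:eu-central-1:111122223333:agent/agent-EXAMPLE"]
    },
)

s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-data-lake",
    Subdirectory="raw/app-data",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/datasync-s3-access"},
)

task = datasync.create_task(
    SourceLocationArn=nfs_location["LocationArn"],
    DestinationLocationArn=s3_location["LocationArn"],
    Name="nfs-to-data-lake",
    # Verify the integrity of the files that were transferred.
    Options={"VerifyMode": "ONLY_FILES_TRANSFERRED"},
)

datasync.start_task_execution(TaskArn=task["TaskArn"])
```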
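For the HDFS path, an s3-dist-cp step can be submitted to an existing Amazon EMR cluster, for example with boto3 as sketched below. The cluster ID and the HDFS and S3 paths are assumptions.

```python
import boto3

# Minimal sketch: add an s3-dist-cp step to an existing EMR cluster (version 4.0+)
# to copy data from HDFS into the S3 data lake. Cluster ID and paths are placeholders.
emr = boto3.client("emr", region_name="eu-central-1")

emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",
    Steps=[{
        "Name": "Copy HDFS data to S3 data lake",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "hdfs:///user/hadoop/raw/",
                "--dest", "s3://example-data-lake/raw/hdfs/",
            ],
        },
    }],
)
```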
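Near real-time ingestion through Amazon Data Firehose can be as simple as putting records onto an existing delivery stream that targets the S3 data lake. The stream name and event payload below are assumptions.

```python
import json
import boto3

# Minimal sketch: send a device event to an existing Amazon Data Firehose
# delivery stream that delivers to the S3 data lake. Stream name is a placeholder.
firehose = boto3.client("firehose", region_name="eu-central-1")

event = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2024-01-01T12:00:00Z"}

firehose.put_record(
    DeliveryStreamName="device-events-to-data-lake",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```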
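Training against data in the S3 data lake follows the usual SageMaker pattern of pointing a training job at S3 input and output prefixes. The sketch below uses the SageMaker Python SDK with the built-in XGBoost image as an illustrative choice; the execution role, bucket, and prefixes are assumptions.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Minimal sketch: train a model on CSV data stored in the S3 data lake using the
# built-in XGBoost container. Role, bucket, and prefixes are placeholders.
session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::111122223333:role/sagemaker-execution-role"

image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-data-lake/models/",   # trained model artifacts land here
    hyperparameters={"objective": "reg:squarederror", "num_round": "100"},
)

estimator.fit({
    "train": TrainingInput("s3://example-data-lake/curated/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://example-data-lake/curated/validation/", content_type="text/csv"),
})
```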
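Finally, cataloging and querying the data lake can be sketched as a Glue crawler that populates the Data Catalog, followed by an Athena query with standard SQL. The crawler role, database, bucket, and table names are assumptions.

```python
import boto3

# Minimal sketch: catalog the S3 data lake with an AWS Glue crawler, then query it
# with Amazon Athena. Role, database, bucket, and table names are placeholders.
glue = boto3.client("glue", region_name="eu-central-1")
athena = boto3.client("athena", region_name="eu-central-1")

glue.create_crawler(
    Name="data-lake-crawler",
    Role="arn:aws:iam::111122223333:role/glue-crawler-role",
    DatabaseName="data_lake",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/"}]},
)
glue.start_crawler(Name="data-lake-crawler")

# Once the crawler has populated the Data Catalog, query the table with standard SQL.
query = athena.start_query_execution(
    QueryString="SELECT device_id, avg(temperature) FROM events GROUP BY device_id",
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
print("Athena query started:", query["QueryExecutionId"])
```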
The data residency requirement allows you to collect, store, and process data in the on-premises data center first, and then copy, export, or back up that data to the AWS Cloud (REQ1).
The Amazon S3 data lake provides a permissions model to control access to stored data and to the metadata that describes that data. This approach addresses the data access controls requirement (REQ3).
Using an Amazon S3 data lake to build a highly available and scalable data lake solution addresses the data availability and durability requirement (REQ4).
For more information about data lakes and analytics on AWS, refer to Analytics on AWS.