This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.
2.1 Big Data, analytics, and machine learning use cases
Requirements addressed:

- REQ1 (data residency)
- REQ3 (data access controls)
- REQ4 (availability and durability)

AWS services – Amazon Redshift

Figure: Amazon analytics and machine learning at the data lake
Requirements REQ2 (data protection) and Customer-REQ1 (reliable connectivity) can be met through complementary use of architecture 1.1: Hybrid network connectivity from a data center to the AWS Cloud.
The scalability and durability of its services, as well as the pay-as-you-go model, lead many customers to choose AWS for their analytics and ML workloads. When data residency is required, hybrid architectures can be used for analytics and ML use cases as well. In this example:
- The data source layer consists of various on-premises data sources, such as relational databases (Oracle, MySQL, PostgreSQL), enterprise applications (SAP, CRM), Amazon S3-compatible object stores (Ceph, MinIO, Eucalyptus), and the Hadoop Distributed File System (HDFS).

- Export from relational databases is done by AWS Database Migration Service (AWS DMS) with the full load (one-time export) or change data capture (continuous data export) options, which transfer data into an Amazon S3 data lake to build a highly available and scalable data lake solution. AWS DMS can migrate data from the most widely used commercial and open-source databases. A minimal sketch of this step appears after this list.

- Amazon S3-compatible object stores and enterprise applications export data as files and store it in shared folders on Network File System (NFS) or Server Message Block (SMB) file servers.

- AWS DataSync agents transfer data files from the shared folders to AWS DataSync. DataSync optionally performs data integrity verification and puts the files into the Amazon S3 data lake. See the DataSync sketch after this list.

- Customers with local HDFS can use a DataSync agent to replicate data into the Amazon S3 data lake as described in the previous step, or use the S3DistCp command to copy large amounts of data from HDFS to S3 directly. The command for S3DistCp in Amazon EMR version 4.0 and later is s3-dist-cp, which you can add as a step in a cluster or run at the command line. S3DistCp can also be run in a local Hadoop cluster. Additional information can be found in the DistCp Guide. An example EMR step is sketched after this list.

- Another data flow, from devices and applications located anywhere, for near real-time data ingestion (such as streaming data) can be performed with Amazon Data Firehose or Amazon MSK (through various Kafka Connect connectors) and stored in the Amazon S3 data lake, or processed in near real time with Amazon Managed Service for Apache Flink. A Firehose ingestion sketch appears after this list.

- From any location (on-premises or remote), customers can use Amazon ML services such as Amazon SageMaker AI Studio to build and train ML models for any use case with fully managed cloud infrastructure, tools, and workflows. Amazon SageMaker AI lets you use data stored in the Amazon S3 data lake to train, tune, and validate ML models. Compiled models can be stored in Amazon S3 object storage and deployed at AWS endpoints, or downloaded and deployed locally. A training-job sketch appears after this list.

- AWS Glue is used to catalog the data and store it in an Apache Hive metastore-compatible format.

- Query data with Amazon Athena, Amazon EMR Spark jobs, and/or Amazon Redshift Spectrum, which lets you analyze data in Amazon S3 using standard SQL. A Glue and Athena sketch appears after this list.
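The following is a minimal boto3 sketch of the AWS DMS step described above. The identifiers, hostnames, ARNs, and table mappings are placeholders chosen for illustration, and a DMS replication instance is assumed to already exist.

```python
import json
import boto3

# Minimal sketch: full-load-and-cdc replication from an on-premises PostgreSQL
# database into an Amazon S3 data lake. All names and ARNs are placeholders.
dms = boto3.client("dms", region_name="eu-central-1")

source = dms.create_endpoint(
    EndpointIdentifier="onprem-postgres",
    EndpointType="source",
    EngineName="postgres",
    ServerName="db.example.internal",   # assumed on-premises host
    Port=5432,
    Username="dms_user",
    Password="CHANGE_ME",
    DatabaseName="sales",
)

target = dms.create_endpoint(
    EndpointIdentifier="s3-data-lake",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "example-data-lake",   # assumed data lake bucket
        "BucketFolder": "raw/sales",
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-access",
    },
)

task = dms.create_replication_task(
    ReplicationTaskIdentifier="sales-to-data-lake",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:eu-central-1:111122223333:rep:EXAMPLE",
    MigrationType="full-load-and-cdc",  # one-time export plus change data capture
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```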
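Similarly, the DataSync transfer from an on-premises NFS share into the S3 data lake can be defined with boto3, as sketched below. The agent ARN, hostnames, bucket, and IAM role are assumed placeholders; the DataSync agent itself is deployed and activated on premises beforehand.

```python
import boto3

# Minimal sketch: copy files from an on-premises NFS share into the S3 data lake
# with AWS DataSync. Agent ARN, hostnames, and role are placeholders.
datasync = boto3.client("datasync", region_name="eu-central-1")

nfs_location = datasync.create_location_nfs(
    ServerHostname="fileserver.example.internal",   # assumed on-premises NFS server
    Subdirectory="/exports/app-data",
    OnPremConfig={
        "AgentArns": ["arn:aws:datasync:eu-central-1:111122223333:agent/agent-EXAMPLE"]
    },
)

s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-data-lake",
    Subdirectory="raw/app-data",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/datasync-s3-access"},
)

task = datasync.create_task(
    SourceLocationArn=nfs_location["LocationArn"],
    DestinationLocationArn=s3_location["LocationArn"],
    Name="nfs-to-data-lake",
    # Verify the integrity of the files that were transferred.
    Options={"VerifyMode": "ONLY_FILES_TRANSFERRED"},
)

datasync.start_task_execution(TaskArn=task["TaskArn"])
```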
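For the HDFS path, an s3-dist-cp step can be submitted to an existing Amazon EMR cluster, for example with boto3 as sketched below. The cluster ID and the HDFS and S3 paths are assumptions.

```python
import boto3

# Minimal sketch: add an s3-dist-cp step to an existing EMR cluster (version 4.0+)
# to copy data from HDFS into the S3 data lake. Cluster ID and paths are placeholders.
emr = boto3.client("emr", region_name="eu-central-1")

emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",
    Steps=[{
        "Name": "Copy HDFS data to S3 data lake",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "hdfs:///user/hadoop/raw/",
                "--dest", "s3://example-data-lake/raw/hdfs/",
            ],
        },
    }],
)
```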
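Near real-time ingestion through Amazon Data Firehose can be as simple as putting records onto an existing delivery stream that targets the S3 data lake. The stream name and event payload below are assumptions.

```python
import json
import boto3

# Minimal sketch: send a device event to an existing Amazon Data Firehose
# delivery stream that delivers to the S3 data lake. Stream name is a placeholder.
firehose = boto3.client("firehose", region_name="eu-central-1")

event = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2024-01-01T12:00:00Z"}

firehose.put_record(
    DeliveryStreamName="device-events-to-data-lake",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```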
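Training against data in the S3 data lake follows the usual SageMaker pattern of pointing a training job at S3 input and output prefixes. The sketch below uses the SageMaker Python SDK with the built-in XGBoost image as an illustrative choice; the execution role, bucket, and prefixes are assumptions.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Minimal sketch: train a model on CSV data stored in the S3 data lake using the
# built-in XGBoost container. Role, bucket, and prefixes are placeholders.
session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::111122223333:role/sagemaker-execution-role"

image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-data-lake/models/",   # trained model artifacts land here
    hyperparameters={"objective": "reg:squarederror", "num_round": "100"},
)

estimator.fit({
    "train": TrainingInput("s3://example-data-lake/curated/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://example-data-lake/curated/validation/", content_type="text/csv"),
})
```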
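Finally, cataloging and querying the data lake can be sketched as a Glue crawler that populates the Data Catalog, followed by an Athena query with standard SQL. The crawler role, database, bucket, and table names are assumptions.

```python
import boto3

# Minimal sketch: catalog the S3 data lake with an AWS Glue crawler, then query it
# with Amazon Athena. Role, database, bucket, and table names are placeholders.
glue = boto3.client("glue", region_name="eu-central-1")
athena = boto3.client("athena", region_name="eu-central-1")

glue.create_crawler(
    Name="data-lake-crawler",
    Role="arn:aws:iam::111122223333:role/glue-crawler-role",
    DatabaseName="data_lake",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/"}]},
)
glue.start_crawler(Name="data-lake-crawler")

# Once the crawler has populated the Data Catalog, query the table with standard SQL.
query = athena.start_query_execution(
    QueryString="SELECT device_id, avg(temperature) FROM events GROUP BY device_id",
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
print("Athena query started:", query["QueryExecutionId"])
```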
The data residency requirement allows you to collect, store, and process data in the on-premises data center first, and then copy, export, or back up that data to the AWS Cloud (REQ1).
The Amazon S3 data lake provides a permissions model to control access to stored data and to the metadata that describes that data. This approach addresses the data access controls requirement (REQ3).
Using an Amazon S3 data lake to build a highly available and scalable data lake solution addresses the data availability and durability requirement (REQ4).
For more information about data lakes and analytics on AWS, refer to Analytics on AWS.