Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility
AWS Whitepaper

The Broader Analytics Portfolio

The power of a data lake built on AWS is that data assets are ingested and stored in one massively scalable, low-cost, performant platform, and that data discovery, transformation, and SQL querying can all be done in place using AWS services like AWS Glue, Amazon Athena, and Amazon Redshift Spectrum. In addition, a wide variety of other AWS services can be directly integrated with Amazon S3 to create sophisticated analytics, machine learning, and artificial intelligence (AI) data processing pipelines. This allows you to quickly solve a wide range of analytics business challenges on a single platform, against common data assets, without having to provision hardware or install and configure complex software packages before loading data and performing analytics. You also pay only for what you consume. Some of the most common AWS services that can be used with data assets in an Amazon S3-based data lake are described next.

Amazon EMR

Amazon EMR is a highly distributed computing framework used to process data quickly, easily, and cost-effectively. Amazon EMR uses Apache Hadoop, an open source framework, to distribute data and processing across an elastically resizable cluster of EC2 instances, and it allows you to use all the common Hadoop tools such as Hive, Pig, Spark, and HBase. Amazon EMR does all the heavy lifting involved with provisioning, managing, and maintaining the infrastructure and software of a Hadoop cluster, and is integrated directly with Amazon S3. With Amazon EMR, you can launch a persistent cluster that stays up indefinitely or a temporary cluster that terminates after the analysis is complete. In either scenario, you only pay for the hours the cluster is up. Amazon EMR supports a variety of EC2 instance types, encompassing general purpose, compute optimized, memory optimized, and storage I/O optimized instances (e.g., T2, C4, X1, and I3), and all Amazon EC2 pricing options (On-Demand, Reserved, and Spot). When you launch an EMR cluster (also called a job flow), you choose how many and what type of EC2 instances to provision. Companies with many different lines of business and a large number of users can build a single data lake solution, store their data assets in Amazon S3, and then spin up multiple EMR clusters to share data assets in a multi-tenant fashion.
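As an illustration of the temporary-cluster option described above, the following Python sketch builds the parameters for the EMR RunJobFlow API (exposed in boto3 as run_job_flow). It is a minimal sketch rather than a recommended configuration: the bucket name, script path, release label, instance type, and step name are all placeholders chosen for this example, and the default EMR role names are assumed to already exist in the account.

```python
def build_transient_cluster_request(name, log_uri, script_uri,
                                    core_count=2, use_spot=False):
    """Return keyword arguments for the EMR RunJobFlow API (boto3: run_job_flow)."""
    market = "SPOT" if use_spot else "ON_DEMAND"
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",  # assumed release label; choose a current one
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge",  # assumed instance type
                 "InstanceCount": 1, "Market": "ON_DEMAND"},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge",
                 "InstanceCount": core_count, "Market": market},
            ],
            # False makes the cluster temporary: it terminates once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "analyze-data-lake",  # hypothetical step name
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default roles, assumed to exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request, region="us-east-1"):
    """Submit the request. Requires boto3 and AWS credentials; not called here."""
    import boto3
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


# Placeholder S3 locations inside a hypothetical data lake bucket.
request = build_transient_cluster_request(
    name="nightly-report",
    log_uri="s3://example-data-lake/emr-logs/",
    script_uri="s3://example-data-lake/jobs/report.py",
    core_count=4,
    use_spot=True,
)
```

Passing the result to launch(request) would start the cluster. Setting KeepJobFlowAliveWhenNoSteps to True instead would give the persistent variant, and the Market field on each instance group selects between On-Demand and Spot capacity.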

Amazon Machine Learning

Machine learning is another important data lake use case. Amazon Machine Learning (Amazon ML) is a data lake service that makes it easy for anyone to use predictive analytics and machine learning technology. Amazon ML provides visualization tools and wizards to guide you through the process of creating ML models without having to learn complex algorithms and technology. After the models are ready, Amazon ML makes it easy to obtain predictions for your application using API operations. You don’t have to implement custom prediction generation code or manage any infrastructure. Amazon ML can create ML models based on data stored in Amazon S3, Amazon Redshift, or Amazon RDS. Built-in wizards guide you through the steps of interactively exploring your data, training the ML model, evaluating the model quality, and adjusting outputs to align with business goals. After a model is ready, you can request predictions either in batches or by using the low-latency real-time API. As discussed earlier in this paper, a data lake built on AWS greatly enhances machine learning capabilities by combining Amazon ML with large historical data sets that can be stored cost-effectively on Amazon Glacier and easily recalled when needed to train new ML models.
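To make the real-time prediction path more concrete, here is a small Python sketch built around the Amazon ML Predict operation (boto3 client name "machinelearning"). The model ID, endpoint URL, and feature names are hypothetical placeholders, and the response handling assumes the documented shape in which classification models return predictedLabel and regression models return predictedValue.

```python
def to_record(features):
    """Convert feature values to strings, since Predict takes a string-to-string map."""
    return {str(name): str(value)
            for name, value in features.items()
            if value is not None}  # missing values are simply left out


def build_predict_request(model_id, endpoint_url, features):
    """Return keyword arguments for the Amazon ML Predict API (boto3: predict)."""
    return {
        "MLModelId": model_id,
        "Record": to_record(features),
        "PredictEndpoint": endpoint_url,
    }


def extract_prediction(response):
    """Pull the headline result out of a Predict response."""
    prediction = response["Prediction"]
    if "predictedLabel" in prediction:   # binary or multiclass model
        return prediction["predictedLabel"]
    return prediction["predictedValue"]  # regression model


def predict_realtime(request, region="us-east-1"):
    """Call the service. Requires boto3, credentials, and a real-time endpoint."""
    import boto3
    client = boto3.client("machinelearning", region_name=region)
    return extract_prediction(client.predict(**request))


# Hypothetical model ID, endpoint URL, and customer features.
request = build_predict_request(
    model_id="ml-EXAMPLEMODELID",
    endpoint_url="https://realtime.machinelearning.us-east-1.amazonaws.com",
    features={"age": 42, "plan": "basic", "last_login_days": None},
)
```

Batch predictions follow a different path (the CreateBatchPrediction operation reads a data source and writes results to Amazon S3), so this sketch covers only the low-latency case.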

Amazon QuickSight

Amazon QuickSight is a very fast, easy-to-use business analytics service that lets you build visualizations, perform ad hoc analysis, and quickly get business insights from your data assets stored in the data lake, anytime, on any device. You can use Amazon QuickSight to discover AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3, connect to any or all of these data sources and data assets, and get insights from this data in minutes. Amazon QuickSight enables organizations using the data lake to scale their business analytics capabilities to hundreds of thousands of users. It delivers fast and responsive query performance by using a robust in-memory engine (SPICE).

Amazon Rekognition

Another innovative data lake service is Amazon Rekognition, a fully managed image recognition service powered by deep learning that runs against image data assets stored in Amazon S3. Amazon Rekognition has been built by Amazon’s Computer Vision teams over many years, and already analyzes billions of images every day. Its easy-to-use API detects thousands of objects and scenes, analyzes faces, compares two faces to measure similarity, and verifies faces in a collection of faces. With Amazon Rekognition, you can easily build applications that search based on visual content in images, analyze face attributes to identify demographics, implement secure face-based verification, and more. Amazon Rekognition is built to analyze images at scale and integrates seamlessly with data assets stored in Amazon S3, as well as AWS Lambda and other key AWS services.
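The Python sketch below shows one way the object and scene detection described above might be run against an image already stored in Amazon S3, using the Rekognition DetectLabels operation (boto3: detect_labels). The bucket name, object key, and threshold values are placeholders chosen for illustration. The image is referenced by its S3 location, so the bytes are not downloaded by the caller.

```python
def build_detect_labels_request(bucket, key, max_labels=10, min_confidence=80.0):
    """Return keyword arguments for Rekognition DetectLabels (boto3: detect_labels)."""
    return {
        # Point Rekognition at the object in S3 instead of uploading image bytes.
        "Image": {"S3Object": {"Bucket": bucket, "Name": key}},
        "MaxLabels": max_labels,
        "MinConfidence": min_confidence,
    }


def label_names(response):
    """Return detected label names, ordered from most to least confident."""
    labels = sorted(response.get("Labels", []),
                    key=lambda label: label["Confidence"],
                    reverse=True)
    return [label["Name"] for label in labels]


def detect_labels(request, region="us-east-1"):
    """Call the service. Requires boto3 and AWS credentials; not called here."""
    import boto3
    client = boto3.client("rekognition", region_name=region)
    return label_names(client.detect_labels(**request))


# Placeholder bucket and key inside a hypothetical data lake.
request = build_detect_labels_request(
    bucket="example-data-lake",
    key="images/raw/storefront-001.jpg",
    min_confidence=90.0,
)
```

The same S3Object reference style applies to the face comparison operation (CompareFaces), which takes a source image and a target image plus a similarity threshold.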

These are just a few examples of powerful data processing and analytics tools that can be integrated with a data lake built on AWS. See the AWS website for more examples and for the latest list of innovative AWS services available for data lake users.