Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility
AWS Whitepaper

Publication date: July 2017

Abstract

Organizations are collecting and analyzing increasing amounts of data, making it difficult for traditional on-premises solutions for data storage, data management, and analytics to keep pace. Amazon S3 and Amazon Glacier provide an ideal storage solution for data lakes. They offer broad and deep integration with traditional big data analytics tools, as well as innovative query-in-place analytics tools that help you eliminate costly and complex extract, transform, and load (ETL) processes. This guide explains each of these options and provides best practices for building your Amazon S3-based data lake.

Introduction

As organizations collect and analyze increasing amounts of data, traditional on-premises solutions for data storage, data management, and analytics can no longer keep pace. Data silos that aren’t built to work well together make it difficult to consolidate storage so that you can perform comprehensive and efficient analytics. This limits an organization’s agility, its ability to derive more insights and value from its data, and its capability to adopt more sophisticated analytics tools and processes as its needs evolve.

A data lake, which is a single platform combining storage, data governance, and analytics, is designed to address these challenges. It’s a centralized, secure, and durable cloud-based storage platform that allows you to ingest and store structured and unstructured data, and transform these raw data assets as needed. You don’t need an innovation-limiting pre-defined schema. You can use a complete portfolio of data exploration, reporting, analytics, machine learning, and visualization tools on the data. A data lake makes data and the optimal analytics tools available to more users, across more lines of business. This enables them to get all of the business insights they need, whenever they need them.
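The idea of storing data without a predefined schema can be illustrated with a minimal sketch. It uses plain Python rather than any AWS service, and the record contents and the `read_with_schema` helper are hypothetical names introduced only for this illustration. Raw records are kept exactly as they arrive, and each consumer applies its own schema when it reads them.

```python
import json

# Hypothetical raw records as they might land in a data lake: stored as-is,
# with no schema enforced at write time. Fields vary between records.
RAW_OBJECTS = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2017-07-01T00:00:00Z"}',
    '{"device": "sensor-2", "temp_c": 19.0}',
    '{"user": "alice", "action": "login", "ts": "2017-07-01T00:05:00Z"}',
]


def read_with_schema(raw_objects, fields):
    """Apply a schema at read time: keep only records that contain every
    requested field, and project each one onto those fields."""
    rows = []
    for raw in raw_objects:
        record = json.loads(raw)
        if all(name in record for name in fields):
            rows.append({name: record[name] for name in fields})
    return rows


# Two consumers read the same raw data through different schemas.
temperature_rows = read_with_schema(RAW_OBJECTS, ["device", "temp_c"])
audit_rows = read_with_schema(RAW_OBJECTS, ["user", "action", "ts"])
```

Because the schema is chosen by the reader, a new consumer can ask a different question of the same raw data without the stored records having to be reloaded or reshaped first.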

Until recently, the data lake had been more concept than reality. However, Amazon Web Services (AWS) has developed a data lake architecture that allows you to build data lake solutions cost-effectively using Amazon Simple Storage Service (Amazon S3) and other services.

Using the Amazon S3-based data lake architecture capabilities, you can do the following:

  • Ingest and store data from a wide variety of sources into a centralized platform.

  • Build a comprehensive data catalog to find and use data assets stored in the data lake.

  • Secure, protect, and manage all of the data stored in the data lake.

  • Use tools and policies to monitor, analyze, and optimize infrastructure and data.

  • Transform raw data assets in place into optimized, usable formats.

  • Query data assets in place.

  • Use a broad and deep portfolio of data analytics, data science, machine learning, and visualization tools.

  • Quickly integrate current and future third-party data-processing tools.

  • Easily and securely share processed datasets and results.

The remainder of this paper provides more information about each of these capabilities. The following figure illustrates a sample AWS data lake platform.


Figure: Sample AWS data lake platform
