Data Lake Solution

Overview

Many Amazon Web Services customers require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store and analyze data because it allows companies to store all of their data, structured and unstructured, in a centralized repository. An effective data lake should provide low-cost, scalable, and secure storage, and support search and analysis capabilities on a variety of data types.

The AWS Cloud provides many of the building blocks required to help customers implement a secure, flexible, and cost-effective data lake. To support our customers as they build data lakes, AWS offers the data lake solution, an automated reference implementation that deploys a highly available, cost-effective data lake architecture on the AWS Cloud, along with a user-friendly console for searching and requesting datasets. The solution is intended to address common customer pain points around conceptualizing data lake architectures and transforming and analyzing data. It automatically configures the core AWS services needed to tag, search, share, and govern specific subsets of data across a company or with external users. Users can catalog new datasets, upload datasets with searchable metadata, and create data profiles for existing datasets in Amazon Simple Storage Service (Amazon S3) with minimal effort.

The data lake solution stores and registers datasets of any size in their native form in Amazon S3, which provides secure, durable, highly scalable storage. Customers can upload datasets with searchable metadata and integrate with AWS Glue and Amazon Athena to transform and analyze that data.

The solution automatically crawls your data sources, identifies data formats, and then suggests schemas and transformations, so you don’t have to spend time hand-coding data flows. Additionally, user-defined tags are stored in Amazon DynamoDB to add business-relevant context to each dataset. The solution enables companies to create simple governance policies to require specific tags when datasets are registered with the data lake. Users can browse available datasets or search on dataset attributes and tags to quickly find and access data relevant to their business needs.
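
The tag governance idea described above can be illustrated with a short Python sketch. It is not the solution's actual implementation: the Dataset class, the function names, and the required tag keys are all hypothetical, and an in-memory list stands in for the DynamoDB-backed catalog. The sketch only shows the general pattern of rejecting a registration that lacks required tags and then searching registered datasets by tag.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Dataset:
    """A registered dataset and its user-defined tags (hypothetical shape)."""
    name: str
    tags: Dict[str, str] = field(default_factory=dict)


def missing_required_tags(dataset: Dataset, required_tags: Set[str]) -> List[str]:
    """Return the required tag keys that are absent or blank on the dataset."""
    return sorted(
        key for key in required_tags
        if not dataset.tags.get(key, "").strip()
    )


def register(catalog: List[Dataset], dataset: Dataset, required_tags: Set[str]) -> None:
    """Add the dataset to the catalog only if it satisfies the tag policy."""
    missing = missing_required_tags(dataset, required_tags)
    if missing:
        raise ValueError(f"{dataset.name}: missing required tags {missing}")
    catalog.append(dataset)


def search_by_tag(catalog: List[Dataset], key: str, value: str) -> List[str]:
    """Return the names of datasets whose tag `key` equals `value`."""
    return [d.name for d in catalog if d.tags.get(key) == value]


# Hypothetical governance policy: every dataset needs these two tags.
REQUIRED_TAGS = {"owner", "classification"}

# In-memory stand-in for the catalog (an assumption for illustration only).
catalog: List[Dataset] = []
register(
    catalog,
    Dataset("sales-2023", {"owner": "finance", "classification": "internal"}),
    REQUIRED_TAGS,
)
```

In this sketch, registering a dataset that lacks the "classification" tag raises a ValueError, and search_by_tag(catalog, "owner", "finance") returns the names of the matching datasets.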

The data lake solution also includes a federated template that launches a version of the solution ready to integrate with your existing SAML identity provider, such as Microsoft Active Directory. For more information, see Appendix A.

Cost

You are responsible for the cost of the AWS services used while running the data lake solution. The total cost for running this solution depends on the amount of data being loaded, requested, stored, processed, and presented. For full details, see the pricing webpage for each AWS service you will be using in this solution.
