Data Lake Solution

Many Amazon Web Services customers require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store and analyze data because it allows companies to store all of their data, structured and unstructured, in a centralized repository. An effective data lake should provide low-cost, scalable, and secure storage, and support search and analysis capabilities on a variety of data types.

The AWS Cloud provides many of the building blocks required to help customers implement a secure, flexible, and cost-effective data lake. To support our customers as they build data lakes, AWS offers the data lake solution, which is an automated reference implementation that deploys a highly available, cost-effective data lake architecture on the AWS Cloud along with a user-friendly console for searching and requesting datasets. The solution is intended to address common customer pain points around conceptualizing data lake architectures, and automatically configures the core AWS services necessary to easily tag, search, share, and govern specific subsets of data across a company or with other external users. This solution allows users to catalog new datasets, and to create data profiles for existing datasets in Amazon Simple Storage Service (Amazon S3) with minimal effort.
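The governance behavior described above — requiring specific tags before a dataset can be registered — can be sketched in a few lines. This is an illustrative outline only: the policy shape, tag names, and function are hypothetical, not the solution's actual API, and the real solution persists the resulting record to DynamoDB.

```python
# Illustrative sketch of tag governance at registration time.
# REQUIRED_TAGS is a hypothetical policy, not the solution's defaults.
REQUIRED_TAGS = {"owner", "classification"}

def validate_registration(dataset_name, tags):
    """Return a catalog entry for the dataset, or raise if required tags are missing."""
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        raise ValueError(
            f"{dataset_name}: missing required tags: {sorted(missing)}")
    # In the real solution, this record would be written to DynamoDB,
    # with the dataset itself stored in its native form in Amazon S3.
    return {"dataset": dataset_name, "tags": dict(tags)}

entry = validate_registration(
    "clickstream-2016", {"owner": "analytics", "classification": "internal"})
print(entry["dataset"])
```

A registration call that omits a required tag would raise an error instead of creating a catalog entry, which is the essence of the simple governance policies the solution supports.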

The data lake solution stores and registers datasets of any size in their native form in Amazon S3, which provides secure, durable, highly scalable storage. Additionally, user-defined tags are stored in Amazon DynamoDB to add business-relevant context to each dataset. The solution enables companies to create simple governance policies that require specific tags when datasets are registered with the data lake. Users can browse available datasets, or search on dataset attributes and tags, to quickly find and access data relevant to their business needs. The solution keeps track of the datasets a user selects in a cart (similar to an online shopping cart) and then generates a manifest file with secure access links to the desired content when the user checks out.
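The checkout manifest described above can be sketched as follows. The manifest field names and the `fake_presign` stand-in are assumptions for illustration; the actual solution generates time-limited S3 presigned URLs (for example, via the AWS SDK) rather than the placeholder links shown here.

```python
import json
import time

def build_manifest(cart_items, presign, expires_in=3600):
    """Build a checkout manifest for the datasets in a user's cart.

    cart_items: list of {"bucket": ..., "key": ...} dicts.
    presign:    callable returning a time-limited access link for a
                bucket/key pair (the real solution uses S3 presigned URLs).
    """
    now = int(time.time())
    entries = []
    for item in cart_items:
        entries.append({
            "bucket": item["bucket"],
            "key": item["key"],
            "accessLink": presign(item["bucket"], item["key"], expires_in),
            "expiresAt": now + expires_in,
        })
    return json.dumps({"manifestVersion": "1", "entries": entries}, indent=2)

# Stand-in signer for illustration only; a real deployment would produce
# a cryptographically signed S3 presigned URL instead.
def fake_presign(bucket, key, expires_in):
    return f"https://{bucket}.s3.amazonaws.com/{key}?X-Amz-Expires={expires_in}"

manifest = build_manifest(
    [{"bucket": "my-datalake", "key": "sales/2016/q1.csv"}], fake_presign)
print(manifest)
```

Because the links expire, a checked-out manifest grants only temporary access to the selected datasets, which keeps the underlying S3 objects private.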


Cost

You are responsible for the cost of the AWS services used while running this reference deployment. As of the date of publication, the cost of running the data lake solution with default settings in the US East (N. Virginia) Region is less than $1 per hour. This estimate reflects charges for Amazon API Gateway, AWS Lambda, Amazon DynamoDB, and Amazon Elasticsearch Service (Amazon ES).

This cost does not include variable charges for data storage and outbound data transfer from Amazon S3 and Amazon CloudWatch Logs for data that the solution manages. For full details, see the pricing webpage for each AWS service used in this solution.