Andres Cantor, Amazon Web Services
April 2025 (document history)
This guide helps you create a consistent naming standard for Amazon Simple Storage Service (Amazon S3) buckets and paths in data lakes hosted on the AWS Cloud. The naming standard helps you improve governance and observability in your data lakes, identify costs by data layer and AWS account, and apply a consistent approach to naming AWS Identity and Access Management (IAM) roles and policies.
We recommend that you use at least three data layers in your data lakes and that each layer uses a separate Amazon S3 bucket. However, some use cases might require an additional Amazon S3 bucket and data layer, depending on the data types that you generate and store. For example, if you store sensitive data, we recommend that you use a landing zone data layer and a separate Amazon S3 bucket. The following list describes the three recommended data layers for your data lake:
- Raw data layer – Contains raw data and is the layer in which data is initially ingested. If possible, we recommend that you retain the original file format and turn on versioning in the Amazon S3 bucket.
- Stage data layer – Contains intermediate, processed data that is optimized for consumption (for example, raw CSV files converted to Apache Parquet, or other data transformations). An AWS Glue job reads the files from the raw layer and validates the data. The AWS Glue job then stores the data in an Apache Parquet-formatted file, and the metadata is stored in a table in the AWS Glue Data Catalog.
- Analytics data layer – Contains the aggregated data for your specific use cases in a consumption-ready format, such as Apache Parquet.
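The one-bucket-per-layer recommendation above can be sketched as a small naming helper. The pattern `{layer}-{account}-{region}` and the account ID `111122223333` are hypothetical illustrations, not the guide's prescribed format; they simply show how a layer, AWS account ID, and AWS Region might be composed into bucket names that satisfy Amazon S3 naming rules (3–63 characters, lowercase):

```python
def bucket_name(layer: str, account_id: str, region: str) -> str:
    """Compose an S3 bucket name from a data layer, account ID, and Region.

    The pattern used here is a hypothetical example; S3 bucket names must
    be globally unique, 3-63 characters long, and lowercase.
    """
    name = f"{layer}-{account_id}-{region}"
    if not 3 <= len(name) <= 63:
        raise ValueError(f"bucket name length out of range: {name}")
    return name

# One bucket per recommended data layer
layers = ["raw", "stage", "analytics"]
names = [bucket_name(layer, "111122223333", "us-east-1") for layer in layers]
```

Embedding the layer, account ID, and Region in each name makes a bucket's purpose and owning account visible at a glance, which supports the cost and governance outcomes described later in this guide.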
Intended audience
This guide's recommendations are based on the authors' experience in implementing data lakes with the serverless data lake framework (SDLF).
Targeted business outcomes
You should expect the following outcomes after implementing a naming standard for Amazon S3 buckets and paths in data lakes on the AWS Cloud:
- Improved governance in your data lake through differentiated access policies for each bucket
- Increased visibility into your overall costs for individual AWS accounts by using the relevant AWS account ID in the Amazon S3 bucket name, and for data layers by using cost allocation tags for the buckets
- More cost-effective data storage through layer-based versioning and path-based lifecycle policies
- Compliance with security requirements for data masking and data encryption
- Simplified data source tracing through improved developer visibility into the AWS Region and AWS account of the underlying data storage
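The path-based lifecycle policies mentioned above can be expressed as an S3 lifecycle configuration scoped to a key prefix. The following is a minimal sketch; the bucket name, the `sales/` prefix, and the 90-day transition window are assumptions chosen for illustration, not values prescribed by this guide:

```python
# Hypothetical path-based lifecycle rule: transition objects under the
# "sales/" prefix of a raw-layer bucket to S3 Glacier after 90 days.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-raw-sales",
            "Filter": {"Prefix": "sales/"},  # path-based scope
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With boto3, this configuration would be applied as follows
# (call commented out because it requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="raw-111122223333-us-east-1",
#     LifecycleConfiguration=lifecycle_config,
# )
```

Because rules filter on key prefixes, a consistent path standard lets you archive or expire each dataset independently within a layer's bucket.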