Andres Cantor, Amazon Web Services
April 2025 (document history)
This guide helps you create a consistent naming standard for Amazon Simple Storage Service (Amazon S3) buckets and paths in data lakes hosted on the AWS Cloud. The naming standard helps you improve governance and observability in your data lakes, identify costs by data layer and AWS account, and apply a consistent approach to naming AWS Identity and Access Management (IAM) roles and policies.
We recommend that you use at least three data layers in your data lakes and that each layer uses a separate Amazon S3 bucket. However, some use cases might require an additional Amazon S3 bucket and data layer, depending on the data types that you generate and store. For example, if you store sensitive data, we recommend that you use a landing zone data layer and a separate Amazon S3 bucket. The following list describes the three recommended data layers for your data lake:
- Raw data layer – Contains raw data and is the layer in which data is initially ingested. If possible, we recommend that you retain the original file format and turn on versioning in the Amazon S3 bucket.
- Stage data layer – Contains intermediate, processed data that is optimized for consumption (for example, raw CSV files converted to Apache Parquet, or other data transformations). An AWS Glue job reads the files from the raw layer and validates the data. The AWS Glue job then stores the data in an Apache Parquet-formatted file, and the metadata is stored in a table in the AWS Glue Data Catalog.
- Analytics data layer – Contains the aggregated data for your specific use cases in a consumption-ready format, such as Apache Parquet.
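The one-bucket-per-layer recommendation above can be sketched as a small naming helper. The pattern `{layer}-{account}-{region}` and the account ID `111122223333` are hypothetical illustrations, not the guide's prescribed format; they simply show how a layer, AWS account ID, and AWS Region might be composed into bucket names that satisfy Amazon S3 naming rules (3–63 characters, lowercase):

```python
def bucket_name(layer: str, account_id: str, region: str) -> str:
    """Compose an S3 bucket name from a data layer, account ID, and Region.

    The pattern used here is a hypothetical example; S3 bucket names must
    be globally unique, 3-63 characters long, and lowercase.
    """
    name = f"{layer}-{account_id}-{region}"
    if not 3 <= len(name) <= 63:
        raise ValueError(f"bucket name length out of range: {name}")
    return name

# One bucket per recommended data layer
layers = ["raw", "stage", "analytics"]
names = [bucket_name(layer, "111122223333", "us-east-1") for layer in layers]
```

Embedding the layer, account ID, and Region in each name makes a bucket's purpose and owning account visible at a glance, which supports the cost and governance outcomes described later in this guide.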
Intended audience
This guide's recommendations are based on the authors' experience in implementing data lakes with the serverless data lake framework (SDLF).
Targeted business outcomes
You should expect the following outcomes after implementing a naming standard for Amazon S3 buckets and paths in data lakes on the AWS Cloud:
- Improved governance in your data lake through differentiated access policies for each bucket
- Increased visibility into your overall costs for individual AWS accounts by using the relevant AWS account ID in the Amazon S3 bucket name, and for data layers by using cost allocation tags for the buckets
- More cost-effective data storage through layer-based versioning and path-based lifecycle policies
- Compliance with security requirements for data masking and data encryption
- Simplified data source tracing through improved developer visibility into the AWS Region and AWS account of the underlying data storage
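The path-based lifecycle policies mentioned above can be expressed as an S3 lifecycle configuration scoped to a key prefix. The following is a minimal sketch; the bucket name, the `sales/` prefix, and the 90-day transition window are assumptions chosen for illustration, not values prescribed by this guide:

```python
# Hypothetical path-based lifecycle rule: transition objects under the
# "sales/" prefix of a raw-layer bucket to S3 Glacier after 90 days.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-raw-sales",
            "Filter": {"Prefix": "sales/"},  # path-based scope
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With boto3, this configuration would be applied as follows
# (call commented out because it requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="raw-111122223333-us-east-1",
#     LifecycleConfiguration=lifecycle_config,
# )
```

Because rules filter on key prefixes, a consistent path standard lets you archive or expire each dataset independently within a layer's bucket.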