
Defining Amazon S3 bucket and path names for data lake layers

AWS Prescriptive Guidance

Andres Cantor, Amazon Web Services

April 2025 (document history)

This guide helps you create a consistent naming standard for Amazon Simple Storage Service (Amazon S3) buckets and paths in data lakes hosted on the AWS Cloud. The naming standard helps you improve governance and observability in your data lakes and identify costs by data layer and AWS account, and it provides an approach for naming AWS Identity and Access Management (IAM) roles and policies.

We recommend that you use at least three data layers in your data lakes and that each layer uses a separate Amazon S3 bucket. However, some use cases might require an additional Amazon S3 bucket and data layer, depending on the data types that you generate and store. For example, if you store sensitive data, we recommend that you use a landing zone data layer and a separate Amazon S3 bucket. The following list describes the three recommended data layers for your data lake:

  • Raw data layer – Contains raw data and is the layer in which data is initially ingested. If possible, we recommend that you retain the original file format and turn on versioning in the Amazon S3 bucket.

  • Stage data layer – Contains intermediate, processed data that is optimized for consumption (for example, raw CSV files converted to Apache Parquet, or transformed data). An AWS Glue job reads the files from the raw layer and validates the data. The AWS Glue job then stores the data in an Apache Parquet-formatted file, and the metadata is stored in a table in the AWS Glue Data Catalog.

  • Analytics data layer – Contains the aggregated data for your specific use cases in a consumption-ready format, such as Apache Parquet.
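To make the layer separation concrete, the following sketch builds one bucket name per data layer. The pattern shown, `<prefix>-<account-id>-<region>-<layer>`, is a hypothetical illustration based on the outcomes this guide targets (account ID and Region visible in the bucket name), not a standard the guide prescribes; adapt it to your organization's conventions.

```python
# Hypothetical naming helper for layer-specific S3 buckets.
# The <prefix>-<account-id>-<region>-<layer> pattern is an assumption
# for illustration; substitute your organization's own standard.

LAYERS = ("raw", "stage", "analytics")

def bucket_name(prefix: str, account_id: str, region: str, layer: str) -> str:
    if layer not in LAYERS:
        raise ValueError(f"unknown data layer: {layer}")
    name = f"{prefix}-{account_id}-{region}-{layer}".lower()
    # S3 bucket names must be between 3 and 63 characters long.
    if not 3 <= len(name) <= 63:
        raise ValueError(f"bucket name length out of range: {name}")
    return name

print(bucket_name("datalake", "111122223333", "us-east-1", "raw"))
# datalake-111122223333-us-east-1-raw
```

Keeping the layer as the final name segment makes it easy to apply differentiated IAM policies and cost allocation tags per layer.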

Intended audience

This guide's recommendations are based on the authors' experience in implementing data lakes with the serverless data lake framework (SDLF) and are intended for data architects, data engineers, or solutions architects who want to set up a data lake on the AWS Cloud. However, make sure that you adapt this guide's approach to meet your organization's policies and requirements.


Targeted business outcomes

You should expect the following outcomes after implementing a naming standard for Amazon S3 buckets and paths in data lakes on the AWS Cloud:

  • Improved governance in your data lake by being able to provide differentiated access policies to the buckets

  • Increased visibility into your overall costs for individual AWS accounts by using the relevant AWS account ID in the Amazon S3 bucket name and for data layers by using cost allocation tags for the buckets

  • More cost-effective data storage by using layer-based versioning and path-based lifecycle policies

  • Simplified compliance with security requirements for data masking and data encryption

  • Simplified data source tracing through improved developer visibility into the AWS Region and AWS account of the underlying data storage
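As a sketch of the path-based lifecycle policies mentioned above, the following helper builds an S3 lifecycle rule that archives objects under a given path prefix. The path layout (`<source-system>/<table>/`) and the transition settings are illustrative assumptions, not values from this guide; the commented boto3 call shows how such a rule would be applied.

```python
# Sketch of a path-based S3 lifecycle rule. The path layout
# (e.g. "sales/orders/") and the 90-day Glacier transition are
# hypothetical values for illustration.

def lifecycle_rule(prefix: str, days_to_glacier: int) -> dict:
    """Build a lifecycle rule that transitions objects under the given
    path prefix to the S3 Glacier Flexible Retrieval storage class."""
    return {
        "ID": f"archive-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days_to_glacier, "StorageClass": "GLACIER"},
        ],
    }

# Applying the rule to a layer bucket would use boto3, for example:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="datalake-111122223333-us-east-1-raw",  # hypothetical name
#     LifecycleConfiguration={"Rules": [lifecycle_rule("sales/orders/", 90)]},
# )
```

Because the rule filters on a path prefix rather than the whole bucket, each source system or table under a layer can age out on its own schedule.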

