Best practices

We recommend that you follow storage and technical best practices. These best practices can help you get the most out of your data-centric architecture.

Storage best practices for big data

The following table describes common best practices for storing files for a big data processing workload on Amazon S3. The last column shows an example lifecycle policy that you can set for each data layer. If Amazon S3 Intelligent-Tiering is enabled (which delivers automatic storage cost savings when data access patterns change), you don't have to set the policy manually.

| Data layer name | Description | Example lifecycle policy strategy |
| --- | --- | --- |
| Raw | Contains raw, unprocessed data. Note: For an external data source, the raw data layer is typically a 1:1 copy of the data, but on AWS the data can be partitioned by keys (such as AWS Region or date) during the ingestion process. | After one year, move files into the S3 Standard-IA storage class. After two years in S3 Standard-IA, archive the files in Amazon Simple Storage Service Glacier (Amazon S3 Glacier). |
| Stage | Contains intermediate processed data that's optimized for consumption (for example, raw CSV files converted to Apache Parquet, or other data transformations). | Delete the data after a defined time period or according to your organization's requirements. You can remove some data derivatives (for example, an Apache Avro transform of an original JSON format) from the data lake after a shorter period (for example, after 90 days). |
| Analytics | Contains the aggregated data for your specific use cases in a consumption-ready format (for example, Apache Parquet). | Move the data to S3 Standard-IA, and then delete it after a defined time period or according to your organization's requirements. |
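For example, you can implement the raw layer strategy in the preceding table as an S3 lifecycle rule. The following Python (Boto3) sketch assumes a hypothetical bucket name and a raw/ prefix, and translates the one-year and two-year transitions into day counts.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix. The transitions mirror the raw layer
# strategy: S3 Standard-IA after 1 year (365 days), then Amazon S3 Glacier
# after 2 more years in Standard-IA (1,095 days total).
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-layer-tiering",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 365, "StorageClass": "STANDARD_IA"},
                    {"Days": 1095, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```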

The following diagram shows an example of a partitioning strategy (where each partition corresponds to one S3 folder, or prefix) that you can use across all the data layers. We recommend that you choose a partitioning strategy based on how your data is used downstream. For example, if reports are built on your data and the most common queries filter results by Region and date, include Region and date as partition keys to improve query performance and runtime.

Partitioning strategy diagram
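To produce this kind of layout, you can write partitioned output from Apache Spark. The following PySpark sketch is a minimal example; the bucket, paths, and partition columns (region, year, month) are hypothetical and should match the filters that your downstream queries use.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-example").getOrCreate()

# Hypothetical source data in the raw layer
df = spark.read.json("s3://example-data-lake/raw/sales/")

# Writes one S3 prefix per partition value, for example:
# s3://example-data-lake/stage/sales/region=eu-west-1/year=2024/month=06/
(df.write
    .mode("overwrite")
    .partitionBy("region", "year", "month")
    .parquet("s3://example-data-lake/stage/sales/"))
```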

Technical best practices

Technical best practices depend on the specific AWS services and processing technologies that you use to design your data-centric architecture. However, we recommend that you keep in mind the following best practices, which apply to typical data processing use cases.

| Area | Best practice |
| --- | --- |
| SQL | Reduce the amount of data that must be queried by projecting attributes on your data. Instead of scanning the entire table, use data projection to scan and return only the required columns. Avoid large joins where possible, because joins between multiple tables are resource intensive and can significantly impact performance. |
| Apache Spark | See Optimize Spark applications with workload partitioning in AWS Glue and Optimize memory management in AWS Glue (AWS Big Data blog). |
| Database design | Follow the Architecture Best Practices for Databases (AWS Architecture Center). |
| Data pruning | Use server-side partition pruning with the catalogPartitionPredicate option (see the sketch after this table). |
| Scaling | Understand and implement horizontal scaling. |
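The attribute projection and server-side partition pruning practices can be combined in an AWS Glue job. In the following sketch, the Data Catalog database (example_db), table (sales), partition keys (region, year), and column names are hypothetical; catalogPartitionPredicate is evaluated against the AWS Glue Data Catalog so that only matching partitions are listed and read.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Server-side partition pruning: only partitions that match the predicate
# are listed in the Data Catalog and read from Amazon S3.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="sales",
    additional_options={
        "catalogPartitionPredicate": "region='eu-west-1' AND year='2024'"
    },
)

# Attribute projection: return only the columns the query needs
# instead of scanning every field in the table.
projected = frame.select_fields(["order_id", "amount", "region"])
```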