Best practices

We recommend that you follow storage and technical best practices. These best practices can help you get the most out of your data-centric architecture.

Storage best practices for big data

The following table describes common best practices for storing files for a big data processing workload on Amazon S3. The last column shows an example lifecycle policy that you can set for each data layer. If Amazon S3 Intelligent-Tiering is enabled (which delivers automatic storage cost savings when data access patterns change), you don't have to set the policy manually.

| Data layer name | Description | Example lifecycle policy strategy |
| --- | --- | --- |
| Raw | Contains raw, unprocessed data. Note: For an external data source, the raw data layer is typically a 1:1 copy of the data, but on AWS the data can be partitioned by keys (such as AWS Region or date) during the ingestion process. | After one year, move files into the S3 Standard-IA storage class. After two years in S3 Standard-IA, archive the files in Amazon Simple Storage Service Glacier (Amazon S3 Glacier). |
| Stage | Contains intermediate processed data that's optimized for consumption (for example, raw CSV files converted to Apache Parquet, or other data transformations). | Delete the data after a defined time period or according to your organization's requirements. You can remove some data derivatives (for example, an Apache Avro transform of an original JSON format) from the data lake after a shorter period (for example, after 90 days). |
| Analytics | Contains the aggregated data for your specific use cases in a consumption-ready format (for example, Apache Parquet). | Move the data to S3 Standard-IA, and then delete it after a defined time period or according to your organization's requirements. |
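For example, you can implement the raw layer strategy in the preceding table as an S3 lifecycle rule. The following Python (Boto3) sketch assumes a hypothetical bucket name and a raw/ prefix, and translates the one-year and two-year transitions into day counts.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix. The transitions mirror the raw layer
# strategy: S3 Standard-IA after 1 year (365 days), then Amazon S3 Glacier
# after 2 more years in Standard-IA (1,095 days total).
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-layer-tiering",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 365, "StorageClass": "STANDARD_IA"},
                    {"Days": 1095, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```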

The following diagram shows an example of a partitioning strategy (where each partition corresponds to one S3 folder, or prefix) that you can use across all the data layers. We recommend that you choose a partitioning strategy based on how your data is used downstream. For example, if reports are built on your data and the most common queries filter results by Region and date, include Region and date as partition keys to improve query performance and runtime.

Partitioning strategy diagram
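To produce this kind of layout, you can write partitioned output from Apache Spark. The following PySpark sketch is a minimal example; the bucket, paths, and partition columns (region, year, month) are hypothetical and should match the filters that your downstream queries use.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-example").getOrCreate()

# Hypothetical source data in the raw layer
df = spark.read.json("s3://example-data-lake/raw/sales/")

# Writes one S3 prefix per partition value, for example:
# s3://example-data-lake/stage/sales/region=eu-west-1/year=2024/month=06/
(df.write
    .mode("overwrite")
    .partitionBy("region", "year", "month")
    .parquet("s3://example-data-lake/stage/sales/"))
```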

Technical best practices

Technical best practices depend on the specific AWS services and processing technologies that you use to design your data-centric architecture. However, we recommend that you keep in mind the following best practices, which apply to typical data processing use cases.

| Area | Best practice |
| --- | --- |
| SQL | Reduce the amount of data that must be queried by projecting attributes on your data. Instead of scanning the entire table, use data projection to scan and return only the required columns. Avoid large joins where possible, because joins between multiple tables are resource intensive and can significantly impact performance. |
| Apache Spark | See Optimize Spark applications with workload partitioning in AWS Glue and Optimize memory management in AWS Glue (AWS Big Data blog). |
| Database design | Follow the Architecture Best Practices for Databases (AWS Architecture Center). |
| Data pruning | Use server-side partition pruning with the catalogPartitionPredicate option (see the sketch after this table). |
| Scaling | Understand and implement horizontal scaling. |
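The attribute projection and server-side partition pruning practices can be combined in an AWS Glue job. In the following sketch, the Data Catalog database (example_db), table (sales), partition keys (region, year), and column names are hypothetical; catalogPartitionPredicate is evaluated against the AWS Glue Data Catalog so that only matching partitions are listed and read.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Server-side partition pruning: only partitions that match the predicate
# are listed in the Data Catalog and read from Amazon S3.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="sales",
    additional_options={
        "catalogPartitionPredicate": "region='eu-west-1' AND year='2024'"
    },
)

# Attribute projection: return only the columns the query needs
# instead of scanning every field in the table.
projected = frame.select_fields(["order_id", "amount", "region"])
```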