Best practices
We recommend that you follow storage and technical best practices. These best practices can help you get the most out of your data-centric architecture.
Storage best practices for big data
The following table describes a common best practice for storing files for a big data processing workload on Amazon S3. The last column shows an example lifecycle policy that you can set for each data layer. If you use Amazon S3 Intelligent-Tiering, Amazon S3 moves objects between access tiers automatically based on access patterns, so you might not need to define these lifecycle transitions yourself. A sketch of a lifecycle configuration for the raw layer follows the table.
| Data layer name | Description | Example lifecycle policy strategy |
| --- | --- | --- |
| Raw | Contains raw, unprocessed data. Note: For an external data source, the raw data layer is typically a 1:1 copy of the data, but on AWS the data can be partitioned by keys (such as AWS Region or date) during the ingestion process. | After one year, move files into the S3 Standard-IA storage class. After two years in S3 Standard-IA, archive the files in Amazon Simple Storage Service Glacier (Amazon S3 Glacier). |
| Stage | Contains intermediate processed data that's optimized for consumption. Example: raw CSV files converted to Apache Parquet, or other data transformations. | Delete the data after a defined time period or according to your organization's requirements. You can remove some data derivatives (for example, an Apache Avro transform of an original JSON format) from the data lake after a shorter period (for example, after 90 days). |
| Analytics | Contains aggregated data for your specific use cases in a consumption-ready format. Example: Apache Parquet. | Move the data to S3 Standard-IA, and then delete it after a defined time period or according to your organization's requirements. |
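As an illustration of the raw layer's example policy, the following minimal sketch uses the AWS SDK for Python (Boto3) to transition objects to S3 Standard-IA after one year and to S3 Glacier two years later. The bucket name (example-data-lake) and the raw/ prefix are assumptions for illustration; adapt them to your environment.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-layer-tiering",
                "Filter": {"Prefix": "raw/"},  # applies only to the raw data layer
                "Status": "Enabled",
                "Transitions": [
                    # After one year, move objects to S3 Standard-IA.
                    {"Days": 365, "StorageClass": "STANDARD_IA"},
                    # After two more years (three years total), archive to S3 Glacier.
                    {"Days": 1095, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```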
The following diagram shows an example of a partitioning strategy (where each partition corresponds to one S3 folder, or prefix) that you can use across all the data layers. We recommend that you choose a partitioning strategy based on how your data is used downstream. For example, if reports are built on your data and the most common queries filter the results by region and date, include region and date as partition keys to improve query performance and reduce runtime. A PySpark sketch of such a partitioned write follows the diagram.

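To show what such a partitioning strategy might look like in practice, the following is a minimal PySpark sketch that writes data partitioned by region and date keys. The source path, bucket name, and column names (region, year, month, day) are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Hypothetical raw-layer source; the DataFrame is assumed to contain
# region, year, month, and day columns.
df = spark.read.json("s3://example-data-lake/raw/")

# partitionBy creates one S3 prefix per partition value, for example:
# s3://example-data-lake/stage/region=us-east-1/year=2024/month=06/day=15/
(df.write
   .partitionBy("region", "year", "month", "day")
   .mode("overwrite")
   .parquet("s3://example-data-lake/stage/"))
```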
Technical best practices
Technical best practices depend on the specific AWS services and processing technologies that you use to design your data-centric architecture. However, we recommend that you keep in mind the following best practices, which apply to typical data processing use cases.
| Area | Best practice |
| --- | --- |
| SQL | Reduce the amount of data that must be queried by projecting attributes. Instead of scanning the entire table, use data projection to return only the columns that you need. Avoid large joins where possible, because joins between multiple tables are resource intensive and can significantly reduce performance. (See the first sketch after this table.) |
| Apache Spark | Optimize your Spark applications, including their memory management. (See the second sketch after this table.) |
| Database design | Follow the Architecture Best Practices for Databases. |
| Data pruning | Use server-side partition pruning so that queries scan only the partitions that they need. (See the first sketch after this table.) |
| Scaling | Understand and implement horizontal scaling. |
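To illustrate the SQL and data pruning practices in the table, the following minimal PySpark sketch projects only the columns that a report needs and filters on partition keys so that Spark reads only the matching S3 prefixes. The dataset path and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("projection-and-pruning").getOrCreate()

# Hypothetical dataset, partitioned by region and date as in the earlier sketch.
df = spark.read.parquet("s3://example-data-lake/stage/")

# Projection: select only the columns the report needs instead of the whole table.
# Pruning: filtering on the partition columns (region, year) lets Spark skip
# the S3 prefixes that don't match, instead of scanning every file.
report = (df.select("order_id", "amount", "region", "year")  # hypothetical columns
            .where((col("region") == "us-east-1") & (col("year") == 2024)))

report.show()
```

Because region and year are partition keys in this layout, the filter prunes entire S3 prefixes on the server side rather than scanning and discarding rows.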
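For the Apache Spark row, optimization typically starts with executor sizing, shuffle parallelism, and serialization. The following minimal sketch sets a few common Spark properties; the values shown are placeholders for illustration, not tuned recommendations.

```python
from pyspark.sql import SparkSession

# Placeholder values only; size these to your cluster and workload.
spark = (SparkSession.builder
         .appName("tuned-job")
         .config("spark.executor.memory", "8g")           # heap memory per executor
         .config("spark.executor.cores", "4")             # concurrent tasks per executor
         .config("spark.sql.shuffle.partitions", "200")   # parallelism for shuffles
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")  # faster serialization
         .getOrCreate())
```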