10 – Choose the best-performing file format and partitioning
How do you select the best-performing file formats and partitioning? The file format and partitioning scheme you choose for data at rest can have a large impact on the overall efficiency of an analytics workload.
ID | Priority | Best practice |
---|---|---|
☐ BP 10.1 | Recommended | Select format based on data write frequency and patterns for append-only compared to in-place update. |
☐ BP 10.2 | Recommended | Choose data formatting based on your data access pattern. |
☐ BP 10.3 | Recommended | Utilize compression techniques to both decrease storage requirements and enhance I/O efficiency. |
☐ BP 10.4 | Recommended | Partition your data to enable efficient data pruning and reduce unnecessary file reads. |
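
The effect of BP 10.3 and BP 10.4 can be illustrated with a minimal sketch that uses only the Python standard library. It is not an AWS API or a recommended implementation: gzip-compressed CSV stands in for whatever compressed format your workload uses (analytics engines more commonly read columnar formats such as Apache Parquet or ORC), and the sample events, the `event_date` partition key, and the helper names `write_partitioned` and `read_partition` are all hypothetical. The sketch writes each partition to its own `key=value` directory, then answers a single-day query by opening only that directory's files.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def write_partitioned(rows, root, partition_key):
    """Write rows as gzip-compressed CSV files under key=value directories."""
    groups = {}
    for row in rows:
        groups.setdefault(row[partition_key], []).append(row)
    for value, group in groups.items():
        part_dir = Path(root) / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition value lives in the path, so it is not repeated in the file.
        fields = [name for name in group[0] if name != partition_key]
        with gzip.open(part_dir / "part-0000.csv.gz", "wt", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


def read_partition(root, partition_key, value):
    """Read only the files in one partition; return (rows, files_opened)."""
    part_dir = Path(root) / f"{partition_key}={value}"
    rows, files_opened = [], 0
    for path in sorted(part_dir.glob("*.csv.gz")):
        files_opened += 1
        with gzip.open(path, "rt", newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows, files_opened


# Hypothetical sample data: three days of order events.
events = [
    {"event_date": "2024-01-01", "order_id": "1", "amount": "10.00"},
    {"event_date": "2024-01-01", "order_id": "2", "amount": "12.50"},
    {"event_date": "2024-01-02", "order_id": "3", "amount": "7.25"},
    {"event_date": "2024-01-03", "order_id": "4", "amount": "3.00"},
]

root = tempfile.mkdtemp()
write_partitioned(events, root, "event_date")
total_files = len(list(Path(root).rglob("*.csv.gz")))
rows, files_opened = read_partition(root, "event_date", "2024-01-02")
print(f"opened {files_opened} of {total_files} files, got {len(rows)} row(s)")
```

Because the filter value maps directly to a directory name, the reader never opens the files for the other two days. Query engines that understand this directory layout can apply the same pruning when a query filters on the partition column.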
For more details, refer to the following information:
- Amazon Redshift Database Developer Guide: Creating data files for queries in Amazon Redshift Spectrum
- Amazon EMR Release Guide: Hudi
- AWS Big Data Blog: Apply record level changes from relational databases to Amazon S3 data lake using Apache Hudi on Amazon EMR and AWS Database Migration Service