Maintaining tables by using compaction

Iceberg includes features that enable you to carry out table maintenance operations after writing data to the table. Some maintenance operations focus on streamlining metadata files, while others enhance how the data is clustered in the files so that query engines can efficiently locate the necessary information to respond to user requests. This section focuses on compaction-related optimizations.

Iceberg compaction

In Iceberg, you can use compaction to perform four tasks:

  • Combining small files into larger files that are generally over 100 MB in size. This technique is known as bin packing.

  • Merging delete files with data files. Delete files are generated by updates or deletes that use the merge-on-read approach.

  • (Re)sorting the data in accordance with query patterns. Data can be written without any sort order or with a sort order that is suitable for writes and updates.

  • Clustering the data by using space-filling curves to optimize for distinct query patterns, in particular z-order sorting.

On AWS, you can run table compaction and maintenance operations for Iceberg through Amazon Athena or by using Spark in Amazon EMR or AWS Glue.

When you run compaction by using the rewrite_data_files procedure, you can adjust several properties to control the compaction behavior. The following diagram shows the default behavior of bin packing. Understanding bin packing compaction is key to understanding hierarchical sorting and z-order sorting implementations, because they are extensions of the bin packing interface and operate in a similar manner. The main distinction is the additional step required for sorting or clustering the data.
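For example, the following Spark SQL call runs the default bin packing compaction. This is a minimal sketch: the catalog, database, and table names (glue_catalog, db, sales) are placeholders, and the commented variant only illustrates that the same procedure also accepts a sort strategy for z-order clustering on columns of your choice.

-- Default bin packing compaction (placeholder catalog and table names).
CALL glue_catalog.system.rewrite_data_files(
  table => 'db.sales'
);

-- The same procedure accepts a sort strategy, for example z-order
-- clustering on two illustrative columns:
-- CALL glue_catalog.system.rewrite_data_files(
--   table => 'db.sales',
--   strategy => 'sort',
--   sort_order => 'zorder(customer_id, order_date)'
-- );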

Default bin packing behavior in Iceberg tables

In this example, the Iceberg table consists of four partitions. Each partition has a different size and different number of files. If you start a Spark application to run compaction, the application creates a total of four file groups to process. A file group is an Iceberg abstraction that represents a collection of files that will be processed by a single Spark job. That is, the Spark application that runs compaction will create four Spark jobs to process the data.

Tuning compaction behavior

The following key properties control how data files are selected for compaction:

  • MAX_FILE_GROUP_SIZE_BYTES sets the data limit for a single file group (Spark job) at 100 GB by default. This property is especially important for tables without partitions or tables with partitions that span hundreds of gigabytes. By setting this limit, you break the work into units that can be planned and completed incrementally, which prevents resource exhaustion on the cluster.

    Note: Each file group is sorted separately. Therefore, if you want to perform a partition-level sort, you must adjust this limit to match the partition size.

  • MIN_FILE_SIZE_BYTES or MIN_FILE_SIZE_DEFAULT_RATIO defaults to 75 percent of the target file size set at the table level. For example, if a table has a target size of 512 MB, any file that is smaller than 384 MB is included in the set of files that will be compacted.

  • MAX_FILE_SIZE_BYTES or MAX_FILE_SIZE_DEFAULT_RATIO defaults to 180 percent of the target file size. As with the two properties that set minimum file sizes, these properties are used to identify candidate files for the compaction job.

  • MIN_INPUT_FILES specifies the minimum number of files to be compacted if a table partition is smaller than the target file size. Its value (5 by default) determines whether it's worthwhile to compact the files based on the file count alone.

  • DELETE_FILE_THRESHOLD specifies the minimum number of delete operations for a file before it's included in compaction. Unless you specify otherwise, compaction doesn't combine delete files with data files. To enable this functionality, you must set a threshold value by using this property. This threshold is specific to individual data files, so if you set it to 3, a data file will be rewritten only if there are three or more delete files that reference it.

These properties provide insight into the formation of the file groups in the previous diagram.

For example, the partition labeled month=01 includes two file groups because it exceeds the maximum size constraint of 100 GB. In contrast, the month=02 partition contains a single file group because it's under 100 GB. The month=03 partition doesn't satisfy the default minimum input file requirement of five files. As a result, it won't be compacted. Lastly, although the month=04 partition doesn't contain enough data to form a single file of the desired size, the files will be compacted because the partition includes more than five small files.

You can set these parameters for Spark running on Amazon EMR or AWS Glue. For Amazon Athena, you can manage similar properties by using the table properties that start with the prefix optimize_.
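As a sketch of how these properties map to the rewrite_data_files procedure in Spark, the following call passes them through the options map. The option keys are the lowercase, hyphenated forms of the properties described above; the values and the catalog and table names are placeholders, not recommendations.

CALL glue_catalog.system.rewrite_data_files(
  table => 'db.sales',
  options => map(
    'max-file-group-size-bytes', '107374182400',  -- 100 GB per file group (the default)
    'min-input-files', '5',                       -- default minimum file count
    'delete-file-threshold', '3'                  -- rewrite a data file when 3 or more delete files reference it
  )
);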

Running compaction with Spark on Amazon EMR or AWS Glue

This section describes how to properly size a Spark cluster to run Iceberg's compaction utility. The following example uses Amazon EMR Serverless, but you can use the same methodology in Amazon EMR on Amazon EC2 or Amazon EKS, or in AWS Glue.

You can take advantage of the correlation between file groups and Spark jobs to plan the cluster resources. To process the file groups sequentially, considering the maximum size of 100 GB per file group, you can set the following Spark properties:

  • spark.dynamicAllocation.enabled = FALSE

  • spark.executor.memory = 20 GB

  • spark.executor.instances = 5
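With this configuration, Iceberg rewrites one file group at a time, and each file group is handled by a single Spark job sized by the executors above. The following call is a sketch of the corresponding procedure invocation; setting the concurrency option to 1 (its default value) only makes the sequential behavior explicit, and the catalog and table names are placeholders.

-- Process file groups sequentially on a fixed-size cluster.
CALL glue_catalog.system.rewrite_data_files(
  table => 'db.sales',
  options => map(
    'max-concurrent-file-group-rewrites', '1'
  )
);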

If you want to speed up compaction, you can scale horizontally by increasing the number of file groups that are compacted in parallel. You can also scale Amazon EMR by using manual or dynamic scaling.

  • Manual scaling (for example, by a factor of 4)

    • MAX_CONCURRENT_FILE_GROUP_REWRITES = 4 (our factor)

    • spark.executor.instances = 5 (value used in the example) x 4 (our factor) = 20

    • spark.dynamicAllocation.enabled = FALSE

  • Dynamic scaling

    • spark.dynamicAllocation.enabled = TRUE (default, no action required)

    • MAX_CONCURRENT_FILE_GROUP_REWRITES = N (align this value with spark.dynamicAllocation.maxExecutors, which is 100 by default; based on the executor configurations in the example, you can set N to 20)

    These are guidelines to help size the cluster. However, you should also monitor the performance of your Spark jobs to find the best settings for your workloads.
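For example, to apply the manual scaling factor of 4 from the previous list, you can pass the concurrency option directly to the procedure. This is a sketch with placeholder names; pair it with the executor settings described above so that each concurrent file group has enough resources.

-- Compact up to four file groups in parallel (manual scaling by a factor of 4).
CALL glue_catalog.system.rewrite_data_files(
  table => 'db.sales',
  options => map(
    'max-concurrent-file-group-rewrites', '4'
  )
);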

Running compaction with Amazon Athena

Athena offers an implementation of Iceberg's compaction utility as a managed feature through the OPTIMIZE statement. You can use this statement to run compaction without having to size or manage the underlying infrastructure.

This statement groups small files into larger files by using the bin packing algorithm and merges delete files with existing data files. To cluster the data by using hierarchical sorting or z-order sorting, use Spark on Amazon EMR or AWS Glue.

You can change the default behavior of the OPTIMIZE statement at table creation by passing table properties in the CREATE TABLE statement, or after table creation by using the ALTER TABLE statement. For default values, see the Athena documentation.
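The following statements are a sketch of this flow in Athena. The table name and partition predicate are placeholders, and the table property shown is one example of the properties that use the optimize_ prefix; check the Athena documentation for the full list and defaults.

-- Compact one partition at a time to reduce the risk of timeouts.
OPTIMIZE sales REWRITE DATA USING BIN_PACK
WHERE month = '01';

-- Adjust compaction behavior after table creation.
ALTER TABLE sales SET TBLPROPERTIES (
  'optimize_rewrite_delete_file_threshold' = '3'
);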

Recommendations for running compaction

Use case: Running bin packing compaction based on a schedule

Recommendations:

  • Use the OPTIMIZE statement in Athena if you don't know how many small files your table contains. The Athena pricing model is based on the data scanned, so if there are no files to be compacted, there is no cost associated with these operations. To avoid encountering timeouts on Athena tables, run OPTIMIZE on a per-table-partition basis.

  • Use Amazon EMR or AWS Glue with dynamic scaling when you expect large volumes of small files to be compacted.

Use case: Running bin packing compaction based on events

Recommendation:

  • Use Amazon EMR or AWS Glue with dynamic scaling when you expect large volumes of small files to be compacted.

Use case: Running compaction to sort data

Recommendation:

  • Use Amazon EMR or AWS Glue, because sorting is an expensive operation and might need to spill data to disk.

Use case: Running compaction to cluster the data by using z-order sorting

Recommendation:

  • Use Amazon EMR or AWS Glue, because z-order sorting is a very expensive operation and might need to spill data to disk.

Use case: Running compaction on partitions that might be updated by other applications because of late-arriving data

Recommendation:

  • Use Amazon EMR or AWS Glue. Enable the Iceberg PARTIAL_PROGRESS_ENABLED property. When you use this option, Iceberg splits the compaction output into multiple commits. If there is a collision (that is, if a data file is updated while compaction is running), this setting reduces the cost of the retry by limiting it to the commit that includes the affected file. Otherwise, you might have to recompact all the files.
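The following call is a sketch of enabling partial progress through the procedure options; the catalog and table names are placeholders, and the maximum number of commits is only an example value.

-- Split the compaction output into multiple commits so that a conflict
-- with a concurrent writer forces a retry of only the affected commit.
CALL glue_catalog.system.rewrite_data_files(
  table => 'db.sales',
  options => map(
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10'
  )
);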