Maintaining tables by using compaction
Iceberg includes features that enable you to carry out table maintenance operations.
Iceberg compaction
In Iceberg, you can use compaction to perform four tasks:

- Combining small files into larger files that are generally over 100 MB in size. This technique is known as bin packing.
- Merging delete files with data files. Delete files are generated by updates or deletes that use the merge-on-read approach.
- (Re)sorting the data in accordance with query patterns. Data can be written without any sort order or with a sort order that is suitable for writes and updates.
- Clustering the data by using space-filling curves to optimize for distinct query patterns, particularly z-order sorting.
On AWS, you can run table compaction and maintenance operations for Iceberg through Amazon Athena or by using Spark in Amazon EMR or AWS Glue.
When you run compaction by using the rewrite_data_files procedure, Spark organizes the files to be rewritten into file groups. For example, consider an Iceberg table that consists of four partitions, each with a different size and a different number of files. If you start a Spark application to run compaction, the application creates a total of four file groups to process. A file group is an Iceberg abstraction that represents a collection of files that will be processed by a single Spark job. That is, the Spark application that runs compaction will create four Spark jobs to process the data.
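For instance, a basic bin packing compaction can be started with Iceberg's rewrite_data_files Spark procedure. In this sketch, the catalog name (my_catalog) and table name (db.events) are placeholders:

```sql
-- Bin packing compaction with Iceberg's rewrite_data_files Spark procedure.
-- "my_catalog" and "db.events" are placeholder names for this sketch.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'binpack'
);
```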
Tuning compaction behavior
The following key properties control how data files are selected for compaction:
- MAX_FILE_GROUP_SIZE_BYTES sets the data limit for a single file group (Spark job) at 100 GB by default. This property is especially important for tables without partitions or tables with partitions that span hundreds of gigabytes. By setting this limit, you can break down operations to plan work and make progress while preventing resource exhaustion on the cluster. Note: Each file group is sorted separately. Therefore, if you want to perform a partition-level sort, you must adjust this limit to match the partition size.
- MIN_FILE_SIZE_BYTES or MIN_FILE_SIZE_DEFAULT_RATIO defaults to 75 percent of the target file size set at the table level. For example, if a table has a target size of 512 MB, any file that is smaller than 384 MB is included in the set of files that will be compacted.
- MAX_FILE_SIZE_BYTES or MAX_FILE_SIZE_DEFAULT_RATIO defaults to 180 percent of the target file size. As with the two properties that set minimum file sizes, these properties are used to identify candidate files for the compaction job.
- MIN_INPUT_FILES specifies the minimum number of files to be compacted if a table partition size is smaller than the target file size. The value of this property is used to determine whether it is worthwhile to compact the files based on the number of files (defaults to 5).
- DELETE_FILE_THRESHOLD specifies the minimum number of delete files that must reference a data file before that file is included in compaction. Unless you specify otherwise, compaction doesn't combine delete files with data files. To enable this functionality, you must set a threshold value by using this property. This threshold is specific to individual data files, so if you set it to 3, a data file will be rewritten only if there are three or more delete files that reference it.
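In the rewrite_data_files Spark procedure, these properties map to lowercase, hyphenated option keys. A sketch, with placeholder catalog and table names and illustrative values:

```sql
-- Tune file selection for compaction. "my_catalog" and "db.events"
-- are placeholders; the values shown are illustrative.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  options => map(
    'min-input-files', '5',
    'delete-file-threshold', '3',
    'max-file-group-size-bytes', '107374182400'  -- 100 GB
  )
);
```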
These properties provide insight into the formation of the file groups in the previous example. The partition labeled month=01 includes two file groups because it exceeds the maximum size constraint of 100 GB. In contrast, the month=02 partition contains a single file group because it's under 100 GB. The month=03 partition doesn't satisfy the default minimum input file requirement of five files. As a result, it won't be compacted. Lastly, although the month=04 partition doesn't contain enough data to form a single file of the desired size, its files will be compacted because the partition includes more than five small files.
You can set these parameters for Spark running on Amazon EMR or AWS Glue. For Amazon Athena, you can manage similar properties by using the table properties that start with the prefix optimize_.
Running compaction with Spark on Amazon EMR or AWS Glue
This section describes how to properly size a Spark cluster to run Iceberg's compaction utility. The following example uses Amazon EMR Serverless, but you can use the same methodology in Amazon EMR on Amazon EC2 or Amazon EKS, or in AWS Glue.
You can take advantage of the correlation between file groups and Spark jobs to plan the cluster resources. To process the file groups sequentially, considering the maximum size of 100 GB per file group, you can set the following Spark properties:
- spark.dynamicAllocation.enabled=FALSE
- spark.executor.memory=20 GB
- spark.executor.instances=5
If you want to speed up compaction, you can scale horizontally by increasing the number of file groups that are compacted in parallel. You can also scale Amazon EMR by using manual or dynamic scaling.
- Manually scaling (for example, by a factor of 4)
  - MAX_CONCURRENT_FILE_GROUP_REWRITES=4 (our factor)
  - spark.executor.instances=5 (value used in the example) x 4 (our factor) = 20
  - spark.dynamicAllocation.enabled=FALSE
- Dynamic scaling
  - spark.dynamicAllocation.enabled=TRUE (default, no action required)
  - MAX_CONCURRENT_FILE_GROUP_REWRITES=N (align this value with spark.dynamicAllocation.maxExecutors, which is 100 by default; based on the executor configurations in the example, you can set N to 20)
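In the rewrite_data_files Spark procedure, the concurrency setting corresponds to the max-concurrent-file-group-rewrites option. A sketch with placeholder names; the partial-progress option shown is optional and commits results incrementally:

```sql
-- Compact up to 4 file groups in parallel. "my_catalog" and
-- "db.events" are placeholders; values are illustrative.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  options => map(
    'max-concurrent-file-group-rewrites', '4',
    'partial-progress.enabled', 'true'
  )
);
```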
These are guidelines to help size the cluster. However, you should also monitor the performance of your Spark jobs to find the best settings for your workloads.
Running compaction with Amazon Athena
Athena offers an implementation of Iceberg's compaction utility as a managed feature through the OPTIMIZE statement. You can use this statement to run compaction without having to evaluate the infrastructure.
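For example, assuming an Iceberg table named db.events (a placeholder), a managed compaction run looks like this:

```sql
-- Athena's managed compaction: bin-packs small files and merges
-- delete files with data files. "db.events" is a placeholder name.
OPTIMIZE db.events REWRITE DATA USING BIN_PACK;
```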
This statement groups small files into larger files by using the bin packing algorithm and merges delete files with existing data files. To cluster the data by using hierarchical sorting or z-order sorting, use Spark on Amazon EMR or AWS Glue.
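For sort or z-order compaction in Spark, the rewrite_data_files procedure accepts a sort strategy. A sketch in which the catalog, table, and column names are placeholders:

```sql
-- Hierarchical sort compaction. All names here are placeholders.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'sort',
  sort_order => 'event_date DESC NULLS LAST'
);

-- Z-order compaction across two columns.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'sort',
  sort_order => 'zorder(pickup_lat, pickup_lon)'
);
```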
You can change the default behavior of the OPTIMIZE statement at table creation by passing table properties in the CREATE TABLE statement, or after table creation by using the ALTER TABLE statement.
For default values, see the Athena documentation.
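For instance, you can adjust thresholds on an existing table with ALTER TABLE. The table name below is a placeholder, and the values are illustrative; confirm the exact optimize_ property names and their defaults in the Athena documentation:

```sql
-- Adjust Athena's compaction thresholds on an existing Iceberg table.
-- "db.events" is a placeholder; verify property names and defaults
-- against the Athena documentation before use.
ALTER TABLE db.events SET TBLPROPERTIES (
  'optimize_rewrite_delete_file_threshold' = '3',
  'optimize_rewrite_data_file_threshold' = '5'
);
```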
Recommendations for running compaction
| Use case | Recommendation |
| --- | --- |
| Running bin packing compaction based on a schedule | |
| Running bin packing compaction based on events | |
| Running compaction to sort data | |
| Running compaction to cluster the data using z-order sorting | |
| Running compaction on partitions that might be updated by other applications because of late-arriving data | |