Optimizing Iceberg tables

Amazon S3 data lakes that use open table formats such as Apache Iceberg store data as Amazon S3 objects. Having thousands of small Amazon S3 objects in a data lake table increases the metadata overhead on Iceberg tables and degrades read performance. To improve read performance for AWS analytics services such as Amazon Athena and Amazon EMR, and for AWS Glue ETL jobs, the AWS Glue Data Catalog provides managed compaction (a process that compacts small Amazon S3 objects into larger objects) for Iceberg tables in the Data Catalog. You can use the Lake Formation console, the AWS Glue console, the AWS CLI, or the AWS API to enable or disable compaction for individual Iceberg tables in the Data Catalog.
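As an illustration, the following Python sketch enables compaction for one table through the AWS Glue CreateTableOptimizer API using boto3. The account ID, database name, table name, and IAM role ARN are placeholders; the role you supply must allow the Data Catalog to rewrite the table's Amazon S3 data files.

import boto3

glue = boto3.client("glue")

# Enable managed compaction for a single Iceberg table in the Data Catalog.
# All identifiers below are placeholders for illustration only.
glue.create_table_optimizer(
    CatalogId="111122223333",
    DatabaseName="my_database",
    TableName="my_iceberg_table",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::111122223333:role/GlueCompactionRole",
        "enabled": True,
    },
)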

The table optimizer continuously monitors table partitions and starts the compaction process when the number of files and their sizes exceed the configured thresholds. An Iceberg table qualifies for compaction if the file size specified in the write.target-file-size-bytes property is within the 128MB to 512MB range. In the Data Catalog, compaction starts when the table has more than five files that are each smaller than 75% of the value of the write.target-file-size-bytes property.
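The following Python sketch expresses that qualification check. The 75% ratio, the five-file minimum, and the 128MB to 512MB range come from the description above; the function itself is illustrative and is not part of any AWS SDK.

MB = 1024 * 1024

def qualifies_for_compaction(target_file_size_bytes, file_sizes_bytes):
    """Illustrative check mirroring the Data Catalog's compaction trigger."""
    # The write.target-file-size-bytes property must fall in the 128MB-512MB range.
    if not 128 * MB <= target_file_size_bytes <= 512 * MB:
        return False
    # Compaction starts when more than five files are each smaller than
    # 75% of the target file size.
    small_files = [s for s in file_sizes_bytes if s < 0.75 * target_file_size_bytes]
    return len(small_files) > 5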

For example, suppose you have a table with the file size threshold set to 512MB in the write.target-file-size-bytes property (within the prescribed range of 128MB to 512MB), and the table contains 10 files. If 6 of the 10 files are each smaller than 384MB (0.75 * 512MB), the Data Catalog triggers compaction.
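Running the sketch above on this example confirms the behavior; the individual file sizes below are hypothetical.

# 6 of the 10 files are below the 384MB threshold (0.75 * 512MB).
sizes = [100 * MB] * 6 + [400 * MB] * 4
print(qualifies_for_compaction(512 * MB, sizes))  # True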

The Data Catalog performs compaction without interfering with concurrent queries. The Data Catalog supports data compaction only for tables in the Parquet format.

For supported data types, compression formats, and limitations, see Supported formats and limitations for managed data compaction.