Creating Iceberg tables - Amazon Athena

Creating Iceberg tables

To create an Iceberg table for use in Athena, you can use a CREATE TABLE statement as documented on this page, or you can use an AWS Glue crawler.

Using a CREATE TABLE statement

Athena creates Iceberg v2 tables. For the difference between v1 and v2 tables, see Format version changes in the Apache Iceberg documentation.

Athena CREATE TABLE creates an Iceberg table with no data. You can query a table from external systems such as Apache Spark directly if the table uses the Iceberg open source glue catalog. You do not have to create an external table.

Warning

Running CREATE EXTERNAL TABLE results in the error message External keyword not supported for table type ICEBERG.

To create an Iceberg table from Athena, set the 'table_type' table property to 'ICEBERG' in the TBLPROPERTIES clause, as in the following syntax summary.

CREATE TABLE [db_name.]table_name (col_name data_type [COMMENT col_comment] [, ...] ) [PARTITIONED BY (col_name | transform, ... )] LOCATION 's3://DOC-EXAMPLE-BUCKET/your-folder/' TBLPROPERTIES ( 'table_type' ='ICEBERG' [, property_name=property_value] )

For information about the data types that you can query in Iceberg tables, see Supported data types for Iceberg tables in Athena.

Partitioning

To create Iceberg tables with partitions, use PARTITIONED BY syntax. Columns used for partitioning must be specified in the columns declarations first. Within the PARTITIONED BY clause, the column type must not be included. You can also define partition transforms in CREATE TABLE syntax. To specify multiple columns for partitioning, separate the columns with the comma (,) character, as in the following example.

CREATE TABLE iceberg_table (id bigint, data string, category string) PARTITIONED BY (category, bucket(16, id)) LOCATION 's3://DOC-EXAMPLE-BUCKET/your-folder/' TBLPROPERTIES ( 'table_type' = 'ICEBERG' )

The following table shows the available partition transform functions.

Function Description Supported types
year(ts) Partition by year date, timestamp
month(ts) Partition by month date, timestamp
day(ts) Partition by day date, timestamp
hour(ts) Partition by hour timestamp
bucket(N, col) Partition by hashed value mod N buckets. This is the same concept as hash bucketing for Hive tables. int, long, decimal, date, timestamp, string, binary
truncate(L, col) Partition by value truncated to L int, long, decimal, string

Athena supports Iceberg's hidden partitioning. For more information, see Iceberg's hidden partitioning in the Apache Iceberg documentation.

Table properties

This section describes table properties that you can specify as key-value pairs in the TBLPROPERTIES clause of the CREATE TABLE statement. Athena allows only a predefined list of key-value pairs in the table properties for creating or altering Iceberg tables. The following tables show the table properties that you can specify. For more information about the compaction options, see Optimizing Iceberg tables in this documentation. If you would like Athena to support a specific open source table configuration property, send feedback to athena-feedback@amazon.com.

format

Description File data format
Allowed property values Supported file format and compression combinations vary by Athena engine version. For more information, see Iceberg table compression support by file format.
Default value parquet

write_compression

Description File compression codec
Allowed property values Supported file format and compression combinations vary by Athena engine version. For more information, see Iceberg table compression support by file format.
Default value

Default write compression varies by Athena engine version. For more information, see Iceberg table compression support by file format.

optimize_rewrite_data_file_threshold

Description Data optimization specific configuration. If there are fewer data files that require optimization than the given threshold, the files are not rewritten. This allows the accumulation of more data files to produce files closer to the target size and skip unnecessary computation for cost saving.
Allowed property values A positive number. Must be less than 50.
Default value 5

optimize_rewrite_delete_file_threshold

Description Data optimization specific configuration. If there are fewer delete files associated with a data file than the threshold, the data file is not rewritten. This allows the accumulation of more delete files for each data file for cost saving.
Allowed property values A positive number. Must be less than 50.
Default value 2

vacuum_min_snapshots_to_keep

Description

Minimum number of snapshots to retain on a table's main branch.

This value takes precedence over the vacuum_max_snapshot_age_seconds property. If the minimum remaining snapshots are older than the age specified by vacuum_max_snapshot_age_seconds, the snapshots are kept, and the value of vacuum_max_snapshot_age_seconds is ignored.

Allowed property values A positive number.
Default value 1

vacuum_max_snapshot_age_seconds

Description Maximum age of the snapshots to retain on the main branch. This value is ignored if the remaining minimum of snapshots specified by vacuum_min_snapshots_to_keep are older than the age specified. This table behavior property corresponds to the history.expire.max-snapshot-age-ms property in Apache Iceberg configuration.
Allowed property values A positive number.
Default value 432000 seconds (5 days)

vacuum_max_metadata_files_to_keep

Description The maximum number of previous metadata files to retain on the table's main branch.
Allowed property values A positive number.
Default value 100

Example CREATE TABLE statement

The following example creates an Iceberg table that has three columns.

CREATE TABLE iceberg_table ( id int, data string, category string) PARTITIONED BY (category, bucket(16,id)) LOCATION 's3://DOC-EXAMPLE-BUCKET/iceberg-folder' TBLPROPERTIES ( 'table_type'='ICEBERG', 'format'='parquet', 'write_compression'='snappy', 'optimize_rewrite_delete_file_threshold'='10' )

CREATE TABLE AS SELECT (CTAS)

For information about creating an Iceberg table using the CREATE TABLE AS statement, see CREATE TABLE AS, with particular attention to the CTAS table properties section.

Using an AWS Glue crawler

You can use an AWS Glue crawler to automatically register your Iceberg tables into the AWS Glue Data Catalog. If you want to migrate from another Iceberg catalog, you can create and schedule an AWS Glue crawler and provide the Amazon S3 paths where the Iceberg tables are located. You can specify the maximum depth of the Amazon S3 paths that the AWS Glue crawler can traverse. After you schedule an AWS Glue crawler, the crawler extracts schema information and updates the AWS Glue Data Catalog with the schema changes every time it runs. The AWS Glue crawler supports schema merging across snapshots and updates the latest metadata file location in the AWS Glue Data Catalog. For more information, see Data Catalog and crawlers in AWS Glue.