Working with Apache Iceberg tables by using Amazon Athena SQL - AWS Prescriptive Guidance

Working with Apache Iceberg tables by using Amazon Athena SQL

Amazon Athena provides built-in support for Apache Iceberg, and doesn't require additional steps or configuration. This section provides a detailed overview of supported features and high-level guidance for using Athena to interact with Iceberg tables.

Version and feature compatibility

Note

The following sections assume that you're using Athena engine version 3.

Iceberg table specification support

The Apache Iceberg table specification defines how Iceberg tables should behave. Athena supports table format version 2, so any Iceberg table that you create with the console, CLI, or SDK inherently uses that version.

If you use an Iceberg table that was created with another engine, such as Apache Spark on Amazon EMR or AWS Glue, make sure to set the table format version by using table properties. As a reference, see the section Creating and writing Iceberg tables earlier in this guide.
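For example, in Spark on Amazon EMR or AWS Glue, you could set the format version through a table property. The following is a sketch; the catalog and table names are illustrative:

```sql
-- Illustrative Spark SQL: upgrade an existing Iceberg table to format version 2
ALTER TABLE glue_catalog.iceberg_db.my_table
SET TBLPROPERTIES ('format-version' = '2');
```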

Iceberg feature support

You can use Athena to read from and write to Iceberg tables. When you change data by using the UPDATE, MERGE INTO, and DELETE FROM statements, Athena supports merge-on-read mode only; this property cannot be changed. To update or delete data in copy-on-write mode, you have to use another engine such as Apache Spark on Amazon EMR or AWS Glue. The following table summarizes Iceberg feature support in Athena.

Amazon Athena provides the following support:

  • Table format: Version 2

  • DDL support: Create table ✓; Schema evolution ✓

  • DML support: Reading data ✓; Writing data: copy-on-write ✗, merge-on-read ✓

  • AWS Lake Formation row/column access control (optional): ✓
Note

Athena doesn't support incremental queries.
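As an illustrative sketch, merge-on-read DML in Athena uses standard SQL statements. The table and column names below are hypothetical:

```sql
-- Hypothetical table and column names. Athena applies these changes in
-- merge-on-read mode, writing delete files instead of rewriting the
-- affected data files.
UPDATE iceberg_db.orders SET status = 'shipped' WHERE order_id = 1001;

DELETE FROM iceberg_db.orders WHERE status = 'canceled';

MERGE INTO iceberg_db.orders AS t
USING iceberg_db.orders_staging AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET status = s.status
WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (s.order_id, s.status);
```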

Working with Iceberg tables

For a quick start to using Iceberg in Athena, see the section Getting started with Iceberg tables in Athena SQL earlier in this guide.

The following table lists limitations and recommendations.

Scenario

Limitation

Recommendation

Table DDL generation

Iceberg tables created with other engines can have properties that aren't exposed in Athena. For these tables, Athena can't generate the DDL.

Use the equivalent statement in the engine that created the table (for example, the SHOW CREATE TABLE statement for Spark).

Random Amazon S3 prefixes in objects that are written to an Iceberg table

By default, Iceberg tables that are created with Athena have the write.object-storage.enabled property enabled.

To disable this behavior and gain full control over Iceberg table properties, create an Iceberg table with another engine such as Spark on Amazon EMR or AWS Glue.

Incremental queries

Not currently supported in Athena.

To use incremental queries to enable incremental data ingestion pipelines, use Spark on Amazon EMR or AWS Glue.
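For the random-prefix limitation above, the following sketch shows how you might create an Iceberg table in Spark with the write.object-storage.enabled property turned off; the catalog, database, and table names are illustrative:

```sql
-- Illustrative Spark SQL: create an Iceberg table without the
-- object-storage (randomized prefix) file layout
CREATE TABLE glue_catalog.iceberg_db.my_table (
  id bigint,
  data string)
USING iceberg
TBLPROPERTIES ('write.object-storage.enabled' = 'false');
```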

Migrating existing tables to Iceberg

To migrate your current Athena or AWS Glue tables (also known as Hive tables) to Iceberg format, you can use either in-place or full data migration:

  • In-place migration is the process of generating Iceberg's metadata files on top of existing data files.

  • Full data migration creates the Iceberg metadata layer and also rewrites existing data files from the original table to the new Iceberg table.

The following sections provide an overview of the APIs available to migrate tables and guidance for choosing a migration strategy. For more information about these two strategies, see the Table Migration section in the Iceberg documentation.

In-place migration

In-place migration eliminates the need to rewrite all the data files. Instead, Iceberg metadata files are generated and linked to your existing data files. Iceberg offers three options for implementing in-place migration:

  • The snapshot procedure creates a new Iceberg table that references the data files of the source table, leaving the source table unchanged.

  • The migrate procedure replaces the source table with an Iceberg table that is based on the source table's data files.

  • The add_files procedure adds the data files of a source table to an existing Iceberg table.

Currently, the migrate procedure doesn't work directly with the AWS Glue Data Catalog—it works only with the Hive metastore. If you have a requirement to use the migrate procedure instead of snapshot or add_files, you can use a temporary Amazon EMR cluster with the Hive metastore (HMS). This approach requires Iceberg version 1.2 or later.

Let's say you want to create the following Hive table:

Migrating a Hive table to Amazon Athena

You can create this Hive table by running this code in the Athena console:

CREATE EXTERNAL TABLE `hive_table`(
  `id` bigint,
  `data` string)
STORED AS PARQUET
LOCATION 's3://datalake-xxxx/aws_workshop/iceberg_db/hive_table';

INSERT INTO iceberg_db.hive_table VALUES (1, 'a');

If your Hive table is partitioned, include the partition statement and add the partitions according to Hive requirements.

ALTER TABLE default.placeholder_table_for_migration ADD PARTITION (date = '2023-10-10')

Steps:

  1. Create an Amazon EMR cluster without enabling the AWS Glue Data Catalog integration—that is, don't select the check boxes for Hive or Spark table metadata. That's because you will use the native Hive metastore (HMS) that's available in the cluster for this workaround.

    AWS Glue Data Catalog settings without Hive or Spark metadata
  2. Configure the Spark session to use the Iceberg Hive catalog implementation.

    "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions", "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog", "spark.sql.catalog.spark_catalog.type": "hive",
  3. Validate that your Amazon EMR cluster isn't connected to the AWS Glue Data Catalog by running SHOW DATABASES or SHOW TABLES.

    Validating that the Amazon EMR cluster isn't connected to the AWS Glue Data Catalog
  4. Register the Hive table in the Hive metastore of your Amazon EMR cluster, and then use the Iceberg migrate procedure.

    Iceberg migrate procedure

    This procedure creates the Iceberg metadata files in the same location as the Hive table.

  5. Register the migrated Iceberg table in the AWS Glue Data Catalog.

  6. Switch back to an Amazon EMR cluster that has the AWS Glue Data Catalog integration enabled.

    AWS Glue Data Catalog settings with Spark metadata
  7. Use the following Iceberg configuration in the Spark session.

    "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions", "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog", "spark.sql.catalog.glue_catalog.warehouse": "s3://datalake-xxxx/aws_workshop", "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog", "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",

You can now query this table from Amazon EMR, AWS Glue, or Athena.

Show tables command for Iceberg table

Full data migration

Full data migration recreates the data files as well as the metadata. This approach takes longer and requires additional computing resources compared with in-place migration. However, this option helps improve table quality: you can validate the data, make schema and partition changes, re-sort the data, and so on. To implement full data migration, use one of the following options:

  • Use the CREATE TABLE ... AS SELECT (CTAS) statement in Spark on Amazon EMR, AWS Glue, or Athena. You can set the partition specification and table properties for the new Iceberg table by using the PARTITIONED BY and TBLPROPERTIES clauses. You can fine-tune the schema and partitioning for the new table according to your needs instead of simply inheriting them from the source table.

  • Read from the source table and write the data as a new Iceberg table by using Spark on Amazon EMR or AWS Glue (see Creating a table in the Iceberg documentation).
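As a sketch of the CTAS option in Spark, the following creates a new Iceberg table from a source table; the table names, partition column, and properties are illustrative:

```sql
-- Illustrative Spark SQL CTAS: full data migration into a new Iceberg table,
-- with a new partition specification and explicit table properties
CREATE TABLE glue_catalog.iceberg_db.orders_iceberg
USING iceberg
PARTITIONED BY (days(order_ts))
TBLPROPERTIES ('format-version' = '2')
AS SELECT * FROM iceberg_db.orders;
```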

Choosing a migration strategy

To choose the best migration strategy, consider the questions in the following table.

Question

Recommendation

What is the data file format (for example, CSV or Apache Parquet)?

  • Consider in-place migration if your table file format is Parquet, ORC, or Avro.

  • For other formats such as CSV, JSON, and so on, use full data migration.

Do you want to update or consolidate the table schema?

  • If you want to evolve the table schema by using Iceberg native capabilities, consider in-place migration. For example, you can rename columns after the migration. (The schema can be changed in the Iceberg metadata layer.)

  • If you want to delete entire columns from data files, we recommend that you use full data migration.

Would the table benefit from changing the partition strategy?

  • If Iceberg's partitioning approach meets your requirements (for example, new data is stored by using the new partition layout while existing partitions remain as is), consider in-place migration.

  • If you want to use hidden partitions in your table, consider full data migration. For more information about hidden partitions, see the Best practices section.

Would the table benefit from adding or changing the sort order strategy?

  • Adding or changing the sort order of your data requires rewriting the dataset. In this case, consider using full data migration.

  • For large tables where it's prohibitively expensive to rewrite all the table partitions, consider using in-place migration and run compaction (with sorting enabled) for the most frequently accessed partitions.

Does the table have many small files?

  • Merging small files into larger files requires rewriting the dataset. In this case, consider using full data migration.

  • For large tables where it's prohibitively expensive to rewrite all the table partitions, consider using in-place migration and run compaction (with sorting enabled) for the most frequently accessed partitions.
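As a sketch, compaction with sorting can be run for specific partitions by using the Iceberg rewrite_data_files procedure in Spark; the table name, sort column, and partition filter below are illustrative:

```sql
-- Illustrative: compact small files in one partition, sorting by id
CALL glue_catalog.system.rewrite_data_files(
  table => 'iceberg_db.orders',
  strategy => 'sort',
  sort_order => 'id ASC',
  where => 'order_date = "2023-10-10"');
```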