Using the Iceberg framework in AWS Glue

AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. Iceberg provides a high-performance table format that works just like a SQL table. This topic covers available features for using your data in AWS Glue when you transport or store your data in an Iceberg table. To learn more about Iceberg, see the official Apache Iceberg documentation.

You can use AWS Glue to perform read and write operations on Iceberg tables in Amazon S3, or work with Iceberg tables using the AWS Glue Data Catalog. Additional operations, including insert, update, and all Spark Queries and Spark Writes, are also supported.

Note

ALTER TABLE … RENAME TO is not available for Apache Iceberg 0.13.1 for AWS Glue 3.0.

The following table lists the version of Iceberg included in each AWS Glue version.

AWS Glue version    Supported Iceberg version
4.0                 1.0.0
3.0                 0.13.1

To learn more about the data lake frameworks that AWS Glue supports, see Using data lake frameworks with AWS Glue ETL jobs.

Enabling the Iceberg framework

To enable Iceberg for AWS Glue, complete the following tasks:

  • Specify iceberg as a value for the --datalake-formats job parameter. For more information, see AWS Glue job parameters.

  • Create a key named --conf for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using SparkConf in your script. These settings help Apache Spark correctly handle Iceberg tables.

    spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/ --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

    If you use AWS Glue 3.0 with Iceberg 0.13.1, you must also set the following configurations to use the Amazon DynamoDB lock manager, which ensures atomic transactions. AWS Glue 4.0 uses optimistic locking by default. For more information, see Iceberg AWS Integrations in the official Apache Iceberg documentation.

    --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=<your-dynamodb-table-name>
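The settings above can also be assembled programmatically. The following is a minimal sketch, not an official AWS utility: ICEBERG_SETTINGS, with_dynamodb_lock, build_conf_value, and build_spark_conf are hypothetical names introduced here for illustration, and the warehouse path and lock table name remain placeholders. It shows how the individual settings chain into the single value of the --conf key, and how the same settings could be applied through SparkConf in a script.

```python
# Sketch only: helper names are illustrative assumptions, not part of AWS Glue.

# Settings copied from the --conf value shown above.
ICEBERG_SETTINGS = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.warehouse": "s3://<your-warehouse-dir>/",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
}


def with_dynamodb_lock(settings, lock_table):
    """Return a copy of settings plus the lock manager entries for AWS Glue 3.0."""
    extra = {
        "spark.sql.catalog.glue_catalog.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
        "spark.sql.catalog.glue_catalog.lock.table": lock_table,
    }
    return {**settings, **extra}


def build_conf_value(settings):
    """Join settings into the single string used as the value of the --conf key."""
    pairs = [f"{key}={value}" for key, value in settings.items()]
    # The first pair has no prefix; every later pair is introduced by " --conf ".
    return " --conf ".join(pairs)


def build_spark_conf(settings):
    """Alternative: apply the same settings through SparkConf inside the script."""
    from pyspark import SparkConf  # imported lazily; provided by the Spark runtime

    conf = SparkConf()
    for key, value in settings.items():
        conf.set(key, value)
    return conf
```

Calling build_conf_value(ICEBERG_SETTINGS) yields the base value for AWS Glue 4.0. For AWS Glue 3.0, wrap the settings with with_dynamodb_lock(ICEBERG_SETTINGS, "<your-dynamodb-table-name>") first.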

Using a different Iceberg version

To use a version of Iceberg that AWS Glue doesn't support, specify your own Iceberg JAR files using the --extra-jars job parameter. Do not include iceberg as a value for the --datalake-formats parameter.
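The choice between the bundled Iceberg version and your own JAR files comes down to which job parameter you set. The following sketch illustrates that rule under stated assumptions: iceberg_job_arguments is a hypothetical helper, the JAR path is a placeholder rather than a real location, and multiple JAR paths are joined with commas. The resulting dictionary could be merged into the job's other parameters.

```python
# Sketch only: the helper name and the JAR path below are illustrative assumptions.

def iceberg_job_arguments(extra_jars=None):
    """Return job parameters for either the bundled or a custom Iceberg version.

    extra_jars: optional list of Amazon S3 paths to your own Iceberg JAR files.
    """
    if extra_jars:
        # Custom Iceberg version: supply the JARs and leave out --datalake-formats.
        return {"--extra-jars": ",".join(extra_jars)}
    # Bundled Iceberg version: let AWS Glue load the Iceberg libraries.
    return {"--datalake-formats": "iceberg"}


# Placeholder JAR location, used only for illustration.
custom = iceberg_job_arguments(["s3://<your-bucket>/jars/<your-iceberg-runtime>.jar"])
bundled = iceberg_job_arguments()
```

Here custom contains only --extra-jars, and bundled contains only --datalake-formats, so the two parameters are never set together.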

Example: Write an Iceberg table to Amazon S3 and register it to the AWS Glue Data Catalog

This example script demonstrates how to write an Iceberg table to Amazon S3. The example uses Iceberg AWS Integrations to register the table to the AWS Glue Data Catalog.

Python
    # Example: Create an Iceberg table from a DataFrame
    # and register the table to Glue Data Catalog
    # Assumes `spark` is a SparkSession configured as described above
    # and `dataFrame` is an existing DataFrame.
    dataFrame.createOrReplaceTempView("tmp_<your_table_name>")

    query = f"""
    CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
    USING iceberg
    AS SELECT * FROM tmp_<your_table_name>
    """
    spark.sql(query)
Scala
    // Example: Create an Iceberg table from a DataFrame
    // and register the table to Glue Data Catalog
    // Assumes `spark` is a SparkSession configured as described above
    // and `dataFrame` is an existing DataFrame.
    dataFrame.createOrReplaceTempView("tmp_<your_table_name>")

    val query = """CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
    USING iceberg
    AS SELECT * FROM tmp_<your_table_name>"""
    spark.sql(query)
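If you create several tables this way, the view registration and the CREATE TABLE statement can be wrapped in a small function. The following is a minimal sketch, assuming a SparkSession configured with the glue_catalog settings described earlier. build_ctas_query and write_iceberg_table are hypothetical helpers, and the identifier check is an added precaution that the original example does not include.

```python
import re

# Sketch only: helper names and the identifier check are illustrative assumptions.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_ctas_query(database, table, source_view, catalog="glue_catalog"):
    """Build a CREATE TABLE ... AS SELECT statement for an Iceberg table."""
    for name in (catalog, database, table, source_view):
        # Reject anything that is not a plain identifier to avoid malformed SQL.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unexpected identifier: {name!r}")
    return (
        f"CREATE TABLE {catalog}.{database}.{table} "
        f"USING iceberg "
        f"AS SELECT * FROM {source_view}"
    )


def write_iceberg_table(spark, data_frame, database, table):
    """Register data_frame as a temporary view, then create the Iceberg table from it."""
    view = f"tmp_{table}"
    data_frame.createOrReplaceTempView(view)
    spark.sql(build_ctas_query(database, table, view))
```

In a job script, this would be called as write_iceberg_table(spark, dataFrame, "<your_database_name>", "<your_table_name>") with the placeholders replaced by real names.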

Example: Read an Iceberg table from Amazon S3 using the AWS Glue Data Catalog

This example reads the Iceberg table that you created in Example: Write an Iceberg table to Amazon S3 and register it to the AWS Glue Data Catalog.

Python

For this example, use the GlueContext.create_data_frame.from_catalog() method.

    # Example: Read an Iceberg table from Glue Data Catalog
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    sc = SparkContext()
    glueContext = GlueContext(sc)

    # Assumption: no extra options are needed; replace with your own if required.
    additional_options = {}

    df = glueContext.create_data_frame.from_catalog(
        database="<your_database_name>",
        table_name="<your_table_name>",
        additional_options=additional_options
    )
Scala

For this example, use the getCatalogSource method.

    // Example: Read an Iceberg table from Glue Data Catalog
    import com.amazonaws.services.glue.GlueContext
    import com.amazonaws.services.glue.util.JsonOptions
    import org.apache.spark.SparkContext

    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {
        val spark: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(spark)
        // Assumption: no extra options are needed; replace with your own if required.
        val additionalOptions: JsonOptions = JsonOptions.empty
        val df = glueContext.getCatalogSource("<your_database_name>", "<your_table_name>",
            additionalOptions = additionalOptions)
          .getDataFrame()
      }
    }

Example: Insert a DataFrame into an Iceberg table in Amazon S3 using the AWS Glue Data Catalog

This example inserts data into the Iceberg table that you created in Example: Write an Iceberg table to Amazon S3 and register it to the AWS Glue Data Catalog.

Note

This example requires you to set the --enable-glue-datacatalog job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see AWS Glue job parameters.

Python

For this example, use the GlueContext.write_data_frame.from_catalog() method.

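    # Example: Insert into an Iceberg table from Glue Data Catalog
    # Assumes `dataFrame` is an existing DataFrame that holds the rows to insert.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    sc = SparkContext()
    glueContext = GlueContext(sc)

    # Assumption: no extra options are needed; replace with your own if required.
    additional_options = {}

    glueContext.write_data_frame.from_catalog(
        frame=dataFrame,
        database="<your_database_name>",
        table_name="<your_table_name>",
        additional_options=additional_options
    )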
Scala

For this example, use the getCatalogSink method.

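    // Example: Insert into an Iceberg table from Glue Data Catalog
    // Assumes `dataFrame` is an existing DataFrame that holds the rows to insert.
    import com.amazonaws.services.glue.GlueContext
    import com.amazonaws.services.glue.util.JsonOptions
    import org.apache.spark.SparkContext

    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {
        val spark: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(spark)
        // Assumption: no extra options are needed; replace with your own if required.
        val additionalOptions: JsonOptions = JsonOptions.empty
        glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
            additionalOptions = additionalOptions)
          .writeDataFrame(dataFrame, glueContext)
      }
    }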

Example: Read an Iceberg table from Amazon S3 using Spark

Prerequisites: You will need to provision a catalog for the Iceberg library to use. When using the AWS Glue Data Catalog, AWS Glue makes this straightforward. The AWS Glue Data Catalog is pre-configured for use by the Spark libraries as glue_catalog. Data Catalog tables are identified by a databaseName and a tableName. For more information about the AWS Glue Data Catalog, see Data Catalog and crawlers in AWS Glue.

If you are not using the AWS Glue Data Catalog, you will need to provision a catalog through the Spark APIs. For more information, see Spark Configuration in the Iceberg documentation.

This example reads an Iceberg table in Amazon S3 from the Data Catalog using Spark.

Python
    # Example: Read an Iceberg table on S3 as a DataFrame from the Glue Data Catalog
    dataFrame = spark.read.format("iceberg").load("glue_catalog.databaseName.tableName")
Scala
    // Example: Read an Iceberg table on S3 as a DataFrame from the Glue Data Catalog
    val dataFrame = spark.read.format("iceberg").load("glue_catalog.databaseName.tableName")
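The string passed to load() combines the catalog name, the databaseName, and the tableName. The following sketch makes that composition explicit. It assumes the catalog is registered as glue_catalog, as described in the prerequisites. iceberg_identifier and read_iceberg_table are hypothetical helper names, not part of Spark or AWS Glue.

```python
# Sketch only: helper names are illustrative assumptions.

def iceberg_identifier(database_name, table_name, catalog="glue_catalog"):
    """Compose the three-part identifier that Spark resolves through the catalog."""
    return f"{catalog}.{database_name}.{table_name}"


def read_iceberg_table(spark, database_name, table_name):
    """Load an Iceberg table as a DataFrame using the Spark DataFrameReader."""
    identifier = iceberg_identifier(database_name, table_name)
    return spark.read.format("iceberg").load(identifier)
```

With a configured SparkSession named spark, read_iceberg_table(spark, "databaseName", "tableName") performs the same read as the one-line example above.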