In-place migration
In-place migration eliminates the need to rewrite all your data files. Instead, Iceberg metadata files are generated and linked to your existing data files. This method is typically faster and more cost-effective, especially for large datasets or tables that have compatible file formats such as Parquet, Avro, and ORC.
Note
In-place migration cannot be used when migrating to Amazon S3 Tables.
Iceberg offers two main options for implementing in-place migration:
- Using the snapshot procedure to create a new Iceberg table while keeping the source table unchanged. For more information, see Snapshot Table in the Iceberg documentation.
- Using the migrate procedure to create a new Iceberg table as a replacement for the source table. For more information, see Migrate Table in the Iceberg documentation. Although this procedure works with Hive Metastore (HMS), it isn't currently compatible with the AWS Glue Data Catalog. The Replicating the table migration procedure in AWS Glue Data Catalog section later in this guide provides a workaround for achieving a similar outcome with the Data Catalog.
After you perform in-place migration by using either snapshot or migrate, some data files might remain unmigrated. This typically happens when writers continue writing to the source table during or after migration. To incorporate these remaining files into your Iceberg table, you can use the add_files procedure. For more information, see Keeping Iceberg tables in sync after in-place migration later in this section.
Let's say you have a Parquet-based products table that was created and populated in Athena as follows:

CREATE EXTERNAL TABLE mydb.products (
    product_id INT,
    product_name STRING
)
PARTITIONED BY (category STRING)
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/products/';

INSERT INTO mydb.products VALUES
    (1001, 'Smartphone', 'electronics'),
    (1002, 'Laptop', 'electronics'),
    (2001, 'T-Shirt', 'clothing'),
    (2002, 'Jeans', 'clothing');
The following sections explain how you can use the snapshot
and migrate
procedures
with this table.
Option 1: snapshot procedure
The snapshot
procedure creates a new Iceberg table that has a
different name but replicates the schema and partitioning of the source table. This
operation leaves the source table completely unchanged both during and after the
action. It effectively creates a lightweight copy of the table, which is
particularly useful for testing scenarios or data exploration without risking
modifications to the original data source. This approach enables a transition period
where both the original table and the Iceberg table remain available (see the notes
at the end of this section). When testing is complete, you can move your new Iceberg
table to production by transitioning all writers and readers to the new
table.
You can run the snapshot procedure by using Spark in any Amazon EMR deployment model (for example, Amazon EMR on EC2, Amazon EMR on EKS, or EMR Serverless) and in AWS Glue.
To test in-place migration with the snapshot
Spark procedure, follow
these steps:
- Launch a Spark application and configure the Spark session with the following settings (a PySpark sketch that applies these settings follows these steps):
  - "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  - "spark.sql.catalog.spark_catalog":"org.apache.iceberg.spark.SparkSessionCatalog"
  - "spark.sql.catalog.spark_catalog.type":"glue"
  - "spark.hadoop.hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
- Run the snapshot procedure to create a new Iceberg table that points to the original table data files:

  spark.sql(f"""
      CALL system.snapshot(
          source_table => 'mydb.products',
          table => 'mydb.products_iceberg',
          location => 's3://amzn-s3-demo-bucket/products_iceberg/'
      )
  """).show(truncate=False)

  The output dataframe contains the imported_files_count (the number of files that were added).
- Validate the new table by querying it:

  spark.sql(f"""
      SELECT * FROM mydb.products_iceberg LIMIT 10
  """).show(truncate=False)
Notes
- After you run the procedure, any data file modifications on the source table will cause the generated table to go out of sync. New files that you add won't be visible in the Iceberg table, and files that you remove will affect query capabilities in the Iceberg table. To avoid these synchronization issues:
  - If the new Iceberg table is intended for production use, stop all processes that write to the original table and redirect them to the new table.
  - If you need a transition period or if the new Iceberg table is for testing purposes, see Keeping Iceberg tables in sync after in-place migration later in this section for guidance on maintaining table synchronization.
- When you use the snapshot procedure, the gc.enabled property is set to true in the table properties of the created Iceberg table. This setting prohibits actions such as expire_snapshots, remove_orphan_files, or DROP TABLE with the PURGE option, which would physically delete data files. Iceberg delete or merge operations, which do not directly impact source files, are still allowed.
- To make your new Iceberg table fully functional, with no limits on actions that physically delete data files, you can change the gc.enabled table property to false. However, this setting will allow actions that impact source data files, which could corrupt access to the original table. Therefore, change the gc.enabled property only if you no longer need to maintain the original table's functionality. For example:

  spark.sql(f"""
      ALTER TABLE mydb.products_iceberg SET TBLPROPERTIES ('gc.enabled' = 'false');
  """)
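To confirm the change, you can list the table properties with standard Spark SQL (a minimal sketch; the set of properties shown depends on your Iceberg version):

# Sketch: verify that gc.enabled is now false on the Iceberg table
spark.sql("SHOW TBLPROPERTIES mydb.products_iceberg").show(truncate=False)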
Option 2: migrate procedure
The migrate
procedure creates a new Iceberg table that has the same
name, schema, and partitioning as the source table. When this procedure runs, it
locks the source table and renames it to <table_name>_BACKUP_
(or
a custom name specified by the backup_table_name
procedure
parameter).
Note
If you set the drop_backup
procedure parameter to
true
, the original table will not be retained as a
backup.
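For illustration, here is a hedged sketch of a migrate call that sets a custom backup table name (the backup name is hypothetical, and parameter availability depends on your Iceberg version):

# Sketch: migrate the table and keep the original under a custom, hypothetical backup name
spark.sql("""
    CALL system.migrate(
        table => 'default.products',
        backup_table_name => 'products_pre_iceberg_backup'
    )
""").show(truncate=False)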
Because of this locking and renaming behavior, the migrate procedure requires all modifications that affect the source table to be stopped before the action is performed. Before you run the migrate procedure:
- Stop all writers that interact with the source table.
- Modify readers and writers that don't natively support Iceberg to enable Iceberg support. For example:
  - Athena continues to work without modification.
  - Spark requires:
    - Iceberg Java Archive (JAR) files to be included in the classpath (see the Working with Iceberg in Amazon EMR and Working with Iceberg in AWS Glue sections earlier in this guide).
    - The following Spark session catalog configurations (using SparkSessionCatalog to add Iceberg support while maintaining built-in catalog functionalities for non-Iceberg tables):
      - "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
      - "spark.sql.catalog.spark_catalog":"org.apache.iceberg.spark.SparkSessionCatalog"
      - "spark.sql.catalog.spark_catalog.type":"glue"
      - "spark.hadoop.hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
- After you run the procedure, you can restart your writers with their new Iceberg configuration.
Currently, the migrate
procedure isn't compatible with the
AWS Glue Data Catalog, because the Data Catalog doesn't support the RENAME
operation.
Therefore, we recommend that you use this procedure only when you're working with
Hive Metastore. If you're using the Data Catalog, see the next section for an alternative
approach.
You can run the migrate procedure across all Amazon EMR deployment models (Amazon EMR on EC2, Amazon EMR on EKS, and EMR Serverless) and in AWS Glue, but it requires a configured connection to Hive Metastore. Amazon EMR on EC2 is the recommended choice
because it provides a built-in Hive Metastore configuration, which minimizes setup
complexity.
To test in-place migration with the migrate
Spark procedure from an
Amazon EMR on EC2 cluster that's configured with Hive Metastore, follow these
steps:
- Launch a Spark application and configure the Spark session to use the Iceberg Hive catalog implementation. For example, if you're using the pyspark CLI:

  pyspark \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive

- Create a products table in Hive Metastore. This is the source table, which already exists in a typical migration.
  - Create the products external Hive table in Hive Metastore to point to the existing data in Amazon S3:

    spark.sql(f"""
        CREATE EXTERNAL TABLE products (
            product_id INT,
            product_name STRING
        )
        PARTITIONED BY (category STRING)
        STORED AS PARQUET
        LOCATION 's3://amzn-s3-demo-bucket/products/';
    """)

  - Add the existing partitions by using the MSCK REPAIR TABLE command:

    spark.sql(f"""
        MSCK REPAIR TABLE products
    """)

  - Confirm that the table contains data by running a SELECT query:

    spark.sql(f"""
        SELECT * FROM products
    """).show(truncate=False)

- Use the Iceberg migrate procedure:

  df_res = spark.sql(f"""
      CALL system.migrate(
          table => 'default.products'
      )
  """)
  df_res.show()

  The output dataframe contains the migrated_files_count (the number of files that were added to the Iceberg table).
- Confirm that the backup table was created:

  spark.sql("show tables").show()

- Validate the operation by querying the Iceberg table:

  spark.sql(f"""
      SELECT * FROM products
  """).show(truncate=False)
Notes
- After you run the procedure, all current processes that query or write to the source table will be impacted if they aren't properly configured with Iceberg support. Therefore, we recommend that you follow these steps:
  - Stop all processes that use the source table before migration.
  - Perform the migration.
  - Reactivate the processes by using the proper Iceberg settings (see the sketch after these notes).
- If data file modifications occur during the migration process (new files are added or files are removed), the generated table will get out of sync. For synchronization options, see Keeping Iceberg tables in sync after in-place migration later in this section.
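For example, after the session is configured with the Iceberg settings described earlier, a writer can resume inserts against the migrated table unchanged (a minimal sketch; the row values are illustrative):

# Sketch: a reactivated writer appends to the migrated (now Iceberg) table as before
spark.sql("""
    INSERT INTO products VALUES (3001, 'Sneakers', 'clothing')
""")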
Replicating the table migration procedure in AWS Glue Data Catalog
You can replicate the migrate procedure's outcome in AWS Glue Data Catalog (backing up the original table and replacing it with an Iceberg table) by following these steps:
- Use the snapshot procedure to create a new Iceberg table that points to the original table's data files.
- Back up the original table metadata in the Data Catalog (a boto3 sketch of this step appears after these steps):
  - Use the GetTable API to retrieve the source table definition.
  - Use the GetPartitions API to retrieve the source table partition definition.
  - Use the CreateTable API to create a backup table in the Data Catalog.
  - Use the CreatePartition or BatchCreatePartition API to register partitions to the backup table in the Data Catalog.
- Change the gc.enabled Iceberg table property to false to enable full table operations.
- Drop the original table.
- Locate the Iceberg table metadata JSON file in the metadata folder of the table's root location.
- Register the new table in the Data Catalog by using the register_table procedure with the original table name and the location of the metadata.json file that was created by the snapshot procedure:

  spark.sql(f"""
      CALL system.register_table(
          table => 'mydb.products',
          metadata_file => '{iceberg_metadata_file}'
      )
  """).show(truncate=False)
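The following is a minimal boto3 sketch of the metadata backup step above. It uses the mydb.products example table and a hypothetical backup table name (products_backup); the field filtering might need adjustment for your table definitions:

import boto3

glue = boto3.client("glue")

DB = "mydb"
SOURCE = "products"
BACKUP = "products_backup"  # hypothetical backup table name

# GetTable: retrieve the source table definition
table = glue.get_table(DatabaseName=DB, Name=SOURCE)["Table"]

# Keep only the fields that CreateTable accepts in TableInput
table_input_keys = [
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
]
table_input = {k: table[k] for k in table_input_keys if k in table}
table_input["Name"] = BACKUP

# CreateTable: create the backup table in the Data Catalog
glue.create_table(DatabaseName=DB, TableInput=table_input)

# GetPartitions + BatchCreatePartition: copy partitions to the backup table
# (BatchCreatePartition accepts at most 100 partitions per call, so page by 100)
paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(
    DatabaseName=DB, TableName=SOURCE, PaginationConfig={"PageSize": 100}
):
    partition_inputs = [
        {k: p[k] for k in ("Values", "StorageDescriptor", "Parameters") if k in p}
        for p in page["Partitions"]
    ]
    if partition_inputs:
        glue.batch_create_partition(
            DatabaseName=DB, TableName=BACKUP, PartitionInputList=partition_inputs
        )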
Keeping Iceberg tables in sync after in-place migration
The add_files
procedure provides a flexible way to incorporate
existing data into Iceberg tables. Specifically, it registers existing data files
(such as Parquet files) by referencing their absolute paths in Iceberg's metadata
layer. By default, the procedure adds files from all table partitions to an Iceberg
table, but you can selectively add files from specific partitions. This selective
approach is particularly useful in several scenarios:
- When new partitions are added to the source table after initial migration.
- When data files are added to or removed from existing partitions after initial migration. However, re-adding modified partitions requires partition deletion first. More information about this is provided later in this section.
Here are some considerations for using the add_files procedure after in-place migration (snapshot or migrate) has been performed, to keep the new Iceberg table in sync with the source data files:
- When new data is added to new partitions in the source table, use the add_files procedure with the partition_filter option to selectively incorporate these additions into the Iceberg table:

  spark.sql(f"""
      CALL system.add_files(
          source_table => 'mydb.products',
          table => 'mydb.products_iceberg',
          partition_filter => map('category', 'electronics')
      )
  """).show(truncate=False)

  or:

  spark.sql(f"""
      CALL system.add_files(
          source_table => '`parquet`.`s3://amzn-s3-demo-bucket/products/`',
          table => 'mydb.products_iceberg',
          partition_filter => map('category', 'electronics')
      )
  """).show(truncate=False)
- The add_files procedure scans for files either in the entire source table or in specific partitions when you specify the partition_filter option, and attempts to add all files it finds to the Iceberg table. By default, the check_duplicate_files procedure property is set to true, which prevents the procedure from running if files already exist in the Iceberg table. This is important because there is no built-in option to skip previously added files, and disabling check_duplicate_files will cause files to be added twice, creating duplicates. When new files are added to the source table, follow these steps:
  - For new partitions, use add_files with a partition_filter to import only files from the new partition.
  - For existing partitions, first delete the partition from the Iceberg table, and then re-run add_files for that partition, specifying the partition_filter. For example:

    # We initially perform in-place migration with snapshot
    spark.sql(f"""
        CALL system.snapshot(
            source_table => 'mydb.products',
            table => 'mydb.products_iceberg',
            location => 's3://amzn-s3-demo-bucket/products_iceberg/'
        )
    """).show(truncate=False)

    # Then, on the source table, some new files were generated under the category='electronics' partition. Example:
    spark.sql("""
        INSERT INTO mydb.products VALUES (1003, 'Tablet', 'electronics')
    """)

    # We delete the modified partition from the Iceberg table. Note this is a metadata operation only
    spark.sql("""
        DELETE FROM mydb.products_iceberg WHERE category = 'electronics'
    """)

    # We add_files from the modified partition
    spark.sql("""
        CALL system.add_files(
            source_table => 'mydb.products',
            table => 'mydb.products_iceberg',
            partition_filter => map('category', 'electronics')
        )
    """).show(truncate=False)
Note
Every add_files
operation generates a new Iceberg table snapshot
with appended data.
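If you want to observe this, you can query the Iceberg snapshots metadata table after running add_files (a minimal sketch; available columns can vary by Iceberg version):

# Sketch: each add_files run should appear as an additional append snapshot
spark.sql("""
    SELECT committed_at, snapshot_id, operation
    FROM mydb.products_iceberg.snapshots
""").show(truncate=False)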
Choosing the right in-place migration strategy
To choose the best in-place migration strategy, consider the questions in the following table.
Question | Recommendation | Explanation
---|---|---
Do you want to quickly migrate without rewriting data while keeping both Hive and Iceberg tables accessible for testing or gradual transition? | snapshot procedure | Use the snapshot procedure to create a new Iceberg table that points to the existing data files while the source table stays unchanged, so both tables remain available during testing or a gradual transition.
Are you using Hive Metastore and do you want to replace your Hive table with an Iceberg table immediately, without rewriting the data? | migrate procedure | Use the migrate procedure to back up the source table and create an Iceberg table that has the same name. Note: This option is compatible with Hive Metastore but not with AWS Glue Data Catalog. If you're using the Data Catalog, use the adaptation described in the next row instead.
Are you using AWS Glue Data Catalog and do you want to replace your Hive table with an Iceberg table immediately, without rewriting the data? | Adaptation of the migrate procedure | Replicate the migrate procedure's behavior by combining the snapshot and register_table procedures, as described in Replicating the table migration procedure in AWS Glue Data Catalog. Note: This option requires manual handling of AWS Glue API calls for metadata backup. If you're using Hive Metastore, use the migrate procedure instead.