Full data migration
Full data migration recreates the data files as well as the metadata. This approach takes longer and requires additional computing resources compared with in-place migration. However, full data migration offers significant opportunities to improve table quality and optimize data storage and access patterns.
During full data migration, you can perform several beneficial operations, such as data validation to ensure integrity and correctness, schema modifications to better meet current requirements, and partition strategy adjustments for improved query performance. You can also re-sort data to optimize common access patterns, implement Iceberg hidden partitioning for enhanced query efficiency, and perform file format conversion (for example, from CSV to Parquet) if desired.
These capabilities make full data migration ideal for transitioning to Iceberg format and for comprehensively refining and optimizing your data storage strategy. Although full data migration requires more time and resources up front, the resulting improvements in data quality, organization, and query performance can provide long-term benefits. To implement full data migration, use one of the following options:
- Use the CREATE TABLE ... AS SELECT (CTAS) statement in Spark (on Amazon EMR or AWS Glue) or in Athena. You can set the partition specification and table properties for the new Iceberg table by using the PARTITIONED BY and TBLPROPERTIES clauses. You can change the schema and partitioning for the new table according to your needs instead of inheriting them from the source table.
- Read from the source table and write the data as a new Iceberg table by using Spark on Amazon EMR or AWS Glue. For more information, see Creating a table in the Iceberg documentation.
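The two options above can be sketched in PySpark as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the catalog, database, and table names (glue_catalog.analytics.events_iceberg, spark_catalog.legacy.events) and the event_ts column are hypothetical, and the Spark session is assumed to be already configured with an Iceberg catalog. The CTAS form shown is Spark SQL; Athena uses a different CTAS syntax.

```python
def build_ctas(target, source, select_sql="*", partitioned_by=(),
               tblproperties=None, order_by=()):
    """Build a Spark SQL CTAS statement that creates an Iceberg table.

    Inputs are interpolated as-is (no escaping), so pass trusted values only.
    """
    parts = [f"CREATE TABLE {target}", "USING iceberg"]
    if partitioned_by:
        # Partition transforms such as days(col) give Iceberg hidden partitioning.
        parts.append(f"PARTITIONED BY ({', '.join(partitioned_by)})")
    if tblproperties:
        props = ", ".join(f"'{k}'='{v}'" for k, v in tblproperties.items())
        parts.append(f"TBLPROPERTIES ({props})")
    select = f"AS SELECT {select_sql} FROM {source}"
    if order_by:
        # Re-sort the data while it is being rewritten.
        select += f" ORDER BY {', '.join(order_by)}"
    parts.append(select)
    return "\n".join(parts)


def migrate_with_ctas(spark, statement):
    """Option 1: run the CTAS statement on an existing SparkSession."""
    spark.sql(statement)


def migrate_with_dataframe(spark, source, target):
    """Option 2: read the source table and write a new Iceberg table."""
    # Imported lazily so the rest of this module loads without PySpark.
    from pyspark.sql.functions import days  # assumes Spark 3.1 or later

    (
        spark.table(source)
        .writeTo(target)
        .using("iceberg")
        .partitionedBy(days("event_ts"))  # hypothetical timestamp column
        .tableProperty("format-version", "2")
        .create()
    )


# Hypothetical names, for illustration only.
CTAS = build_ctas(
    target="glue_catalog.analytics.events_iceberg",
    source="spark_catalog.legacy.events",
    partitioned_by=["days(event_ts)"],
    tblproperties={"format-version": "2", "write.format.default": "parquet"},
    order_by=["event_ts"],
)
```

With a configured session, migrate_with_ctas(spark, CTAS) would run the generated statement, and migrate_with_dataframe(spark, "spark_catalog.legacy.events", "glue_catalog.analytics.events_iceberg") would perform the equivalent DataFrame-based write. In both cases the new table takes its partitioning and properties from the arguments rather than from the source table.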