Choosing a migration strategy - AWS Prescriptive Guidance

Choosing a migration strategy

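When you migrate a table to Apache Iceberg format, you choose between in-place migration, which keeps the existing data files and creates Iceberg metadata for them, and full data migration, which rewrites the data into a new Iceberg table. To determine the most suitable approach for your needs, consider the following questions and recommendations: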

Questions and recommendations

What is the data file format (for example, CSV or Apache Parquet)?

  • Consider in-place migration if your table file format is Parquet, ORC, or Avro.

  • For other formats such as CSV, JSON, and so on, use full data migration.
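
The file format check above can be sketched as a small helper. This is a minimal illustration, not part of the original guidance: the function name `plan_migration`, the catalog name `glue_catalog`, and the table identifiers are hypothetical. It assumes a Spark environment configured with an Iceberg catalog, where the `migrate` procedure and `CREATE TABLE ... AS SELECT` are available. The helper only builds the statement and does not run it.

```python
# File formats whose existing data files Iceberg can reference without a rewrite.
IN_PLACE_FORMATS = {"parquet", "orc", "avro"}


def plan_migration(file_format: str, catalog: str, source: str, target: str) -> str:
    """Return a Spark SQL statement for the strategy suggested by the file format.

    All identifiers are placeholders (assumptions for illustration only).
    """
    if file_format.strip().lower() in IN_PLACE_FORMATS:
        # In-place migration: create Iceberg metadata over the existing files.
        return f"CALL {catalog}.system.migrate(table => '{source}')"
    # Full data migration: read the source table and rewrite it as Iceberg.
    return f"CREATE TABLE {catalog}.{target} USING iceberg AS SELECT * FROM {source}"


print(plan_migration("Parquet", "glue_catalog", "sales.orders", "sales.orders_iceberg"))
print(plan_migration("csv", "glue_catalog", "sales.orders", "sales.orders_iceberg"))
```

The `migrate` procedure replaces the source table, so testing first against a copy or with a non-destructive procedure such as `snapshot` is a reasonable precaution.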

Do you want to update or consolidate the table schema?

  • If you want to evolve the table schema by using Iceberg native capabilities, consider in-place migration. For example, you can rename columns after the migration. (The schema can be changed in the Iceberg metadata layer.)

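  • If you want to remove entire columns from the data files because they are no longer needed, we recommend that you use full data migration. (Dropping a column in Iceberg changes only the metadata; the existing data files still contain the column.)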
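
As a rough sketch of the two schema options above, the following helpers build the corresponding Spark SQL. The function names, table names, and column names are hypothetical and chosen only for illustration. The first statement is a metadata-only rename on a table that is already in Iceberg format; the second rewrites the data while leaving out the unwanted columns.

```python
from typing import Iterable


def rename_column_sql(table: str, old_name: str, new_name: str) -> str:
    """Metadata-only change after an in-place migration."""
    return f"ALTER TABLE {table} RENAME COLUMN {old_name} TO {new_name}"


def ctas_keeping_columns_sql(
    source: str, target: str, columns: Iterable[str], drop: Iterable[str]
) -> str:
    """Full data migration that omits the dropped columns from the new files."""
    dropped = set(drop)
    kept = [c for c in columns if c not in dropped]
    if not kept:
        raise ValueError("at least one column must remain in the new table")
    return (
        f"CREATE TABLE {target} USING iceberg "
        f"AS SELECT {', '.join(kept)} FROM {source}"
    )


print(rename_column_sql("glue_catalog.sales.orders", "cust_id", "customer_id"))
print(
    ctas_keeping_columns_sql(
        "sales.orders",
        "glue_catalog.sales.orders_iceberg",
        columns=["order_id", "cust_id", "legacy_flag"],
        drop=["legacy_flag"],
    )
)
```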

Would the table benefit from changing the partition strategy?

  • If Iceberg's partition evolution meets your requirements (that is, new data is written with the new partition layout while existing partitions remain as they are), consider in-place migration.

  • If you want to use hidden partitions in your table, consider full data migration. For more information about hidden partitions, see the Best practices section.
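
The two partitioning options above might look like the following in Spark SQL. This is a sketch under assumptions: the helper names, table names, and columns are hypothetical, and `ALTER TABLE ... ADD PARTITION FIELD` assumes that the Iceberg SQL extensions are enabled in the Spark session. The first statement evolves the partition spec of an already migrated table; the second rewrites the data into a table that is partitioned by a transform of a column.

```python
def add_partition_field_sql(table: str, transform: str) -> str:
    """Partition evolution: applies to newly written data only."""
    return f"ALTER TABLE {table} ADD PARTITION FIELD {transform}"


def ctas_with_partition_transform_sql(source: str, target: str, transform: str) -> str:
    """Full data migration into a table partitioned by a transform (hidden partitioning)."""
    return (
        f"CREATE TABLE {target} USING iceberg "
        f"PARTITIONED BY ({transform}) "
        f"AS SELECT * FROM {source}"
    )


print(add_partition_field_sql("glue_catalog.sales.orders", "bucket(16, customer_id)"))
print(
    ctas_with_partition_transform_sql(
        "sales.orders", "glue_catalog.sales.orders_iceberg", "days(order_ts)"
    )
)
```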

Would the table benefit from adding or changing the sort order strategy?

  • Adding or changing the sort order of your data requires rewriting the dataset. In this case, consider using full data migration.

  • For large tables where it's prohibitively expensive to rewrite all the table partitions, consider using in-place migration and then running compaction (with sorting enabled) on the most frequently accessed partitions.
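
Compaction with sorting on a subset of partitions, as described above, can be expressed with Iceberg's `rewrite_data_files` Spark procedure. The following is a minimal sketch that only builds the call; the helper name, catalog, table, sort column, and filter are assumptions made for illustration.

```python
def sorted_compaction_sql(catalog: str, table: str, sort_order: str, where: str) -> str:
    """Build a rewrite_data_files call that sorts only the rows matching `where`."""
    for value in (table, sort_order, where):
        # The arguments are embedded in single-quoted SQL strings, so reject
        # single quotes rather than attempt escaping in this sketch.
        if "'" in value:
            raise ValueError("use double quotes for string literals inside arguments")
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"strategy => 'sort', "
        f"sort_order => '{sort_order}', "
        f"where => '{where}')"
    )


# Hypothetical example: sort only the 2024 partition by customer_id.
print(
    sorted_compaction_sql(
        "glue_catalog", "sales.orders", "customer_id ASC NULLS LAST", "order_year = 2024"
    )
)
```

Restricting the rewrite with `where` keeps the cost proportional to the partitions that queries actually read, which is the point of the second recommendation.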

Does the table have many small files?

  • Merging small files into larger files requires rewriting the dataset. In this case, consider using full data migration.

  • For large tables where it's prohibitively expensive to rewrite all the table partitions, consider using in-place migration and then running compaction (with sorting enabled) on the most frequently accessed partitions.
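
To judge whether a table has "many small files", one option is to look at the share of data files below a size threshold. The sketch below is an illustration only: the function name and the 32 MiB default threshold are assumptions, not values from this guidance, and the file sizes would come from your own inventory (for example, an object listing of the table location).

```python
from typing import Iterable

MIB = 1024 * 1024


def small_file_ratio(file_sizes_bytes: Iterable[int], threshold_bytes: int = 32 * MIB) -> float:
    """Return the fraction of files smaller than the threshold (0.0 for no files)."""
    sizes = list(file_sizes_bytes)
    if not sizes:
        return 0.0
    small = sum(1 for size in sizes if size < threshold_bytes)
    return small / len(sizes)


# Hypothetical partition with two small files and two large files.
sizes = [1 * MIB, 4 * MIB, 200 * MIB, 512 * MIB]
print(small_file_ratio(sizes))  # 0.5
```

A high ratio in frequently queried partitions suggests that those partitions are good candidates for compaction, whichever migration strategy you choose.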