Using Apache Iceberg tables with Amazon Redshift

You can use Redshift Spectrum or Redshift Serverless to query Apache Iceberg tables cataloged in the AWS Glue Data Catalog. Apache Iceberg is an open-source table format for data lakes. For more information, see Apache Iceberg in the Apache Iceberg documentation.

Amazon Redshift provides transactional consistency for querying Apache Iceberg tables. You can manipulate the data in your tables using ACID (atomicity, consistency, isolation, durability) compliant services such as Amazon Athena and Amazon EMR while running queries using Amazon Redshift. Amazon Redshift can use the table statistics stored in Apache Iceberg metadata to optimize query plans and reduce file scans during query processing. With Amazon Redshift SQL, you can join Redshift tables with data lake tables.
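As an illustration of joining a local Redshift table with a data lake table, the following minimal sketch builds such a query as a string. The table, schema, and column names (`public.customers`, `iceberg_schema.orders`, `customer_id`, `customer_name`, `amount`) are hypothetical, and the sketch assumes an external schema that points at the Iceberg tables already exists.

```python
def join_query(local_table: str, external_schema: str, iceberg_table: str) -> str:
    """Build a query that joins a local Redshift table with an Iceberg table.

    All identifiers and column names are illustrative assumptions. Values are
    interpolated without escaping, so pass only trusted identifiers.
    """
    return (
        "SELECT c.customer_name,\n"
        "       SUM(o.amount) AS total_amount\n"
        f"FROM {local_table} AS c\n"                         # local Redshift table
        f"JOIN {external_schema}.{iceberg_table} AS o\n"     # Iceberg table in the data lake
        "  ON o.customer_id = c.customer_id\n"
        "GROUP BY c.customer_name;"
    )


# Hypothetical names; replace them with your own tables.
print(join_query("public.customers", "iceberg_schema", "orders"))
```

You can paste the printed statement into query editor v2 or run it from a third-party SQL client.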

To get started using Iceberg tables with Amazon Redshift:

  1. Create an Apache Iceberg table on an AWS Glue Data Catalog database using a compatible service such as Amazon Athena or Amazon EMR. To create an Iceberg table using Athena, see Using Apache Iceberg tables in the Amazon Athena User Guide.

  2. Create an Amazon Redshift cluster or Redshift Serverless workgroup with an associated IAM role that allows access to your data lake. For information on how to create clusters or workgroups, see Amazon Redshift provisioned clusters and Redshift Serverless in the Amazon Redshift Getting Started Guide.

  3. Connect to your cluster or workgroup using query editor v2 or a third-party SQL client. For information about how to connect using query editor v2, see Connecting to an Amazon Redshift database in the Amazon Redshift Management Guide.

  4. Create an external schema in your Amazon Redshift database for a specific Data Catalog database that includes your Iceberg tables. For information about creating an external schema, see Creating external schemas for Amazon Redshift Spectrum.

  5. Run SQL queries to access the Iceberg tables in the external schema you created.
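Steps 4 and 5 can be sketched as follows. This is a minimal example, not a definitive setup: the schema name, Data Catalog database name, table name, and IAM role ARN are hypothetical placeholders, and the helper functions only build SQL text for you to run in your SQL client.

```python
def create_external_schema_sql(schema: str, glue_database: str, iam_role_arn: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA statement for an AWS Glue Data Catalog database.

    Values are interpolated without escaping, so pass only trusted input.
    """
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


def select_sql(schema: str, table: str, limit: int = 10) -> str:
    """Build a simple read-only query against a table in the external schema."""
    return f"SELECT *\nFROM {schema}.{table}\nLIMIT {limit};"


# Hypothetical placeholders: replace with your own schema, database, role, and table.
print(create_external_schema_sql(
    schema="iceberg_schema",
    glue_database="lake_db",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
))
print(select_sql("iceberg_schema", "orders"))
```

The IAM role in the statement must be the one associated with your cluster or workgroup in step 2, so that Amazon Redshift can read both the Data Catalog and the underlying data files.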

Considerations when using Apache Iceberg tables with Amazon Redshift

Consider the following when using Amazon Redshift with Iceberg tables:

  • Iceberg version support – Amazon Redshift supports running queries against the following versions of Iceberg tables:

    • Version 1 defines how large analytic tables are managed using immutable data files.

    • Version 2 adds support for row-level updates and deletes while keeping the existing data files unchanged. Table data changes are handled using delete files.

    For the difference between version 1 and version 2 tables, see Format version changes in the Apache Iceberg documentation.

  • Queries only – Amazon Redshift supports read-only access to Apache Iceberg tables, with transactionally consistent SELECT queries. To define or update the schema of Iceberg tables in the AWS Glue Data Catalog, use a service such as Amazon Athena.

  • Adding partitions – You don't need to manually add partitions for your Apache Iceberg tables. Amazon Redshift automatically detects new partitions, so no manual operation is needed to update partitions in the table definition. Changes to the partition specification are also applied to your queries automatically.

  • Ingesting Iceberg data into Amazon Redshift – You can use the INSERT INTO or CREATE TABLE AS command to import data from your Iceberg table into a local Amazon Redshift table. Currently, you can't use the COPY command to ingest the contents of an Apache Iceberg table into a local Amazon Redshift table.

  • Materialized views – You can create materialized views on Apache Iceberg tables as you would on any other external table in Amazon Redshift. The same considerations that apply to other data lake table formats apply to Apache Iceberg tables. Incremental updates, automatic refreshes, automatic query rewriting, and automatic materialized views on data lake tables are currently not supported.

  • AWS Lake Formation fine-grained access control – Amazon Redshift supports AWS Lake Formation fine-grained access control on Apache Iceberg tables.

  • User-defined data handling parameters – Amazon Redshift supports user-defined data handling parameters on Apache Iceberg tables. You apply these parameters to existing files to tailor the data being queried in external tables and avoid scan errors. They let you handle mismatches between the table schema and the actual data in the files.

  • Data sharing – Amazon Redshift data sharing currently doesn’t support data lake tables, including Apache Iceberg tables.

  • Time travel queries – Time travel queries are currently not supported with Apache Iceberg tables.

  • Pricing – When you access Iceberg tables from a cluster, you are charged Redshift Spectrum pricing. When you access Iceberg tables from a workgroup, you are charged Redshift Serverless pricing. For information about Redshift Spectrum and Redshift Serverless pricing, see Amazon Redshift pricing.
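The ingestion and materialized view considerations above can be sketched as follows. This is a minimal example under stated assumptions: the names `public.orders_local`, `public.orders_mv`, `iceberg_schema`, and `orders` are hypothetical, and the helpers only build SQL text. Because automatic refreshes are listed above as not supported, the sketch refreshes the materialized view with an explicit REFRESH statement.

```python
def ctas_sql(target_table: str, external_schema: str, iceberg_table: str) -> str:
    """CREATE TABLE AS: create a new local table from an Iceberg table."""
    return (
        f"CREATE TABLE {target_table} AS\n"
        f"SELECT *\nFROM {external_schema}.{iceberg_table};"
    )


def insert_into_sql(target_table: str, external_schema: str, iceberg_table: str) -> str:
    """INSERT INTO: append Iceberg rows to an existing local table."""
    return (
        f"INSERT INTO {target_table}\n"
        f"SELECT *\nFROM {external_schema}.{iceberg_table};"
    )


def materialized_view_sql(view: str, external_schema: str, iceberg_table: str) -> list[str]:
    """Create a materialized view on an Iceberg table, then refresh it manually."""
    return [
        f"CREATE MATERIALIZED VIEW {view} AS\n"
        f"SELECT *\nFROM {external_schema}.{iceberg_table};",
        f"REFRESH MATERIALIZED VIEW {view};",
    ]


# Hypothetical names; identifiers are interpolated without escaping,
# so pass only trusted values.
print(ctas_sql("public.orders_local", "iceberg_schema", "orders"))
print(insert_into_sql("public.orders_local", "iceberg_schema", "orders"))
for statement in materialized_view_sql("public.orders_mv", "iceberg_schema", "orders"):
    print(statement)
```

Use CREATE TABLE AS when the local table doesn't exist yet, and INSERT INTO when you want to append to a table you have already defined.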