Deliver data to Apache Iceberg Tables with Amazon Data Firehose

Apache Iceberg is a high-performance open-source table format for performing big data analytics. Apache Iceberg brings the reliability and simplicity of SQL tables to Amazon S3 data lakes, and makes it possible for open-source analytics engines like Spark, Flink, Trino, Hive, and Impala to work with the same data concurrently. For more information about Apache Iceberg, see https://iceberg.apache.org/.

You can use Firehose to deliver streaming data to Apache Iceberg Tables in Amazon S3. With this feature, you can route records from a single stream into different Apache Iceberg Tables, and automatically apply insert, update, and delete operations to records in those tables. Firehose provides exactly-once delivery to Iceberg Tables. This feature requires the AWS Glue Data Catalog.
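As an illustration of the setup described above, the following is a minimal sketch (not a definitive implementation) that builds a request for the Firehose `CreateDeliveryStream` API with an Iceberg destination, using the AWS SDK for Python (boto3). The stream name, role ARN, bucket ARN, account ID, database, table, and the `order_id` unique key are all hypothetical placeholders. The sketch assumes a Direct PUT source and that the IAM role already has the required permissions on the Glue Data Catalog and the S3 bucket.

```python
from typing import Any, Dict, List, Optional


def build_iceberg_stream_request(
    stream_name: str,
    role_arn: str,
    catalog_arn: str,
    backup_bucket_arn: str,
    database: str,
    table: str,
    unique_keys: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for firehose.create_delivery_stream."""
    table_config: Dict[str, Any] = {
        "DestinationDatabaseName": database,
        "DestinationTableName": table,
    }
    # Unique keys identify rows for update and delete operations.
    if unique_keys:
        table_config["UniqueKeys"] = list(unique_keys)

    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",  # assumption: Direct PUT source
        "IcebergDestinationConfiguration": {
            "RoleARN": role_arn,
            # The AWS Glue Data Catalog that holds the Iceberg tables.
            "CatalogConfiguration": {"CatalogARN": catalog_arn},
            "DestinationTableConfigurationList": [table_config],
            # S3 location used for backup and failed-delivery output.
            "S3Configuration": {
                "RoleARN": role_arn,
                "BucketARN": backup_bucket_arn,
            },
        },
    }


def create_stream(request: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request to AWS (requires boto3 and valid credentials)."""
    import boto3  # imported here so building the request needs no AWS SDK

    return boto3.client("firehose").create_delivery_stream(**request)


# Hypothetical values, for illustration only.
request = build_iceberg_stream_request(
    stream_name="example-iceberg-stream",
    role_arn="arn:aws:iam::111122223333:role/example-firehose-role",
    catalog_arn="arn:aws:glue:us-east-1:111122223333:catalog",
    backup_bucket_arn="arn:aws:s3:::amzn-s3-demo-bucket",
    database="example_db",
    table="example_orders",
    unique_keys=["order_id"],
)
```

Calling `create_stream(request)` would submit the request. Building the dictionary separately makes it easy to inspect or adjust the configuration (for example, adding more entries to `DestinationTableConfigurationList`) before anything is created.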

Firehose can also deliver streaming data to Amazon S3 Tables. Amazon S3 Tables provide storage that is optimized for large-scale analytics workloads, with features that continuously improve query performance and reduce storage costs for tabular data. With built-in support for Apache Iceberg, you can query tabular data in Amazon S3 with popular query engines including Amazon Athena, Amazon Redshift, and Apache Spark. For more information on Amazon S3 Tables, see Amazon S3 Tables. Firehose integration with Amazon S3 Tables is in preview in the US East (Ohio), US East (N. Virginia), and US West (Oregon) Regions. Because the integration is in preview, do not use it for production workloads.

For Amazon S3 Tables, Firehose doesn't support the automatic creation of tables. You must create S3 Tables before creating a Firehose stream.
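Because the table must exist before the stream is created, one possible way to prepare it is with the `s3tables` client in boto3. The sketch below only assembles the arguments for `create_namespace` and `create_table`. The table bucket ARN, namespace, and table name are hypothetical, and it assumes the table bucket itself already exists.

```python
from typing import Any, Dict, Tuple


def build_s3_table_requests(
    table_bucket_arn: str, namespace: str, table_name: str
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build keyword arguments for s3tables create_namespace and create_table."""
    namespace_request = {
        "tableBucketARN": table_bucket_arn,
        "namespace": [namespace],  # create_namespace takes a list of names
    }
    table_request = {
        "tableBucketARN": table_bucket_arn,
        "namespace": namespace,
        "name": table_name,
        "format": "ICEBERG",
    }
    return namespace_request, table_request


def create_s3_table(table_bucket_arn: str, namespace: str, table_name: str) -> Dict[str, Any]:
    """Create the namespace and table (requires boto3 and valid credentials)."""
    import boto3  # imported here so building the requests needs no AWS SDK

    ns_req, tbl_req = build_s3_table_requests(table_bucket_arn, namespace, table_name)
    client = boto3.client("s3tables")
    client.create_namespace(**ns_req)
    return client.create_table(**tbl_req)


# Hypothetical values, for illustration only.
ns_request, tbl_request = build_s3_table_requests(
    "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket",
    "example_namespace",
    "example_orders",
)
```

Once the table exists, the Firehose stream can reference it as a destination table.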