Deliver data to Apache Iceberg Tables with Amazon Data Firehose
Apache Iceberg is a high-performance open-source table format for performing big data
analytics. Apache Iceberg brings the reliability and simplicity of SQL tables to Amazon S3 data
lakes, and makes it possible for open-source analytics engines like Spark, Flink, Trino,
Hive, and Impala to concurrently work with the same data. For more information about Apache
Iceberg, see https://iceberg.apache.org/
You can use Firehose to directly deliver streaming data to Apache Iceberg Tables in Amazon S3. With this feature, you can route records from a single stream into different Apache Iceberg Tables, and automatically apply insert, update, and delete operations to records in the Apache Iceberg Tables. Firehose guarantees exactly-once delivery to Iceberg Tables. This feature requires using the AWS Glue Data Catalog.
Note
Firehose supports Apache Iceberg Tables as a destination in US East (N. Virginia), US West (Oregon), Europe (Ireland), Asia Pacific (Tokyo), Canada (Central), and Asia Pacific (Sydney) AWS Regions.
Considerations and limitations
Firehose support for Apache Iceberg tables has the following considerations and limitations.
Throughput – If you use Direct PUT as the source to deliver data to Apache Iceberg tables, then the maximum throughput per stream is 5 MiB/second in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions and 1 MiB/second in the Asia Pacific (Tokyo), Canada (Central), and Asia Pacific (Sydney) Regions. If you only insert data into Iceberg tables, with no updates or deletes, and you want higher throughput for your stream, then you can use the Firehose Limits form to request a throughput limit increase.
Columns – For column names and values, Firehose takes only the first level of nodes in a multi-level nested JSON. The column names and data types of the source data must match those of the target tables for Firehose to deliver successfully. Firehose supports 16 levels of nesting. In the following example of a nested JSON, Firehose picks the nodes that are available in the first level, including the position field, and expects that you have either a struct or map data type column in your Iceberg tables to match the position field.
{
  "version": "2016-04-01",
  "deviceId": "<solution_unique_device_id>",
  "sensorId": "<device_sensor_id>",
  "timestamp": "2024-01-11T20:42:45.000Z",
  "value": "<actual_value>",
  "position": {
    "x": 143.595901,
    "y": 476.399628,
    "z": 0.24234876
  }
}
If the column names or data types do not match, then Firehose throws an error and delivers the data to the S3 error bucket. If all the column names and data types match in the Apache Iceberg tables, but the source record contains an additional field, Firehose skips that field.
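The first-level rule can be checked locally before sending data. The following Python sketch is illustrative only and is not Firehose's implementation: the function name plan_columns, the sample values, and the extra firmware field are hypothetical. It lists which first-level fields would map to columns, which extra fields would be skipped, and which fields are nested objects that need a struct or map column.

```python
import json


def plan_columns(record_json, table_columns):
    """Classify the first-level fields of one JSON record.

    table_columns is the set of column names in a hypothetical target table.
    This only mirrors the documented behavior; it does not validate data types.
    """
    record = json.loads(record_json)
    if not isinstance(record, dict):
        raise ValueError("expected a single JSON object")
    # Only first-level keys are considered as column names.
    matched = {k: v for k, v in record.items() if k in table_columns}
    # Fields present in the source but not in the table are skipped.
    skipped = sorted(k for k in record if k not in table_columns)
    # Nested objects stay under their first-level key (struct or map column).
    nested = sorted(k for k, v in matched.items() if isinstance(v, dict))
    return {"matched": matched, "skipped": skipped, "needs_struct_or_map": nested}


sample = json.dumps({
    "version": "2016-04-01",
    "deviceId": "device-1",
    "sensorId": "sensor-1",
    "timestamp": "2024-01-11T20:42:45.000Z",
    "value": "42",
    "position": {"x": 143.595901, "y": 476.399628, "z": 0.24234876},
    "firmware": "1.2",  # hypothetical extra field with no matching column
})
columns = {"version", "deviceId", "sensorId", "timestamp", "value", "position"}
plan = plan_columns(sample, columns)
print(plan["skipped"])
print(plan["needs_struct_or_map"])
```

In this sketch the nested keys x, y, and z never appear as column names on their own; they stay inside the position value.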
One JSON object per record – You can send only one JSON object in one Firehose record. If you aggregate and send multiple JSON objects inside a record, Firehose throws an error and delivers the data to the S3 error bucket. If you aggregate records with the Kinesis Producer Library (KPL) and ingest data into Firehose with Amazon Kinesis Data Streams as the source, then Firehose automatically de-aggregates the records and uses one JSON object per record.
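For Direct PUT producers, one way to respect this rule is to serialize every event into its own record instead of concatenating several objects into one payload. The following Python sketch assumes the events are already available as dictionaries; the helper name to_firehose_records and the stream name in the comment are hypothetical.

```python
import json


def to_firehose_records(events):
    """Build one record per event so each record holds exactly one JSON object."""
    records = []
    for event in events:
        if not isinstance(event, dict):
            raise TypeError("each event must be a dict that serializes to one JSON object")
        records.append({"Data": json.dumps(event).encode("utf-8")})
    return records


events = [
    {"deviceId": "device-1", "value": "1"},
    {"deviceId": "device-2", "value": "2"},
]
records = to_firehose_records(events)
print(len(records))

# With boto3 installed and credentials configured, the list could be sent with
# the PutRecordBatch API (the stream name below is a placeholder):
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="my-iceberg-stream", Records=records)
```

Each entry in the list has the shape that PutRecordBatch expects, a dictionary with a Data field containing bytes.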
Compaction and storage optimization – Every time you write using Firehose, it commits and generates snapshots, small data files, and delete files. Having thousands of small data files increases metadata overhead and affects read performance. To get optimal query performance, consider a solution that periodically rewrites small data files into fewer, larger data files. This process is called compaction. AWS Glue Data Catalog supports automatic compaction of your Apache Iceberg Tables. For more information, see Compaction management in the AWS Glue User Guide. For additional information, see Automatic compaction of Apache Iceberg Tables. Besides compacting data files, you can also reduce the storage consumption of Iceberg tables with the VACUUM statement, which performs table maintenance on Apache Iceberg tables. Alternatively, you can use AWS Glue Data Catalog, which also supports managed table optimization of Apache Iceberg tables by automatically removing data files and orphaned files and expiring snapshots that are no longer needed. For more information, see this blog post on Storage optimization of Apache Iceberg Tables.
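If you run table maintenance manually rather than through AWS Glue Data Catalog, the statements can be generated from a small script. The following Python sketch assumes the table is queried through Amazon Athena, where OPTIMIZE ... REWRITE DATA USING BIN_PACK compacts data files and VACUUM performs table maintenance. The helper name maintenance_statements and the database and table names are hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def maintenance_statements(database, table):
    """Return Athena SQL statements for compaction and storage cleanup."""
    for name in (database, table):
        # Reject anything that is not a plain identifier before building SQL.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    qualified = f"{database}.{table}"
    return [
        # Rewrites small data files into fewer, larger files (compaction).
        f"OPTIMIZE {qualified} REWRITE DATA USING BIN_PACK",
        # Performs table maintenance such as snapshot expiration.
        f"VACUUM {qualified}",
    ]


statements = maintenance_statements("sensor_db", "device_readings")
for statement in statements:
    print(statement)

# Each statement could then be submitted with the Athena StartQueryExecution API,
# for example through boto3.client("athena").start_query_execution(...).
```

Running compaction before VACUUM is a design choice of this sketch, not a documented requirement.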