Reference architectures for Apache Iceberg on AWS

This section provides examples that show how to apply the best practices in this guide to different use cases, such as nightly batch ingestion and a data lake that combines batch and near real-time ingestion.

Nightly batch ingestion

For this hypothetical use case, let's say that your Iceberg table ingests credit card transactions on a nightly basis. Each batch contains only incremental updates, which must be merged into the target table. Several times per year, full historical data is received. For this scenario, we recommend the following architecture and configurations.
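
A minimal sketch of that nightly merge, using Spark SQL with Iceberg's MERGE INTO support, might look like the following. The catalog, database, table, column, and S3 path names are hypothetical, and the snippet assumes a Spark session that is already configured for Iceberg and the AWS Glue Data Catalog (a configuration sketch appears after the recommendations that follow).

```python
# Sketch only: assumes an existing SparkSession ("spark") configured for Iceberg and
# the AWS Glue Data Catalog (see the configuration sketch later in this section).
# Table, column, and S3 path names are hypothetical.

# Load the nightly incremental batch and expose it as a temporary view.
incremental = spark.read.parquet("s3://example-bucket/raw/credit_card/latest/")
incremental.createOrReplaceTempView("nightly_updates")

# Merge the batch into the target Iceberg table, keyed on a transaction ID.
spark.sql("""
    MERGE INTO glue_catalog.finance.credit_card_transactions AS t
    USING nightly_updates AS u
    ON t.transaction_id = u.transaction_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```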

Note: This is just an example. The optimal configuration depends on your data and requirements.

Diagram: data flows from raw storage through Amazon EMR and AWS Glue ETL jobs to the AWS Glue Data Catalog and the data lake.

Recommendations:

  • Target file size: 128 MB, because Apache Spark tasks process data in 128 MB chunks.

  • Write type: copy-on-write. As detailed earlier in this guide, this approach keeps data files read-optimized, which fits a workload that writes only once per night.

  • Partition variables: year/month/day. In this hypothetical use case, we query recent data most frequently, although we occasionally run full table scans over the past two years of data. Partitioning by day lets the frequent queries on recent data prune most of the table while still supporting the occasional larger scans (see the configuration sketch after this list).

  • Sort order: timestamp

  • Data catalog: AWS Glue Data Catalog
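
As a rough illustration, the following Spark sketch maps these recommendations to Iceberg table settings: a 128 MB target file size, copy-on-write write modes, a day-level partition transform on the transaction timestamp (which supports year/month/day pruning), a timestamp sort order, and the AWS Glue Data Catalog as the catalog. The warehouse bucket, database, table, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

# Sketch only: the warehouse bucket, database, table, and column names are hypothetical.
# The session uses the AWS Glue Data Catalog through Iceberg's GlueCatalog implementation.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-bucket/warehouse/")
    .getOrCreate()
)

# Create the table with a ~128 MB target file size, copy-on-write behavior for updates,
# deletes, and merges, and a day-level partition on the transaction timestamp.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.finance.credit_card_transactions (
        transaction_id STRING,
        card_id        STRING,
        amount         DECIMAL(12, 2),
        txn_timestamp  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(txn_timestamp))
    TBLPROPERTIES (
        'write.target-file-size-bytes' = '134217728',
        'write.update.mode'            = 'copy-on-write',
        'write.delete.mode'            = 'copy-on-write',
        'write.merge.mode'             = 'copy-on-write'
    )
""")

# Sort data files by timestamp at write time (requires the Iceberg SQL extensions above).
spark.sql("""
    ALTER TABLE glue_catalog.finance.credit_card_transactions
    WRITE ORDERED BY txn_timestamp
""")
```

With copy-on-write, each nightly merge rewrites the affected data files in full, which keeps reads fast at the cost of heavier write amplification during the occasional full historical loads.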

Data lake that combines batch and near real-time ingestion

You can provision a data lake on Amazon S3 that shares batch and streaming data across accounts and Regions. For an architecture diagram and details, see the AWS blog post Build a transactional data lake using Apache Iceberg, AWS Glue, and cross-account data shares using AWS Lake Formation and Amazon Athena.