Modern data lakes - AWS Prescriptive Guidance

Advanced use cases in modern data lakes

Data lakes offer one of the best options for storing data in terms of cost, scalability, and flexibility. You can use a data lake to retain large volumes of structured and unstructured data at a low cost, and use this data for different types of analytics workloads, from business intelligence reporting to big data processing, real-time analytics, machine learning, and generative artificial intelligence (AI), to help guide better decisions.

Despite these benefits, data lakes weren't initially designed with database-like capabilities. A data lake doesn't provide support for atomicity, consistency, isolation, and durability (ACID) transaction semantics, which you might require to effectively optimize and manage your data at scale across hundreds or thousands of users and a multitude of different processing technologies. Data lakes don't provide native support for the following functionality:

  • Performing efficient record-level updates and deletions as data changes in your business

  • Managing query performance as tables grow to millions of files and hundreds of thousands of partitions

  • Ensuring data consistency across multiple concurrent writers and readers

  • Preventing data corruption when write operations fail partway through the operation

  • Evolving table schemas over time without rewriting the underlying datasets, in part or in full

These challenges have become particularly prevalent in use cases such as change data capture (CDC), privacy-driven deletion of data, and streaming data ingestion, all of which generate frequent record-level changes that can fragment tables into many small files and degrade query performance.

Data lakes that use the traditional Hive-format tables support write operations only for entire files. This makes updates and deletes difficult to implement, time consuming, and costly. Moreover, concurrency controls and guarantees offered in ACID-compliant systems are needed to ensure data integrity and consistency.

To help overcome these challenges, Apache Iceberg provides additional database-like functionality that simplifies the optimization and management overhead of data lakes, while still supporting storage on cost-effective systems such as Amazon Simple Storage Service (Amazon S3).
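As a minimal sketch of what this looks like in practice, the following Athena SQL creates an Iceberg table whose data and metadata live in an S3 bucket. The database, table, column, and bucket names are hypothetical:

```sql
-- Create an Iceberg table stored on Amazon S3 (names are placeholders).
CREATE TABLE analytics_db.customer_orders (
  order_id    bigint,
  customer_id bigint,
  order_total decimal(10,2),
  order_date  date
)
PARTITIONED BY (month(order_date))
LOCATION 's3://amzn-s3-demo-bucket/iceberg/customer_orders/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
```

Note the partition transform `month(order_date)`: Iceberg derives partition values from column data, so queries that filter on `order_date` can prune partitions without any explicit partition column in the schema.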

Introduction to Apache Iceberg

Apache Iceberg is an open-source table format that brings features to data lake tables that were historically available only in databases or data warehouses. It's designed for scale and performance, and is well suited for managing tables that span hundreds of gigabytes or more. Some of the main features of Iceberg tables are:

  • Delete, update, and merge. Iceberg supports standard data warehousing SQL commands, such as DELETE, UPDATE, and MERGE, for use with data lake tables.

  • Fast scan planning and advanced filtering. Iceberg stores metadata such as partition and column-level statistics that can be used by engines to speed up planning and running queries.

  • Full schema evolution. Iceberg supports adding, dropping, updating, or renaming columns without side-effects.

  • Partition evolution. You can update the partition layout of a table as data volume or query patterns change. Iceberg supports changing the columns that a table is partitioned on, or adding columns to, or removing columns from, composite partitions.

  • Hidden partitioning. This feature automatically skips unnecessary partitions during reads. It eliminates the need for users to understand the table's partitioning details or to add extra partition filters to their queries.

  • Version rollback. Users can quickly correct problems by reverting to a pre-transaction state.

  • Time travel. Users can query a specific previous version of a table.

  • Serializable isolation. Table changes are atomic, so readers never see partial or uncommitted changes.

  • Concurrent writers. Iceberg uses optimistic concurrency to allow multiple transactions to succeed. In case of conflicts, one of the writers has to retry the transaction.

  • Open file formats. Iceberg supports multiple open source file formats, including Apache Parquet, Apache Avro, and Apache ORC.
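Several of these features map directly onto SQL. The following hedged sketch shows what they look like against a hypothetical Iceberg table, using syntax that engines such as Athena and Spark SQL expose for Iceberg tables (table, column, and timestamp values are illustrative only):

```sql
-- Record-level changes, expressed as ordinary SQL DML
DELETE FROM analytics_db.customer_orders WHERE customer_id = 42;

UPDATE analytics_db.customer_orders
SET order_total = order_total * 0.9
WHERE order_date = DATE '2023-06-01';

-- Schema evolution without rewriting existing data files
ALTER TABLE analytics_db.customer_orders ADD COLUMNS (coupon_code string);

-- Time travel: query the table as of a previous point in time
SELECT * FROM analytics_db.customer_orders
FOR TIMESTAMP AS OF TIMESTAMP '2023-06-01 00:00:00 UTC';
```

Each statement commits as an atomic snapshot, which is what makes the time-travel query at the end possible: earlier snapshots remain addressable until they are expired.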

In summary, data lakes that use the Iceberg format benefit from transactional consistency, speed, scale, and schema evolution. For more information about these and other Iceberg features, see the Apache Iceberg documentation.

AWS support for Apache Iceberg

Apache Iceberg is supported by popular open-source data processing frameworks and by AWS services such as Amazon EMR, Amazon Athena, Amazon Redshift, and AWS Glue. The following diagram depicts a simplified reference architecture of a data lake that's based on Iceberg.

Transactional data lake architecture

The following AWS services provide native Iceberg integrations. There are additional AWS services that can interact with Iceberg, either indirectly or by packaging the Iceberg libraries.

  • Amazon S3 is the best place to build data lakes because of its durability, availability, scalability, security, compliance, and audit capabilities. Iceberg was designed and built to interact with Amazon S3 seamlessly, and provides support for many Amazon S3 features as listed in the Iceberg documentation.

  • Amazon EMR is a big data solution for petabyte-scale data processing, interactive analytics, and machine learning by using open source frameworks such as Apache Spark, Flink, Trino, and Hive. Amazon EMR can run on customized Amazon Elastic Compute Cloud (Amazon EC2) clusters, Amazon Elastic Kubernetes Service (Amazon EKS), AWS Outposts, or Amazon EMR Serverless.

  • Amazon Athena is a serverless, interactive analytics service that's built on open source frameworks. It supports open-table and file formats and provides a simplified, flexible way to analyze petabytes of data where it lives. Athena provides native support for read, time travel, write, and DDL queries for Iceberg and uses the AWS Glue Data Catalog for the Iceberg metastore.

  • Amazon Redshift is a petabyte-scale cloud data warehouse that supports both cluster-based and serverless deployment options. Amazon Redshift Spectrum can query external tables that are registered with the AWS Glue Data Catalog and stored on Amazon S3. Redshift Spectrum also provides support for the Iceberg storage format.

  • AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue 3.0 and later versions support the Iceberg framework for data lakes. You can use AWS Glue Spark jobs to perform read and write operations, including inserts and updates, on Iceberg tables in Amazon S3, and to register and work with those tables through the AWS Glue Data Catalog.

  • AWS Glue Data Catalog provides a Hive metastore-compatible data catalog service that supports Iceberg tables.

  • AWS Glue crawlers can automatically discover and register Iceberg tables in the AWS Glue Data Catalog.

  • Amazon SageMaker supports the storage of feature sets in Amazon SageMaker Feature Store by using Iceberg format.

  • AWS Lake Formation provides coarse-grained and fine-grained access control for data, including Iceberg tables consumed by Athena or Amazon Redshift. To learn more about permissions support for Iceberg tables, see the Lake Formation documentation.
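A common thread in the services above is the AWS Glue Data Catalog: an Iceberg table registered there once can be queried by multiple engines. As a hedged illustration, the following Amazon Redshift SQL exposes a Glue Data Catalog database as an external schema and queries an Iceberg table in it through Redshift Spectrum (the schema, database, table, and IAM role names are placeholders):

```sql
-- In Amazon Redshift: map a Glue Data Catalog database to an external
-- schema, then query the Iceberg tables registered in that database.
CREATE EXTERNAL SCHEMA lake_schema
FROM DATA CATALOG
DATABASE 'analytics_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';

SELECT customer_id, SUM(order_total) AS lifetime_total
FROM lake_schema.customer_orders
GROUP BY customer_id;
```

The same table could be queried from Athena or a Spark job on Amazon EMR or AWS Glue without copying the data, because all engines resolve the table through the shared catalog.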

AWS has a wide range of services that support Iceberg, but covering all of them is beyond the scope of this guide. The sections that follow cover Spark (batch and structured streaming) on Amazon EMR and AWS Glue, as well as Amazon Athena SQL, starting with a quick look at Iceberg support in Athena SQL.