Apache Hudi and Lake Formation - Amazon EMR

Amazon EMR releases 6.9.0 and later include limited support for Lake Formation based access control with Apache Hudi when you read data using Spark SQL. Support covers SELECT queries with Spark SQL and is limited to column-level access control. With this feature, you can run the following:

  • Snapshot queries on copy-on-write tables to query the latest snapshot of the table at a given commit or compaction instant.

  • Read-optimized queries on merge-on-read tables to query the latest compacted data, which might not include the freshest updates in the log files that haven't been compacted yet.

The following support matrix lists some core features of Apache Hudi with Lake Formation:

Feature                               Copy on Write     Merge on Read
Snapshot Queries - Spark SQL          Y                 N
Read Optimized Queries - Spark SQL    Not Applicable    Y
Incremental Queries                   N                 N
Time Travel Queries                   N                 N
Spark Datasource Queries              N                 N
Spark Datasource Writes               N                 N
DML/DDL                               N                 N
Metadata Table                        N                 N
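
As an illustration, the support matrix above can be encoded as a small shell guard that a job script calls before it issues a query. This is only a sketch: the function name `hudi_lf_supported` and the short keys (`cow`, `mor`, `snapshot`, `read_optimized`) are hypothetical names chosen here, not part of Amazon EMR or Apache Hudi.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) only for the table type and
# query type combinations marked "Y" in the support matrix.
#   $1 = table type: cow (copy on write) or mor (merge on read)
#   $2 = query type: snapshot, read_optimized, incremental, time_travel, ...
hudi_lf_supported() {
  case "$1:$2" in
    cow:snapshot)        return 0 ;;  # snapshot queries on copy-on-write tables
    mor:read_optimized)  return 0 ;;  # read-optimized queries on merge-on-read tables
    *)                   return 1 ;;  # everything else is "N" or "Not Applicable"
  esac
}

# Example use inside a job script:
#   if hudi_lf_supported mor snapshot; then
#     echo "supported"
#   else
#     echo "not supported with Lake Formation" >&2
#   fi
```

Keeping the check in one function means the script has a single place to update if a later release changes the matrix.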

Querying Hudi tables

This section shows how to run the supported queries described above on a Lake Formation enabled cluster. The table should be registered as a catalog table.

  1. To start the Spark shell or the Spark SQL CLI, use one of the following commands.

    spark-shell --jars /usr/lib/hudi/hudi-spark-bundle.jar \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

    spark-sql --jars /usr/lib/hudi/hudi-spark-bundle.jar \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
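  2. To query the latest snapshot of copy-on-write tables, use one of the following commands. The first is a Spark SQL statement, and the second is the equivalent call in the Spark shell.

    SELECT * FROM <my_hudi_cow_table>
    spark.read.table("<my_hudi_cow_table>")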
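  3. To query the latest compacted data of merge-on-read (MOR) tables, query the read-optimized table, which has the suffix _ro. The first command is a Spark SQL statement, and the second is the equivalent call in the Spark shell.

    SELECT * FROM <my_hudi_mor_table>_ro
    spark.read.table("<my_hudi_mor_table>_ro")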
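
The steps above can also be combined into a non-interactive script. The following is a minimal sketch, assuming the same bundle path and serializer setting shown in step 1 and using the Spark SQL CLI's `-e` option to pass a single statement. The function names and the example table name are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: helper names below are hypothetical, not an EMR or Hudi API.

# Path to the Hudi bundle, as used in step 1.
HUDI_BUNDLE_JAR="/usr/lib/hudi/hudi-spark-bundle.jar"

# Build a snapshot query for a copy-on-write table.
cow_snapshot_query() {
  printf 'SELECT * FROM %s' "$1"
}

# Build a read-optimized query for a merge-on-read table by
# appending the _ro suffix to the base table name.
mor_read_optimized_query() {
  printf 'SELECT * FROM %s_ro' "$1"
}

# Run one SQL statement through the Spark SQL CLI with the Hudi bundle loaded.
run_hudi_query() {
  spark-sql --jars "$HUDI_BUNDLE_JAR" \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    -e "$1"
}

# Example (requires a Lake Formation enabled cluster and a registered table;
# my_db.my_hudi_mor_table is a placeholder name):
#   run_hudi_query "$(mor_read_optimized_query my_db.my_hudi_mor_table)"
```

Separating query construction from execution keeps the `_ro` naming rule in one place and lets the query strings be checked without a running cluster.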
Note

Reads on Lake Formation clusters might be slower because some optimizations are not supported, including file listing based on Hudi metadata and data skipping. We recommend that you test your application's performance to ensure that it meets your SLA.