Hudi
Apache Hudi
Hudi is integrated with Apache Spark, Apache Hive, and Presto.
With Amazon EMR release version 5.28.0 and later, Amazon EMR installs Hudi components by default when Spark, Hive, or Presto is installed. You can use Spark or the Hudi DeltaStreamer utility to create or update Hudi datasets. You can use Hive, Spark, or Presto to query a Hudi dataset interactively, or to build data processing pipelines using incremental pull. Incremental pull refers to the ability to pull only the data that changed between two actions, such as two commits, on the dataset.
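As a rough illustration of writing with Spark and reading with incremental pull, the following Python sketch builds the option maps that a Spark job might pass to the Hudi data source. The helper function names, table name, field names, and S3 path are hypothetical. The `hoodie.*` keys shown are the ones commonly used with Hudi 0.6.x, and other Hudi versions may use different keys, so check them against the version installed on your cluster. The `upsert` and `incremental_pull` helpers assume that you pass in a Spark DataFrame and a SparkSession from a cluster with Hudi available.

```python
from typing import Dict, Optional

# Spark data source name for Hudi.
HUDI_FORMAT = "org.apache.hudi"


def hudi_upsert_options(table_name: str, record_key: str,
                        partition_field: str,
                        precombine_field: str) -> Dict[str, str]:
    """Build write options for an upsert (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        # Field that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to partition the dataset.
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # When two records share a key, the larger value of this field wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


def hudi_incremental_options(begin_instant: str,
                             end_instant: Optional[str] = None) -> Dict[str, str]:
    """Build read options for an incremental pull (hypothetical helper).

    Instant times are commit timestamps such as "20210101000000".
    """
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def upsert(df, base_path: str, options: Dict[str, str]) -> None:
    """Write a Spark DataFrame to a Hudi dataset at base_path."""
    df.write.format(HUDI_FORMAT).options(**options).mode("append").save(base_path)


def incremental_pull(spark, base_path: str, options: Dict[str, str]):
    """Return a DataFrame of records that changed in the requested range."""
    return spark.read.format(HUDI_FORMAT).options(**options).load(base_path)


# Example usage on a cluster (names and path are placeholders):
#   opts = hudi_upsert_options("sensor_readings", "device_id",
#                              "event_date", "updated_at")
#   upsert(updates_df, "s3://example-bucket/hudi/sensor_readings", opts)
#   changed = incremental_pull(spark,
#                              "s3://example-bucket/hudi/sensor_readings",
#                              hudi_incremental_options("20210101000000"))
```

Keeping the options in plain dictionaries makes it possible to inspect or unit test them without a running Spark session.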
These features make Hudi suitable for the following use cases:
- Working with streaming data from sensors and other Internet of Things (IoT) devices that require specific data insertion and update events.
- Complying with data privacy regulations in applications where users might choose to be forgotten or modify their consent for how their data can be used.
- Implementing a change data capture (CDC) system that allows you to apply changes to a dataset over time.
The version of Hudi installed with Amazon EMR 5.32.0 is 0.6.0-amzn-0.
Topics