Considerations and limitations for using Hudi on Amazon EMR

  • Record key field cannot be null or empty – The field that you specify as the record key field cannot have null or empty values.

  • Schema updated by default on upsert and insert – Hudi provides an interface, HoodieRecordPayload, that determines how the input DataFrame and the existing Hudi dataset are merged to produce a new, updated dataset. Hudi provides a default implementation of this interface, OverwriteWithLatestAvroPayload, which overwrites existing records and updates the schema as specified in the input DataFrame. To customize this logic, for example to implement merge and partial updates, you can provide your own implementation of the HoodieRecordPayload interface through the DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY parameter, as shown in the sketch that follows.
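
    The following is a minimal sketch of passing a custom payload class on write. The DataFrame upsertDF, the class com.example.PartialUpdatePayload, and the S3 path are hypothetical placeholders; the class must implement HoodieRecordPayload and be on the classpath, and other required write options (record key, partition path, table name, and so on) are omitted for brevity.

      import org.apache.hudi.DataSourceWriteOptions
      import org.apache.spark.sql.SaveMode

      upsertDF.write
        .format("org.apache.hudi")
        // Use a custom HoodieRecordPayload implementation instead of the default OverwriteWithLatestAvroPayload
        .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "com.example.PartialUpdatePayload")
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .mode(SaveMode.Append)
        .save("s3://your-bucket/path/to/hudi-dataset/")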

  • Deletion requires schema – When deleting records, you must specify the record key, the partition key, and the pre-combine key fields. Other columns can be null or empty, but the full schema is still required, as in the sketch that follows.
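
    As an illustration, the following sketch issues deletes by writing a DataFrame that carries the full schema and using the EmptyHoodieRecordPayload payload class. The DataFrame deleteDF, the field names, and the S3 path are hypothetical placeholders; other required write options, such as the table name, are omitted for brevity.

      import org.apache.hudi.DataSourceWriteOptions
      import org.apache.spark.sql.SaveMode

      deleteDF.write
        .format("org.apache.hudi")
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "creation_date")
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "last_update_time")
        // Writing records with this payload class deletes the matching records from the dataset
        .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.common.model.EmptyHoodieRecordPayload")
        .mode(SaveMode.Append)
        .save("s3://your-bucket/path/to/hudi-dataset/")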

  • MoR table limitations – MoR tables do not support savepointing. You can query MoR tables using the read-optimized view or the real-time view (tableName_rt) from Spark SQL, Presto, or Hive. The read-optimized view exposes only base file data; it does not expose a merged view of base and log data. See the example that follows.
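
    For example, if a MoR dataset was synced to the metastore under the hypothetical table name my_hudi_table, the two views can be queried from Spark SQL as shown below. Depending on the Hudi version, the read-optimized view is registered either without a suffix or with an _ro suffix.

      // Read-optimized view: returns data from base files only
      spark.sql("SELECT count(*) FROM my_hudi_table").show()

      // Real-time view: merges base files with log files at query time
      spark.sql("SELECT count(*) FROM my_hudi_table_rt").show()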

  • Hive

    • To register tables in the Hive metastore, Hudi expects the Hive Thrift server to be running on the default port, 10000. If you override this with a custom port, pass the HIVE_URL_OPT_KEY option as shown in the following example.

      .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://localhost:override-port-number")
    • The timestamp data type in Spark is registered as the long data type in Hive, not as Hive's timestamp type.

  • Presto

    • Presto does not support reading MoR real time tables in Hudi versions below 0.6.0.

    • Presto only supports snapshot queries.

    • For Presto to correctly interpret Hudi dataset columns, set the hive.parquet_use_column_names value to true.

      • To set the value for a session, in the Presto shell, run the following command:

        set session hive.parquet_use_column_names=true
      • To set the value at the cluster level, use the presto-connector-hive configuration classification to set hive.parquet.use-column-names to true, as shown in the following example. For more information, see Configure applications.

        [ { "Classification": "presto-connector-hive", "Properties": { "hive.parquet.use-column-names": "true" } } ]
  • HBase Index

    • The HBase version used to build Hudi might be different from what is listed in the EMR Release Guide. To pull in the correct dependencies for your Spark session, run the following command.

      spark-shell \
        --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hudi/cli/lib/*.jar \
        --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
        --conf "spark.sql.hive.convertMetastoreParquet=false"