Differences for Hive on Amazon EMR Versions and Default Apache Hive
Differences between Apache Hive on Amazon EMR and Apache Hive
This section describes the differences between Hive on Amazon EMR and the default versions of Hive available at http://svn.apache.org/viewvc/hive/branches/.
Hive Live Long and Process (LLAP) not Supported
LLAP functionality added in version 2.0 of default Apache Hive is not supported in Hive 2.1.0 on Amazon EMR release 5.0.
Differences in Hive Between Amazon EMR Release 4.x and 5.x
This section covers differences to consider before you migrate a Hive implementation from Hive version 1.0.0 on Amazon EMR release 4.x to Hive 2.x on Amazon EMR release 5.x.
Operational Differences and Considerations
Support added for ACID (Atomicity, Consistency, Isolation, and Durability) transactions: This difference between Hive 1.0.0 on Amazon EMR 4.x and default Apache Hive has been eliminated.
Direct writes to Amazon S3 eliminated: This difference between Hive 1.0.0 on Amazon EMR and the default Apache Hive has been eliminated. Hive 2.1.0 on Amazon EMR release 5.x now creates, reads from, and writes to temporary files stored in Amazon S3. As a result, to read from and write to the same table you no longer have to create a temporary table in the cluster's local HDFS file system as a workaround. If you use versioned buckets, be sure to manage these temporary files as described below.
Manage temp files when using Amazon S3 versioned buckets: When you run Hive queries where the destination of generated data is Amazon S3, many temporary files and directories are created. This is new behavior as described earlier. If you use versioned S3 buckets, these temp files clutter Amazon S3 and incur cost if they're not deleted. Adjust your lifecycle rules so that data with a /_tmp prefix is deleted after a short period, such as five days. See Specifying a Lifecycle Configuration for more information.
Log4j updated to log4j 2: If you use log4j, you may need to change your logging configuration because of this upgrade. See Apache log4j 2 for details.
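As an illustration of the lifecycle rule suggested above for versioned buckets, the following Python sketch builds a lifecycle configuration that expires temporary objects after five days. The function name build_tmp_cleanup_lifecycle, the rule ID, and the key prefix warehouse/my_table/_tmp are hypothetical and chosen only for this example. S3 lifecycle filters match from the start of the object key, so the prefix you use needs to include the path that leads to the temp files in your own bucket.

```python
import json


def build_tmp_cleanup_lifecycle(prefix: str, days: int = 5) -> dict:
    """Build an S3 lifecycle configuration that expires objects under `prefix`.

    `prefix` is a hypothetical key prefix; replace it with the location where
    Hive writes its temporary files in your bucket.
    """
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-hive-tmp-after-{days}-days",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Expire current object versions after `days` days.
                "Expiration": {"Days": days},
                # In a versioned bucket, also remove noncurrent versions,
                # otherwise the old versions keep incurring storage cost.
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


# Assumed prefix for illustration only.
config = build_tmp_cleanup_lifecycle("warehouse/my_table/_tmp")
print(json.dumps(config, indent=2))
```

If you use boto3, a dictionary of this shape can be passed as the LifecycleConfiguration argument of the S3 client's put_bucket_lifecycle_configuration call. Review the rule against your bucket's existing lifecycle rules before applying it, because that call replaces the bucket's current configuration.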
Performance Differences and Considerations
Performance differences with Tez: With Amazon EMR release 5.x, Tez is the default execution engine for Hive instead of MapReduce. Tez provides improved performance for most workflows.
ORC file performance: Query performance may be slower than expected for ORC files.
Tables with many partitions: Queries that generate a large number of dynamic partitions may fail, and queries that select from tables with many partitions may take longer than expected to execute. For example, a select from 100,000 partitions may take 10 minutes or more.
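Because Tez replaces MapReduce as the default engine, you may want to compare the two engines on the same workload during a migration. The sketch below builds an Amazon EMR configuration object for the hive-site classification that sets the Hive property hive.execution.engine. The helper name hive_engine_configuration is hypothetical, and restricting the values to tez and mr is an assumption made for this example.

```python
import json


def hive_engine_configuration(engine: str = "tez") -> list:
    """Return an EMR configuration list that sets the Hive execution engine.

    Only "tez" and "mr" (MapReduce) are accepted in this sketch.
    """
    allowed = {"tez", "mr"}
    if engine not in allowed:
        raise ValueError(f"engine must be one of {sorted(allowed)}")
    return [
        {
            "Classification": "hive-site",
            "Properties": {"hive.execution.engine": engine},
        }
    ]


# Example: fall back to MapReduce to compare it against the Tez default.
print(json.dumps(hive_engine_configuration("mr"), indent=2))
```

The printed JSON can be supplied as the configurations for a cluster at creation time. For a single session, the same property can instead be set from the Hive prompt with SET hive.execution.engine=mr;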