Amazon EMR
Developer Guide

How Amazon EMR Hive Differs from Apache Hive

This documentation is for AMI versions 2.x and 3.x of Amazon EMR. For information about Amazon EMR releases 4.0.0 and above, see the Amazon EMR Release Guide. For information about managing the Amazon EMR service in 4.x releases, see the Amazon EMR Management Guide.

This section describes the differences between Amazon EMR Hive installations and the default versions of Apache Hive.


With Hive 0.13.1 on Amazon EMR, certain options introduced in previous versions of Hive on Amazon EMR have been removed in favor of greater parity with Apache Hive. For example, the -x option was removed.

Combine Splits Input Format

If your Hive cluster processes many GZip files, you can improve performance by passing multiple files to each mapper. This reduces the number of mappers the job needs and can help it complete faster. To do this, specify that Hive use the HiveCombineSplitsInputFormat input format and set the split size, in bytes, as shown in the following example:

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveCombineSplitsInputFormat;
hive> set mapred.min.split.size=100000000;


This functionality was deprecated with Hive 0.13.1. To get the same split input format functionality, use the following:

set hive.hadoop.supports.splittable.combineinputformat=true;

Log files

Apache Hive saves Hive log files to /tmp/{}/ in a file named hive.log. Amazon EMR saves Hive logs to /mnt/var/log/apps/. In order to support concurrent versions of Hive, the version of Hive that you run determines the log file name, as shown in the following table.

Hive Version    Log File Name

Amazon EMR now uses an unversioned hive.log. All minor versions share the same log location as the major version.

Minor versions of Hive 0.11.0 share the same log file location as Hive 0.11.0.

Minor versions of Hive 0.8.1 share the same log file location as Hive 0.8.1.

Minor versions of Hive 0.7.1 share the same log file location as Hive 0.7.1.

Thrift Service Ports

Thrift is an RPC framework that defines a compact binary serialization format used to persist data structures for later analysis. Normally, Hive configures the server to operate on the following ports.

Hive Version    Port Number
Hive 0.13.1     10000
Hive 0.11.0     10004
Hive 0.8.1      10003
Hive 0.7.1      10002
Hive 0.7        10001
Hive 0.5        10000

For more information about Thrift services, see the Apache Thrift documentation.

Hive Authorization

Amazon EMR supports Hive Authorization for HDFS but not for EMRFS and Amazon S3. Amazon EMR clusters run with authorization disabled by default.
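As a minimal sketch of working with this setting, the following statements check whether authorization is enabled and then grant a privilege. The example assumes Hive's legacy authorization mode and a table stored in HDFS. The names sample_table and sample_user are hypothetical, and whether a session-level change takes effect depends on how your cluster is configured.

```sql
-- Print the current value; clusters run with authorization disabled by default.
set hive.security.authorization.enabled;

-- Assumption: turn on legacy Hive authorization for the current session.
set hive.security.authorization.enabled=true;

-- Hypothetical table and user; the table is assumed to be stored in HDFS.
GRANT SELECT ON TABLE sample_table TO USER sample_user;
```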

Hive File Merge Behavior with Amazon S3

Apache Hive merges small files at the end of a map-only job if hive.merge.mapfiles is true. The merge is triggered only if the average output size of the job is less than the hive.merge.smallfiles.avgsize setting. Amazon EMR Hive behaves the same way if the final output path is in HDFS. If the output path is in Amazon S3, however, the hive.merge.smallfiles.avgsize parameter is ignored, and the merge task is always triggered when hive.merge.mapfiles is set to true.
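The following sketch shows how these two settings might be applied in a Hive session. The threshold value is illustrative only and is not a recommendation.

```sql
-- Merge small files at the end of map-only jobs.
set hive.merge.mapfiles=true;

-- Average output size threshold, in bytes (illustrative value).
-- Honored when the final output path is in HDFS;
-- ignored when the final output path is in Amazon S3.
set hive.merge.smallfiles.avgsize=16000000;
```

Because the threshold is ignored for Amazon S3 output, hive.merge.mapfiles is the only one of these two settings that affects whether the merge task runs there. Setting it to false skips the merge.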

ACID Transactions and Amazon S3

ACID (Atomicity, Consistency, Isolation, Durability) transactions are not supported with Hive data stored in Amazon S3. Attempting to create a transactional table in Amazon S3 causes an exception.
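For illustration, the following sketch shows the kind of table definition this restriction applies to. It assumes a Hive version and configuration in which transactional tables are enabled. The table name, columns, and bucket count are hypothetical.

```sql
-- Hypothetical transactional table with its data in the default HDFS warehouse.
CREATE TABLE acid_events (id INT, payload STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Unsupported: the same definition with a LOCATION clause that points to an
-- Amazon S3 path is expected to fail with an exception.
```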