Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

How Amazon EMR Hive Differs from Apache Hive

This section describes the differences between Amazon EMR Hive installations and the default versions of Hive available at http://svn.apache.org/viewvc/hive/branches/.

Note

With Hive 0.13.1 on Amazon EMR, certain options introduced in previous versions of Hive on EMR have been removed in favor of greater parity with Apache Hive. For example, the -x option was removed.

Input Format

The Apache Hive default input format is text. The Amazon EMR default input format for Hive is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. You can set the hive.base.inputformat option in Hive to select a different input format, for example:

hive> set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;

To switch back to the default Amazon EMR input format, enter the following:

hive> set hive.base.inputformat=default;

Combine Splits Input Format

If your Hive cluster processes many GZip files, you can improve performance by passing multiple files to each mapper. This reduces the number of mappers needed and can help your jobs complete faster. To do this, specify that Hive use the HiveCombineSplitsInputFormat input format and set the split size, in bytes, as shown in the following example.

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveCombineSplitsInputFormat;
hive> set mapred.min.split.size=100000000;

Note

This functionality was deprecated with Hive 0.13.1. To get the same split input format functionality you would use the following:

set hive.hadoop.supports.splittable.combineinputformat=true;

Log Files

Apache Hive saves Hive log files to /tmp/{user.name}/ in a file named hive.log. Amazon EMR saves Hive logs to /mnt/var/log/apps/. To support concurrent versions of Hive, the log file name depends on the version of Hive you run, as shown in the following table.

Hive Version    Log File Name
0.13.1          hive.log
0.11.0          hive_0110.log
0.8.1           hive_081.log
0.7.1           hive_07_1.log
0.7             hive_07.log
0.5             hive_05.log
0.4             hive.log

Note

Starting with Hive 0.13.1, Amazon EMR uses an unversioned hive.log. Minor versions of Hive share the same log file location as their major version. For example, Hive 0.11.0.1 shares the log file of Hive 0.11.0, Hive 0.8.1.1 shares the log file of Hive 0.8.1, and Hive 0.7.1.3 and Hive 0.7.1.4 share the log file of Hive 0.7.1.
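
As a rough illustration of the table above, the following Python sketch maps a Hive version string to its log file path on Amazon EMR. The helper name hive_log_path is hypothetical, and the rule of falling back to the first three version components for minor versions is an assumption based on the notes in this section; neither is part of Amazon EMR or Hive.

```python
import posixpath

# Log directory and per-version file names, copied from the table above.
EMR_HIVE_LOG_DIR = "/mnt/var/log/apps"
LOG_FILE_BY_VERSION = {
    "0.13.1": "hive.log",
    "0.11.0": "hive_0110.log",
    "0.8.1": "hive_081.log",
    "0.7.1": "hive_07_1.log",
    "0.7": "hive_07.log",
    "0.5": "hive_05.log",
    "0.4": "hive.log",
}


def hive_log_path(version):
    """Return the Hive log path for a version string (hypothetical helper)."""
    parts = version.split(".")
    # Minor versions such as 0.11.0.1 share the log file of 0.11.0, so try
    # the exact version first and then its first three components.
    for candidate in (version, ".".join(parts[:3])):
        if candidate in LOG_FILE_BY_VERSION:
            return posixpath.join(EMR_HIVE_LOG_DIR, LOG_FILE_BY_VERSION[candidate])
    raise KeyError("no log file name is documented for Hive " + version)
```

Versions that do not appear in the table raise KeyError rather than guessing a file name.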

Thrift Service Ports

Thrift is an RPC framework that defines a compact binary serialization format used to persist data structures for later analysis. Normally, Hive configures the Thrift server to operate on the following ports, depending on the Hive version:

Hive Version    Port Number
0.13.1          10000
0.11.0          10004
0.8.1           10003
0.7.1           10002
0.7             10001
0.5             10000

For more information about Thrift services, go to http://wiki.apache.org/thrift/.
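
A client that talks to several Hive versions needs to pick the matching port. The following Python sketch looks up the port from the table above and probes whether anything is accepting TCP connections on it. The function names are hypothetical, the host you pass in is up to you, and the probe only checks that a TCP connection can be opened; it does not speak the Thrift protocol.

```python
import socket

# Thrift service ports per Hive version, copied from the table above.
THRIFT_PORT_BY_VERSION = {
    "0.13.1": 10000,
    "0.11.0": 10004,
    "0.8.1": 10003,
    "0.7.1": 10002,
    "0.7": 10001,
    "0.5": 10000,
}


def hive_thrift_port(version):
    """Return the documented Thrift port for a Hive version (hypothetical helper)."""
    return THRIFT_PORT_BY_VERSION[version]


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, is_listening(master_host, hive_thrift_port("0.11.0")) reports whether the Hive 0.11.0 port is reachable, where master_host stands for the address of your cluster's master node.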

Hive Authorization

Amazon EMR does not support Hive authorization. Amazon EMR clusters run with authorization disabled, so you cannot use Hive authorization in your cluster.

Hive File Merge Behavior with Amazon S3

Apache Hive merges small files at the end of a map-only job if hive.merge.mapfiles is true, and the merge is triggered only if the average output size of the job is less than the hive.merge.smallfiles.avgsize setting. Amazon EMR Hive has exactly the same behavior if the final output path is in HDFS. However, if the output path is in Amazon S3, the hive.merge.smallfiles.avgsize parameter is ignored. In that situation, the merge task is always triggered if hive.merge.mapfiles is set to true.
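
The decision described above can be summarized in a short Python sketch. This is a model of the documented behavior, not Hive's actual implementation: the function name is hypothetical, and recognizing Amazon S3 output by the s3:// and s3n:// URI schemes is an assumption made for the example.

```python
def merge_task_triggered(merge_mapfiles, avg_output_bytes,
                         smallfiles_avgsize, output_path):
    """Model whether the small-file merge runs after a map-only job."""
    if not merge_mapfiles:
        # hive.merge.mapfiles is false: no merge, regardless of output path.
        return False
    # Assumption: Amazon S3 output paths use one of these URI schemes.
    if output_path.startswith(("s3://", "s3n://")):
        # Amazon EMR ignores hive.merge.smallfiles.avgsize for S3 output.
        return True
    # HDFS output: same rule as Apache Hive.
    return avg_output_bytes < smallfiles_avgsize
```

With an average output size of 64000000 bytes and hive.merge.smallfiles.avgsize set to 16000000, this model returns False for an HDFS output path and True for an S3 output path, provided hive.merge.mapfiles is true.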