This section describes the differences between Amazon EMR Hive installations and the default versions of Hive available at http://svn.apache.org/viewvc/hive/branches/.
The Apache Hive default input format is text. The Amazon EMR default input
format for Hive is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. To
select a different input format, set the hive.base.inputformat option in
Hive, for example:
hive> set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
To switch back to the default Amazon EMR input format, enter the following:
hive> set hive.base.inputformat=default;
If you have many GZip files in your Hive cluster, you can optimize performance by passing multiple files to each mapper.
This reduces the number of mappers needed in your cluster and can help your jobs complete faster. To do this, specify that
Hive use the HiveCombineSplitsInputFormat input format and set the minimum split size, in bytes, as shown in the following example.
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveCombineSplitsInputFormat;
hive> set mapred.min.split.size=100000000;
Note
This input format was added with Hive 0.8.1 and is available only in clusters running Hive 0.8.1 or later.
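To get a feel for how the minimum split size affects the number of mappers, the following Python sketch packs files into groups until each group reaches the configured size. It is a simplified model, not the actual grouping logic of the input format. The function name `pack_files` and the sample file sizes are hypothetical.

```python
def pack_files(file_sizes, min_split_size):
    """Group file sizes (bytes), in order, until each group reaches min_split_size.

    Simplified model only: the real input format may group files differently.
    """
    groups = []
    current, current_bytes = [], 0
    for size in file_sizes:
        current.append(size)
        current_bytes += size
        if current_bytes >= min_split_size:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        # Leftover files that never reached the minimum still need a mapper.
        groups.append(current)
    return groups


# Hypothetical workload: 50 gzip files of 10 MB each, with
# mapred.min.split.size=100000000 as in the example above.
sizes = [10_000_000] * 50
groups = pack_files(sizes, 100_000_000)
print(len(sizes), "files ->", len(groups), "mappers")  # 50 files -> 5 mappers
```

Under this model, one mapper per file would need 50 mappers, while combining needs 5.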
Apache Hive saves Hive log files to /tmp/{user.name}/ in a file named
hive.log. Amazon EMR saves Hive logs to
/mnt/var/log/apps/. To support concurrent versions of Hive, the
version of Hive you run determines the log file name, as shown in the following table.
| Hive Version | Log File Name |
|---|---|
| 0.4 | hive.log |
| 0.5 | hive_05.log |
| 0.7 | hive_07.log |
| 0.7.1 | hive_07_1.log. Note: Minor versions of Hive 0.7.1, such as Hive 0.7.1.3 and Hive 0.7.1.4, share the same log file location as Hive 0.7.1. |
| 0.8.1 | hive_081.log. Note: Minor versions of Hive 0.8.1, such as Hive 0.8.1.1, share the same log file location as Hive 0.8.1. |
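
The version-to-log-file mapping in the table can be sketched as a small Python lookup. The helper name `hive_log_path` is hypothetical. The handling of minor versions assumes that only four-part versions (such as 0.7.1.3) fall back to their three-part parent, which is all the notes in the table state.

```python
import posixpath

LOG_DIR = "/mnt/var/log/apps"

# Log file names by Hive version, copied from the table above.
LOG_FILES = {
    "0.4": "hive.log",
    "0.5": "hive_05.log",
    "0.7": "hive_07.log",
    "0.7.1": "hive_07_1.log",
    "0.8.1": "hive_081.log",
}


def hive_log_path(version):
    """Return the Amazon EMR log file path for a Hive version string."""
    key = version
    parts = version.split(".")
    if key not in LOG_FILES and len(parts) == 4:
        # Assumption: a minor version such as 0.7.1.3 shares the log
        # file of its parent version, 0.7.1.
        key = ".".join(parts[:3])
    if key not in LOG_FILES:
        raise ValueError(f"no log file listed for Hive {version}")
    return posixpath.join(LOG_DIR, LOG_FILES[key])


print(hive_log_path("0.7.1.4"))  # /mnt/var/log/apps/hive_07_1.log
```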
Thrift is an RPC framework that defines a compact binary serialization format used to persist data structures for later analysis. Normally, Hive configures its Thrift server to operate on the following ports, which depend on the Hive version:
| Hive Version | Port Number |
|---|---|
| Hive 0.5 | 10000 |
| Hive 0.7 | 10001 |
| Hive 0.7.1 | 10002 |
| Hive 0.8.1 | 10003 |
For more information about Thrift services, go to http://wiki.apache.org/thrift/.
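
As a sketch of how a client script might pick the right port, the following Python snippet looks up the Thrift port for a given Hive version using the table above, and includes a helper that checks whether anything is listening there. The names `thrift_port` and `is_listening` are hypothetical, and the lookup assumes the version string matches a table entry exactly.

```python
import socket

# Thrift server ports by Hive version, copied from the table above.
THRIFT_PORTS = {
    "0.5": 10000,
    "0.7": 10001,
    "0.7.1": 10002,
    "0.8.1": 10003,
}


def thrift_port(hive_version):
    """Return the Thrift port listed for a Hive version."""
    try:
        return THRIFT_PORTS[hive_version]
    except KeyError:
        raise ValueError(f"no Thrift port listed for Hive {hive_version}") from None


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(thrift_port("0.8.1"))  # 10003
```

On the master node, a call such as `is_listening("localhost", thrift_port("0.8.1"))` would indicate whether a server is accepting connections on that port.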
Amazon EMR does not support Hive authorization. Amazon EMR clusters run with authorization disabled, so you cannot use Hive authorization in your Amazon EMR cluster.
Apache Hive merges small files at the end of a map-only job if hive.merge.mapfiles is true. The merge is triggered only if the average output size of the job is less than the hive.merge.smallfiles.avgsize setting.
Amazon EMR Hive has exactly the same behavior if the final output path is in HDFS. However, if the output path is in Amazon S3, the hive.merge.smallfiles.avgsize parameter is ignored, and the merge task is always triggered if hive.merge.mapfiles is set to true.
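
The merge rule described above can be expressed as a short decision function. This Python sketch models the documented behavior only and is not Hive's implementation. The function name `merge_triggered`, the URI schemes used to detect an S3 path, and the sample sizes are assumptions made for illustration.

```python
# Assumption: S3 output paths are recognized by these URI schemes.
S3_SCHEMES = ("s3://", "s3n://")


def merge_triggered(merge_mapfiles, avg_output_size, smallfiles_avgsize,
                    output_path, on_emr=True):
    """Model whether a small-file merge runs at the end of a map-only job.

    merge_mapfiles      -- value of hive.merge.mapfiles
    avg_output_size     -- average output file size of the job, in bytes
    smallfiles_avgsize  -- value of hive.merge.smallfiles.avgsize, in bytes
    output_path         -- final output path of the job
    on_emr              -- True for Amazon EMR Hive, False for Apache Hive
    """
    if not merge_mapfiles:
        return False
    if on_emr and output_path.startswith(S3_SCHEMES):
        # On Amazon EMR with S3 output, the size threshold is ignored.
        return True
    return avg_output_size < smallfiles_avgsize


# Hypothetical job: 200 MB average output, 16 MB threshold.
print(merge_triggered(True, 200_000_000, 16_000_000, "hdfs:///output/"))       # False
print(merge_triggered(True, 200_000_000, 16_000_000, "s3://mybucket/output/"))  # True
```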