Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Supported Hive Versions

You can run Hive in several different configurations. Set the --hive-versions and --ami-version options in the job creation call, as shown in the following table.

The default configuration for Amazon EMR is the latest version of Hive running on the latest AMI version.

The Amazon EMR console does not support Hive versioning and always loads the latest version of Hive.

Versions of the Amazon EMR CLI released on 9 April 2012 and later load the latest version of Hive by default. To use a version of Hive other than the latest, specify the --hive-versions option when you create the cluster. Versions of the Amazon EMR CLI released prior to 9 April 2012 load the default configuration of Hive.

Calls to the API launch the default configuration of Hive, unless you specify the --hive-versions option for the step that loads Hive onto the cluster during the call to RunJobFlow.
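As a rough illustration of the API case, the following sketch builds the step definition that a RunJobFlow request could carry to install Hive at a chosen version. The script-runner JAR location, the hive-script path, and the --base-path and --install-hive arguments are assumptions based on the us-east-1 library bucket and should be checked against the API reference. The helper name hive_install_step is hypothetical, and no request is sent.

```python
# Assumed library locations; verify them for your region before use.
SCRIPT_RUNNER_JAR = (
    "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
)
HIVE_BASE_PATH = "s3://us-east-1.elasticmapreduce/libs/hive/"


def hive_install_step(hive_versions=None):
    """Return a dict shaped like a RunJobFlow step that installs Hive.

    When hive_versions is None, no --hive-versions argument is added, so
    the default Hive configuration would be launched.
    """
    args = [
        HIVE_BASE_PATH + "hive-script",
        "--base-path", HIVE_BASE_PATH,
        "--install-hive",
    ]
    if hive_versions is not None:
        args += ["--hive-versions", hive_versions]
    return {
        "Name": "Setup Hive",
        "ActionOnFailure": "TERMINATE_JOB_FLOW",
        "HadoopJarStep": {"Jar": SCRIPT_RUNNER_JAR, "Args": args},
    }


if __name__ == "__main__":
    print(hive_install_step("0.7.1")["HadoopJarStep"]["Args"])
```

The resulting dictionary would be placed in the list of steps of the request, using whichever client library you use to call RunJobFlow.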

Hive Version | Compatible Hadoop Versions | Hive Version Notes
0.11.0.2

2.2.0

Introduces the following features and improvements. For more information, see Apache Hive 0.11.0 Release Notes.

  • Adds the Parquet library.

  • Fixes a problem related to the Avro serializer/deserializer accepting a schema URL in Amazon S3.

  • Fixes a problem with Hive returning incorrect results with indexing turned on.

  • Changes Hive's log level from DEBUG to INFO.

  • Fixes a problem when tasks do not report progress while deleting files in Amazon S3 dynamic partitions.

  • This Hive version fixes the following issues:

0.11.0.1

1.0.3

2.2.0

  • Creates symlink /home/hadoop/hive/lib/hive_contrib.jar for backward compatibility.

  • Fixes a problem that prevents installation of Hive 0.11.0 with IAM roles.

0.11.0

1.0.3

2.2.0

Introduces the following features and improvements. For more information, see Apache Hive 0.11.0 Release Notes.

  • Simplifies hive.metastore.uris and the hive.metastore.local configuration settings. (HIVE-2585)

  • Changes the internal representation of binary type to byte[]. (HIVE-3246)

  • Allows HiveStorageHandler.configureTableJobProperties() to signal to its handler whether the configuration is input or output. (HIVE-2773)

  • Adds environment context to metastore Thrift calls. (HIVE-3252)

  • Adds a new, optimized row columnar file format. (HIVE-3874)

  • Implements TRUNCATE. (HIVE-446)

  • Adds LEAD/LAG/FIRST/LAST analytical windowing functions. (HIVE-896)

  • Adds DECIMAL data type. (HIVE-2693)

  • Supports Hive list bucketing/DML. (HIVE-3073)

  • Supports custom separator for file output. (HIVE-3682)

  • Supports ALTER VIEW AS SELECT. (HIVE-3834)

  • Adds method to retrieve uncompressed/compressed sizes of columns from RC files. (HIVE-3897)

  • Allows updating bucketing/sorting metadata of a partition through the CLI. (HIVE-3903)

  • Allows PARTITION BY/ORDER BY in OVER clause and partition function. (HIVE-4048)

  • Improves GROUP BY syntax. (HIVE-581)

  • Adds more query plan optimization rules. (HIVE-948)

  • Allows CREATE TABLE LIKE command to accept TBLPROPERTIES. (HIVE-3527)

  • Fixes sort-merge join with sub-queries. (HIVE-3633)

  • Supports altering partition column type. (HIVE-3672)

  • De-emphasizes mapjoin hint. (HIVE-3784)

  • Changes object inspectors to initialize based on partition metadata. (HIVE-3833)

  • Adds merge map-job followed by map-reduce job. (HIVE-3952)

  • Optimizes hive.enforce.bucketing and hive.enforce.sorting insert. (HIVE-4240)

0.8.1.8

1.0.3

0.8.1.7

1.0.3

  • Fixes ColumnPruner so that it works on LateralView. (HIVE-3226)

  • Fixes utc_from_timestamp and utc_to_timestamp to return correct results. (HIVE-2803)

  • Fixes a NullPointerException error on a join query with authorization enabled. (HIVE-3225)

  • Improves mapjoin filtering in the ON condition. (HIVE-2101)

  • Preserves the filter on an OUTER JOIN condition while merging the join tree. (HIVE-3070)

  • Fixes ConcurrentModificationException on a lateral view used with explode. (HIVE-2540)

  • Fixes an issue where an insert into a table overwrites the existing table, if the table name contains an uppercase character. (HIVE-3062)

  • Fixes an issue where jobs fail when there are multiple aggregates in a query. (HIVE-3732)

  • Fixes a NullPointerException error in nested user-defined aggregation functions (UDAFs). (HIVE-1399)

  • Provides an error message when using a user-defined aggregation function (UDAF) in the place of a user-defined function (UDF). (HIVE-2956)

  • Fixes an issue where Timestamp values without a nano-second part break the following columns in a row. (HIVE-3090)

  • Fixes an issue where the move task is not picking up changes to hive.exec.max.dynamic.partitions set in the Hive CLI. (HIVE-2918)

  • Adds the ability to atomically add and drop partitions from the metastore. (HIVE-2777)

  • Adds partition pruning pushdown to the database for non-string partitions. (HIVE-2702)

  • Adds support for merging small files in Amazon S3 at the end of a map-only job using the hive.merge.mapfiles parameter. If the output path is in Amazon S3, the hive.merge.smallfiles.avgsize setting is ignored. For more information, see Hive File Merge Behavior with Amazon S3 and Hive Configuration Variables.

  • Improves clean-up of junk files after an INSERT OVERWRITE command.

0.8.1.6

1.0.3

0.8.1.5

1.0.3

  • Adds support for the new DynamoDB binary data type.

  • Adds the patch HIVE-2955, which fixes an issue where queries consisting only of metadata always return an empty value.

  • Adds the patch HIVE-1376, which fixes an issue where Hive would crash on an empty result set generated by "where false" clause queries.

  • Fixes the RCFile interaction with Amazon Simple Storage Service (Amazon S3).

  • Replaces JetS3t with the AWS SDK for Java.

  • Uses BatchWriteItem for puts to DynamoDB.

  • Adds schemaless mapping of DynamoDB tables into a Hive table using a Hive map<string, string> column.

0.8.1.4

1.0.3

Updates the HBase client on Hive clusters to version 0.92.0 to match the version of HBase used on HBase clusters. This fixes issues that occurred when connecting to an HBase cluster from a Hive cluster.

0.8.1.3

1.0.3

Adds support for Hadoop 1.0.3.

0.8.1.2

1.0.3, 0.20.205

Fixes an issue with duplicate data in large clusters.

0.8.1.1

1.0.3, 0.20.205

Adds support for MapR and HBase.

0.8.1

1.0.3, 0.20.205

Introduces new features and improvements. The most significant of these are as follows. For more information about the changes in Hive 0.8.1, go to Apache Hive 0.8.1 Release Notes.

0.7.1.4

0.20.205

Prevents the "SET" command in Hive from changing the current database of the current session.

0.7.1.3

0.20.205

Adds the dynamodb.retry.duration option, which you can use to configure the timeout duration for retrying Hive queries against tables in Amazon DynamoDB. This version of Hive also supports the dynamodb.endpoint option, which you can use to specify the Amazon DynamoDB endpoint to use for a Hive table. For more information about these options, see Hive Options.

0.7.1.2

0.20.205

Modifies the way files are named in Amazon S3 for dynamic partitions by prepending a unique identifier to the file names. With Hive 0.7.1.2 you can run queries in parallel by using set hive.exec.parallel=true. This version also fixes an issue with filter pushdown when accessing DynamoDB with sparse data sets.

0.7.1.1

0.20.205

Introduces support for accessing DynamoDB, as detailed in Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR. It is a minor version of 0.7.1 developed by the Amazon EMR team. When specified as the Hive version, Hive 0.7.1.1 overwrites the Hive 0.7.1 directory structure and configuration with its own values. Specifically, Hive 0.7.1.1 matches Apache Hive 0.7.1 and uses the Hive server port, database, and log location of 0.7.1 on the cluster.

0.7.1

0.20.205, 0.20, 0.18

Improves Hive query performance for a large number of partitions and for Amazon S3 queries. Changes Hive to skip commented lines.

0.7

0.20, 0.18

Improves Recover Partitions to use less memory, fixes the hashCode method, and introduces the ability to use the HAVING clause to filter on groups by expressions.

0.5

0.20, 0.18

Fixes issues with FileSinkOperator and modifies UDAFPercentile to tolerate null percentiles.

0.4

0.20, 0.18

Introduces the ability to write to Amazon S3, run Hive scripts from Amazon S3, and recover partitions from table data stored in Amazon S3. Also creates a separate namespace for managing Hive variables.

For more information about the changes in a version of Hive, see Supported Hive Versions. For information about Hive patches and functionality developed by the Amazon EMR team, see Additional Features of Hive in Amazon EMR.
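The compatibility columns of the table above can be captured as a small lookup, which may be convenient when scripting cluster creation. This is only a sketch: the mapping is transcribed from the table, and the names HIVE_HADOOP_COMPAT, compatible_hadoop_versions, and hive_versions_for_hadoop are hypothetical.

```python
# Hive version -> compatible Hadoop versions, transcribed from the table.
HIVE_HADOOP_COMPAT = {
    "0.11.0.2": ["2.2.0"],
    "0.11.0.1": ["1.0.3", "2.2.0"],
    "0.11.0": ["1.0.3", "2.2.0"],
    "0.8.1.8": ["1.0.3"],
    "0.8.1.7": ["1.0.3"],
    "0.8.1.6": ["1.0.3"],
    "0.8.1.5": ["1.0.3"],
    "0.8.1.4": ["1.0.3"],
    "0.8.1.3": ["1.0.3"],
    "0.8.1.2": ["1.0.3", "0.20.205"],
    "0.8.1.1": ["1.0.3", "0.20.205"],
    "0.8.1": ["1.0.3", "0.20.205"],
    "0.7.1.4": ["0.20.205"],
    "0.7.1.3": ["0.20.205"],
    "0.7.1.2": ["0.20.205"],
    "0.7.1.1": ["0.20.205"],
    "0.7.1": ["0.20.205", "0.20", "0.18"],
    "0.7": ["0.20", "0.18"],
    "0.5": ["0.20", "0.18"],
    "0.4": ["0.20", "0.18"],
}


def compatible_hadoop_versions(hive_version):
    """Return the Hadoop versions listed for hive_version, or [] if unknown."""
    return list(HIVE_HADOOP_COMPAT.get(hive_version, []))


def hive_versions_for_hadoop(hadoop_version):
    """Return the Hive versions whose row lists hadoop_version, in table order."""
    return [
        hive
        for hive, hadoops in HIVE_HADOOP_COMPAT.items()
        if hadoop_version in hadoops
    ]


if __name__ == "__main__":
    print(compatible_hadoop_versions("0.7.1"))
    print(hive_versions_for_hadoop("2.2.0"))
```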

To specify the Hive version when creating the cluster

  • Use the --hive-versions option. The following command-line example creates an interactive Hive cluster running Hadoop 0.20 and Hive 0.7.1.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions 0.7.1
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1

    The --hive-versions option must come after any reference to the options --hive-interactive, --hive-script, or --hive-site.
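The ordering rule above can be checked before a command is run. The sketch below takes one reading of the rule, namely that every --hive-versions token must be preceded somewhere by --hive-interactive, --hive-script, or --hive-site. The function name hive_versions_well_placed is hypothetical and is not part of the Amazon EMR CLI.

```python
# Options that --hive-versions is expected to follow.
HIVE_OPTIONS = {"--hive-interactive", "--hive-script", "--hive-site"}


def hive_versions_well_placed(argv):
    """Return True if no --hive-versions appears before the first Hive option."""
    seen_hive_option = False
    for token in argv:
        # Tolerate an "--option=value" spelling by comparing the option name only.
        name = token.split("=", 1)[0]
        if name in HIVE_OPTIONS:
            seen_hive_option = True
        elif name == "--hive-versions" and not seen_hive_option:
            return False
    return True


if __name__ == "__main__":
    good = ["--create", "--alive", "--hive-interactive", "--hive-versions", "0.7.1"]
    bad = ["--create", "--alive", "--hive-versions", "0.7.1", "--hive-interactive"]
    print(hive_versions_well_placed(good), hive_versions_well_placed(bad))
```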

To specify the latest Hive version when creating the cluster

  • Use the --hive-versions option with the latest keyword. The following command-line example creates an interactive Hive cluster running the latest version of Hive.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions latest 
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions latest 

To specify the Hive version for a cluster that is interactive and uses a Hive script

  • If you have a cluster that uses Hive both interactively and from a script, you must set the Hive version for each type of use. The following command-line example sets both the interactive and the script version of Hive to 0.7.1.2.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --debug --log-uri s3://myawsbucket/perftest/logs/ \
      --name "Testing m1.large AMI 1" \
      --ami-version latest \
      --instance-type m1.large --num-instances 5 \
      --hive-interactive  --hive-versions 0.7.1.2 \
      --hive-script s3://myawsbucket/perftest/hive-script.hql --hive-versions 0.7.1.2 
    • Windows users:

      ruby elastic-mapreduce --create --debug --log-uri s3://myawsbucket/perftest/logs/ --name "Testing m1.large AMI 1" --ami-version latest --instance-type m1.large --num-instances 5 --hive-interactive --hive-versions 0.7.1.2 --hive-script s3://myawsbucket/perftest/hive-script.hql --hive-versions 0.7.1.2
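The pairing shown in the commands above, one --hive-versions for each type of Hive use, can be generated rather than typed by hand. A minimal sketch follows, assuming the same version is wanted for both uses. The helper name hive_option_args is hypothetical, and the function only assembles arguments without invoking the CLI.

```python
def hive_option_args(version, interactive=False, script_uri=None):
    """Build the Hive-related CLI arguments, repeating --hive-versions per use."""
    args = []
    if interactive:
        # Version for interactive use follows --hive-interactive.
        args += ["--hive-interactive", "--hive-versions", version]
    if script_uri is not None:
        # Version for script use follows --hive-script and its URI.
        args += ["--hive-script", script_uri, "--hive-versions", version]
    return args


if __name__ == "__main__":
    print(" ".join(hive_option_args(
        "0.7.1.2",
        interactive=True,
        script_uri="s3://myawsbucket/perftest/hive-script.hql",
    )))
```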

To load multiple versions of Hive for a given cluster

  • Use the --hive-versions option and separate the version numbers by comma. The following command-line example creates an interactive cluster running Hadoop 0.20 and multiple versions of Hive. With this configuration, you can use any of the installed versions of Hive on the cluster.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions 0.5,0.7.1
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.5,0.7.1

To call a specific version of Hive

  • Add the version number to the call. For example, hive-0.5 or hive-0.7.1.
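The two conventions above, a comma-separated value for --hive-versions and a hive-<version> command name on the cluster, are simple string rules. The sketch below expresses them as helpers. The function names are hypothetical, and the rejection of empty input is an added safeguard rather than documented CLI behavior.

```python
def hive_versions_value(versions):
    """Join version strings into the comma-separated --hive-versions value."""
    cleaned = [v.strip() for v in versions if v.strip()]
    if not cleaned:
        raise ValueError("at least one Hive version is required")
    return ",".join(cleaned)


def hive_command(version):
    """Return the versioned command name, for example hive-0.7.1."""
    return "hive-" + version.strip()


if __name__ == "__main__":
    installed = ["0.5", "0.7.1"]
    print("--hive-versions", hive_versions_value(installed))
    print([hive_command(v) for v in installed])
```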

Note

If multiple versions of Hive are loaded on a cluster, calling hive runs the default version of Hive or, if the cluster creation call specified multiple --hive-versions options, the version loaded last. When you load multiple versions by using the comma-separated syntax with --hive-versions, hive runs the default version of Hive.

Note

When running multiple versions of Hive concurrently, all versions of Hive can read the same data. They cannot, however, share metadata. Use an external metastore if you want multiple versions of Hive to read and write to the same location.

Display the Hive Version

You can use the --print-hive-version option to display the version of Hive currently in use on a given cluster. This is useful after you upgrade to a new version of Hive, to confirm that the upgrade succeeded, or when you run multiple versions of Hive and need to confirm which version is currently running. The syntax is as follows, where JobFlowID is the identifier of the cluster whose Hive version you want to check.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow JobFlowID --print-hive-version
  • Windows users:

    ruby elastic-mapreduce --jobflow JobFlowID --print-hive-version
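If the check is scripted, the two platform variants above differ only in how the CLI is launched. The following sketch assembles the argument list for either platform. The helper name print_hive_version_argv and the job flow identifier in the example are hypothetical.

```python
def print_hive_version_argv(jobflow_id, windows=False):
    """Return the argument list for the --print-hive-version call."""
    # Windows runs the CLI through ruby; other platforms run it directly.
    launcher = ["ruby", "elastic-mapreduce"] if windows else ["./elastic-mapreduce"]
    return launcher + ["--jobflow", jobflow_id, "--print-hive-version"]


if __name__ == "__main__":
    # "j-EXAMPLEID" is a placeholder, not a real cluster identifier.
    print(print_hive_version_argv("j-EXAMPLEID"))
    print(print_hive_version_argv("j-EXAMPLEID", windows=True))
```

The list could then be passed to subprocess.run with cwd set to the directory where the Amazon EMR CLI is installed.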