Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Supported Hive Versions

You can choose to run Hive in several different configurations. You set the --hadoop-version, --hive-versions, and --ami-version parameters in the job creation call as shown in the following table.

The default configuration for Amazon EMR is the latest version of Hive running on the latest AMI version.

The Amazon EMR console does not support Hive versioning and always loads the latest version of Hive.

Versions of the Amazon EMR CLI released on 9 April 2012 and later load the latest version of Hive by default. To use a version of Hive other than the latest, specify the --hive-versions parameter when you create the cluster. Versions of the Amazon EMR CLI released prior to 9 April 2012 load the default configuration of Hive.

Calls to the API will launch the default configuration of Hive, unless you specify --hive-versions as an argument to the step that loads Hive onto the cluster during the call to RunJobFlow.

Hive Version | Compatible Hadoop Versions | Hive Version Notes
0.8.1.7 | 1.0.3
  • Fixes ColumnPruner so that it works on LateralView. (HIVE-3226)

  • Fixes utc_from_timestamp and utc_to_timestamp to return correct results. (HIVE-2803)

  • Fixes a NullPointerException error on a join query with authorization enabled. (HIVE-3225)

  • Improves mapjoin filtering in the ON condition. (HIVE-2101)

  • Preserves the filter on an OUTER JOIN condition while merging the join tree. (HIVE-3070)

  • Fixes ConcurrentModificationException on a lateral view used with explode. (HIVE-2540)

  • Fixes an issue where an insert into a table overwrites the existing table, if the table name contains an uppercase character. (HIVE-3062)

  • Fixes an issue where jobs fail when there are multiple aggregates in a query. (HIVE-3732)

  • Fixes a NullPointerException error in nested user-defined aggregation functions (UDAFs). (HIVE-1399)

  • Provides an error message when using a user-defined aggregation function (UDAF) in the place of a user-defined function (UDF). (HIVE-2956)

  • Fixes an issue where Timestamp values without a nanosecond part break the following columns in a row. (HIVE-3090)

  • Fixes an issue where the move task is not picking up changes to hive.exec.max.dynamic.partitions set in the Hive CLI. (HIVE-2918)

  • Adds the ability to atomically add and drop partitions from the metastore. (HIVE-2777)

  • Adds partition pruning pushdown to the database for non-string partitions. (HIVE-2702)

  • Adds support for merging small files in Amazon S3 at the end of a map-only job using the hive.merge.mapfiles parameter. If the output path is in Amazon S3, the hive.merge.smallfiles.avgsize setting is ignored. For more information, see Hive File Merge Behavior with Amazon S3 and Hive Configuration Variables.

  • Improves clean-up of junk files after an INSERT OVERWRITE command.

0.8.1.6 | 1.0.3
0.8.1.5 | 1.0.3
  • Adds support for the new Amazon DynamoDB binary data type.

  • Adds the patch HIVE-2955, which fixes an issue where queries consisting only of metadata always return an empty value.

  • Adds the patch HIVE-1376, which fixes an issue where Hive would crash on an empty result set generated by "where false" clause queries.

  • Fixes the RCFile interaction with Amazon Simple Storage Service (Amazon S3).

  • Replaces JetS3t with the AWS SDK for Java.

  • Uses BatchWriteItem for puts to Amazon DynamoDB.

  • Adds schemaless mapping of Amazon DynamoDB tables into a Hive table using a Hive map<string, string> column.

0.8.1.4 | 1.0.3

Updates the HBase client on Hive clusters to version 0.92.0 to match the version of HBase used on HBase clusters. This fixes issues that occurred when connecting to an HBase cluster from a Hive cluster.

0.8.1.3 | 1.0.3

Adds support for Hadoop 1.0.3.

0.8.1.2 | 1.0.3, 0.20.205

Fixes an issue with duplicate data in large clusters.

0.8.1.1 | 1.0.3, 0.20.205

Adds support for MapR and HBase.

0.8.1 | 1.0.3, 0.20.205

Introduces new features and improvements. The most significant of these are as follows. For complete information about the changes in Hive 0.8.1, go to the Apache Hive 0.8.1 Release Notes.

0.7.1.4 | 0.20.205

Prevents the "SET" command in Hive from changing the current database of the current session.

0.7.1.3 | 0.20.205

Adds the dynamodb.retry.duration option, which you can use to configure the timeout duration for retrying Hive queries against tables in Amazon DynamoDB. This version of Hive also supports the dynamodb.endpoint option, which you can use to specify the Amazon DynamoDB endpoint to use for a Hive table. For more information about these options, see Hive Options.

0.7.1.2 | 0.20.205

Modifies the way files are named in Amazon S3 for dynamic partitions by prepending a unique identifier to each file name. With Hive 0.7.1.2, you can run queries in parallel by setting hive.exec.parallel=true. This version also fixes an issue with filter pushdown when accessing Amazon DynamoDB with sparse data sets.

0.7.1.1 | 0.20.205

Introduces support for accessing Amazon DynamoDB, as detailed in Export, Import, Query, and Join Tables in Amazon DynamoDB Using Amazon EMR. It is a minor version of 0.7.1 developed by the Amazon EMR team. When specified as the Hive version, Hive 0.7.1.1 overwrites the Hive 0.7.1 directory structure and configuration with its own values. Specifically, Hive 0.7.1.1 matches Apache Hive 0.7.1 and uses the Hive server port, database, and log location of 0.7.1 on the cluster.

0.7.1 | 0.20.205, 0.20, 0.18

Improves Hive query performance for a large number of partitions and for Amazon S3 queries. Changes Hive to skip commented lines.

0.7 | 0.20, 0.18

Improves Recover Partitions to use less memory, fixes the hashCode method, and introduces the ability to use the HAVING clause to filter on groups by expressions.

0.5 | 0.20, 0.18

Fixes issues with FileSinkOperator and modifies UDAFPercentile to tolerate null percentiles.

0.4 | 0.20, 0.18

Introduces the ability to write to Amazon S3, run Hive scripts from Amazon S3, and recover partitions from table data stored in Amazon S3. Also creates a separate namespace for managing Hive variables.

For additional details about the changes in a version of Hive, go to Supported Hive Versions. For information about Hive patches and functionality developed by the Amazon EMR team, go to Additional Features of Hive in Amazon EMR.
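The version table above can also be kept as a small lookup so that a Hive and Hadoop pairing is checked before a cluster is requested. The following is a minimal sketch, not part of the Amazon EMR CLI: the function names compatible_hadoop_versions and is_compatible are hypothetical, the data is copied from the table, and the exact string comparison is an assumption (the table does not say whether 0.20 and 0.20.205 can be used interchangeably).

```shell
#!/bin/sh
# Hypothetical helper (not part of the Amazon EMR CLI): print the Hadoop
# versions that the table lists as compatible with a given Hive version.
compatible_hadoop_versions() {
  case "$1" in
    0.8.1.3|0.8.1.4|0.8.1.5|0.8.1.6|0.8.1.7) echo "1.0.3" ;;
    0.8.1|0.8.1.1|0.8.1.2)                   echo "1.0.3 0.20.205" ;;
    0.7.1.1|0.7.1.2|0.7.1.3|0.7.1.4)         echo "0.20.205" ;;
    0.7.1)                                   echo "0.20.205 0.20 0.18" ;;
    0.4|0.5|0.7)                             echo "0.20 0.18" ;;
    *) echo "unknown Hive version: $1" >&2; return 1 ;;
  esac
}

# Succeeds only if the Hive/Hadoop pair appears in the table.
# Assumption: versions are compared as exact strings.
is_compatible() {
  hive_version=$1
  hadoop_version=$2
  for v in $(compatible_hadoop_versions "$hive_version" 2>/dev/null); do
    [ "$v" = "$hadoop_version" ] && return 0
  done
  return 1
}
```

For example, is_compatible 0.7.1 0.20 succeeds, while is_compatible 0.5 1.0.3 fails because the table lists only Hadoop 0.20 and 0.18 for Hive 0.5.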

To specify the Hive version when creating the cluster

  • Use the --hive-versions parameter. The following command-line example creates an interactive Hive cluster running Hadoop 0.20 and Hive 0.7.1.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --hadoop-version 0.20 \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions 0.7.1
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --hadoop-version 0.20 --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1

    The --hive-versions parameter must come after any reference to the parameters --hive-interactive, --hive-script, or --hive-site.
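The ordering rule above is easy to break when a command line is assembled by a script. As a rough illustration, the following sketch checks an argument list before it is passed to the CLI. The function check_hive_versions_order is hypothetical and encodes one reading of the rule: a --hive-versions parameter is rejected if no --hive-interactive, --hive-script, or --hive-site parameter has appeared before it.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of the Amazon EMR CLI).
# Fails if --hive-versions appears before any of --hive-interactive,
# --hive-script, or --hive-site has been seen in the argument list.
check_hive_versions_order() {
  seen_hive_param=no
  for arg in "$@"; do
    case "$arg" in
      --hive-interactive|--hive-script|--hive-site)
        seen_hive_param=yes ;;
      --hive-versions)
        if [ "$seen_hive_param" = no ]; then
          echo "--hive-versions must follow --hive-interactive, --hive-script, or --hive-site" >&2
          return 1
        fi ;;
    esac
  done
  return 0
}
```

Called with the arguments from the example above (--create --alive ... --hive-interactive --hive-versions 0.7.1), the check succeeds. It fails if --hive-versions 0.7.1 is moved in front of --hive-interactive.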

To specify the latest Hive version when creating the cluster

  • Use the --hive-versions parameter with the latest keyword. The following command-line example creates an interactive Hive cluster running the latest version of Hive.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --hadoop-version 0.20 \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions latest 
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --hadoop-version 0.20 --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions latest 

To specify the Hive version for a cluster that is interactive and uses a Hive script

  • If you have a cluster that uses Hive both interactively and from a script, you must set the Hive version for each type of use. The following command-line example sets both the interactive and the script version of Hive to 0.7.1.2.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --debug --log-uri s3://myawsbucket/perftest/logs/ \
      --name "Testing m1.large AMI 1" \
      --ami-version latest --hadoop-version 0.20 \
      --instance-type m1.large --num-instances 5 \
      --hive-interactive  --hive-versions 0.7.1.2 \
      --hive-script s3://myawsbucket/perftest/hive-script.hql --hive-versions 0.7.1.2 
    • Windows users:

      ruby elastic-mapreduce --create --debug --log-uri s3://myawsbucket/perftest/logs/ --name "Testing m1.large AMI 1" --ami-version latest --hadoop-version 0.20 --instance-type m1.large --num-instances 5 --hive-interactive --hive-versions 0.7.1.2 --hive-script s3://myawsbucket/perftest/hive-script.hql --hive-versions 0.7.1.2
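Because the version has to be repeated for each type of use, it can help to generate the Hive portion of the command from a single version value. The sketch below is one way to do that. The function hive_interactive_and_script_args is hypothetical, it uses only the parameters shown in the example above, and it assumes that neither the version nor the script URI contains whitespace.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Amazon EMR CLI): print the Hive
# parameters for a cluster that uses Hive interactively and from a script,
# repeating --hive-versions after each of the two Hive parameters.
# Assumption: neither argument contains whitespace.
hive_interactive_and_script_args() {
  if [ "$#" -ne 2 ]; then
    echo "usage: hive_interactive_and_script_args HIVE_VERSION SCRIPT_URI" >&2
    return 1
  fi
  hive_version=$1
  script_uri=$2
  echo "--hive-interactive --hive-versions $hive_version --hive-script $script_uri --hive-versions $hive_version"
}
```

The output can be appended to the rest of the cluster creation command, so that the interactive and script versions cannot drift apart.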

To load multiple versions of Hive for a given cluster

  • Use the --hive-versions parameter and separate the version numbers with commas. The following command-line example creates an interactive cluster running Hadoop 0.20 and multiple versions of Hive. With this configuration, you can use any of the installed versions of Hive on the cluster.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --hadoop-version 0.20 \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions 0.5,0.7.1
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --hadoop-version 0.20 --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.5,0.7.1

To call a specific version of Hive

  • Add the version number to the call. For example, hive-0.5 or hive-0.7.1.

Note

If multiple versions of Hive are loaded on a cluster, calling hive runs the default version of Hive or, if the cluster creation call specified multiple --hive-versions parameters, the version that was loaded last. When you use the comma-separated syntax with --hive-versions to load multiple versions, hive runs the default version of Hive.
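To avoid relying on which version the bare hive command resolves to, a script can derive the versioned command names from the same comma-separated value that was passed to --hive-versions. The following sketch follows the hive-0.5 and hive-0.7.1 naming shown above. The function hive_commands is hypothetical, and it rejects the latest keyword because this guide does not say which command name that keyword maps to.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Amazon EMR CLI): turn a
# comma-separated --hive-versions value into versioned command names,
# one per line, using the hive-<version> pattern.
hive_commands() {
  old_ifs=$IFS
  IFS=,
  status=0
  for version in $1; do
    if [ "$version" = latest ]; then
      # The guide does not document a command name for "latest".
      echo "cannot derive a command name from 'latest'" >&2
      status=1
    else
      echo "hive-$version"
    fi
  done
  IFS=$old_ifs
  return $status
}
```

For the value 0.5,0.7.1 used in the multiple-version example, the function prints hive-0.5 and hive-0.7.1 on separate lines.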

Note

When running multiple versions of Hive concurrently, all versions of Hive can read the same data. They cannot, however, share metadata. Use an external metastore if you want multiple versions of Hive to read and write to the same location.

Display the Hive Version

You can use the --print-hive-version parameter to display the version of Hive currently in use on a given cluster. This is useful after you upgrade to a new version of Hive, to confirm that the upgrade succeeded, or when you are using multiple versions of Hive and need to confirm which version is currently running. The syntax is as follows, where JobFlowID is the identifier of the cluster whose Hive version you want to check.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow JobFlowID --print-hive-version
  • Windows users:

    ruby elastic-mapreduce --jobflow JobFlowID --print-hive-version