Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Supported Hive Versions

The default configuration for Amazon EMR is the latest version of Hive running on the latest AMI version. The following versions of Hive are available:

Hive Version | Compatible Hadoop Versions | Hive Version Notes
0.13.1

2.4.0

Introduces the following features, improvements, and backwards incompatibilities. For more information, see Apache Hive 0.13.1 Release Notes and Apache Hive 0.13.0 Release Notes.

  • Vectorized query: processes thousand-row blocks instead of processing by row.

  • In-memory cache: hot data kept in-memory for quick reads.

  • Faster plan serialization

  • Support for DECIMAL and CHAR datatypes

  • Sub-query for IN, NOT IN, EXISTS and NOT EXISTS (correlated and uncorrelated)

  • JOIN conditions in the WHERE clause

Other features contributed by Amazon EMR:

  • Includes an optimization to Hive windowing functions that allows them to scale to large data sets.

Notable backward incompatibilities:

  • Does not support the -el flag for pushing error logs to an Amazon S3 bucket when a query fails.

  • Does not support RECOVER PARTITION syntax. Instead use the native capability, MSCK REPAIR.

  • Windowed expressions of the form round(sum(c),2) over w1 must be rewritten as round(sum(c) over w1,2). This syntax was changed in Hive 0.12. See HIVE-4214.

  • The default precision and scale for DECIMAL changed. Unlike previous Hive versions, a DECIMAL declared without precision and scale in Hive 0.13 is DECIMAL(10,0).

  • The default SerDe for RCFile-backed tables is LazyBinaryColumnarSerDe in Apache Hive 0.12 and above. As a result, tables created with Hive 0.12 or later cannot correctly read data files generated with Hive 0.11 unless hive.default.rcfile.serde is set to org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe. See HIVE-4475.
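
The incompatibilities above amount to a few mechanical changes when you move existing scripts to Hive 0.13.1. The following is a minimal sketch rather than an official migration script. The table and column names (sales, amount, region, order_ts) and the DECIMAL(18,2) precision are hypothetical, and the function only prints HiveQL so that you can review it before running it.

```shell
#!/bin/sh
# Prints an illustrative HiveQL migration script. All table and column
# names are hypothetical placeholders; adjust them to your own schema.
hive13_migration_hql() {
  cat <<'EOF'
-- Needed for RCFile data generated with Hive 0.11 (HIVE-4475).
SET hive.default.rcfile.serde=org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe;

-- RECOVER PARTITIONS is not supported; use the native MSCK REPAIR instead.
MSCK REPAIR TABLE sales;

-- DECIMAL without arguments now means DECIMAL(10,0); state precision and
-- scale explicitly (18,2 is an assumed example, not a recommendation).
ALTER TABLE sales CHANGE amount amount DECIMAL(18,2);

-- Apply ROUND to the windowed aggregate rather than inside it (HIVE-4214).
SELECT region,
       ROUND(SUM(amount) OVER w1, 2) AS running_total
FROM sales
WINDOW w1 AS (PARTITION BY region ORDER BY order_ts);
EOF
}

hive13_migration_hql
```

Redirecting the output to a file and passing that file to hive -f is one way to apply the statements after review.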

Other notes and known issues:

  • When a Hive database is created with a custom location, the CREATE TABLE AS SELECT (CTAS) operation ignores it. It takes the location from parameter hive.metastore.warehouse.dir instead of the database's properties. See HIVE-3486.

  • When a user loads data into a table using OVERWRITE with a different file, the existing data is not overwritten. See HIVE-6209.

  • Since Amazon EMR uses HiveServer2, the username must be hadoop with no password.
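
As an illustration of the HiveServer2 note above, the following sketch builds a Beeline command line that connects as the hadoop user without a password. It assumes that Beeline is available on the master node and that HiveServer2 listens on localhost port 10000, which is the Apache default and may differ on your cluster. The function name is hypothetical.

```shell
#!/bin/sh
# Builds a Beeline command line for HiveServer2.
# Assumptions: HiveServer2 listens on localhost:10000 unless a host and
# port are passed as the first and second arguments.
beeline_command() {
  host="${1:-localhost}"
  port="${2:-10000}"
  # -n supplies the user name; no -p option is passed because the
  # hadoop user has no password.
  printf 'beeline -u jdbc:hive2://%s:%s -n hadoop\n' "$host" "$port"
}

beeline_command
```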

The following Hive 0.14.0 patches were backported to this release:

0.11.0.2

1.0.3

2.2.0

Introduces the following features and improvements. For more information, see Apache Hive 0.11.0 Release Notes.

  • Adds the Parquet library.

  • Fixes a problem related to the Avro serializer/deserializer accepting a schema URL in Amazon S3.

  • Fixes a problem with Hive returning incorrect results with indexing turned on.

  • Changes Hive's log level from DEBUG to INFO.

  • Fixes a problem when tasks do not report progress while deleting files in Amazon S3 dynamic partitions.

  • This Hive version fixes the following issues:

0.11.0.1

1.0.3

2.2.0

  • Creates symlink /home/hadoop/hive/lib/hive_contrib.jar for backward compatibility.

  • Fixes a problem that prevents installation of Hive 0.11.0 with IAM roles.

0.11.0

1.0.3

2.2.0

Introduces the following features and improvements. For more information, see Apache Hive 0.11.0 Release Notes.

  • Simplifies hive.metastore.uris and the hive.metastore.local configuration settings. (HIVE-2585)

  • Changes the internal representation of binary type to byte[]. (HIVE-3246)

  • Allows HiveStorageHandler.configureTableJobProperties() to signal to its handler whether the configuration is input or output. (HIVE-2773)

  • Adds environment context to metastore Thrift calls. (HIVE-3252)

  • Adds a new, optimized row columnar file format. (HIVE-3874)

  • Implements TRUNCATE. (HIVE-446)

  • Adds LEAD/LAG/FIRST/LAST analytical windowing functions. (HIVE-896)

  • Adds DECIMAL data type. (HIVE-2693)

  • Supports Hive list bucketing/DML. (HIVE-3073)

  • Supports custom separator for file output. (HIVE-3682)

  • Supports ALTER VIEW AS SELECT. (HIVE-3834)

  • Adds method to retrieve uncompressed/compressed sizes of columns from RC files. (HIVE-3897)

  • Allows updating bucketing/sorting metadata of a partition through the CLI. (HIVE-3903)

  • Allows PARTITION BY/ORDER BY in OVER clause and partition function. (HIVE-4048)

  • Improves GROUP BY syntax. (HIVE-581)

  • Adds more query plan optimization rules. (HIVE-948)

  • Allows CREATE TABLE LIKE command to accept TBLPROPERTIES. (HIVE-3527)

  • Fixes sort-merge join with sub-queries. (HIVE-3633)

  • Supports altering partition column type. (HIVE-3672)

  • De-emphasizes mapjoin hint. (HIVE-3784)

  • Changes object inspectors to initialize based on partition metadata. (HIVE-3833)

  • Adds merge map-job followed by map-reduce job. (HIVE-3952)

  • Optimizes hive.enforce.bucketing and hive.enforce.sorting insert. (HIVE-4240)

0.8.1.8

1.0.3

0.8.1.7

1.0.3

  • Fixes ColumnPruner so that it works on LateralView. (HIVE-3226)

  • Fixes utc_from_timestamp and utc_to_timestamp to return correct results. (HIVE-2803)

  • Fixes a NullPointerException error on a join query with authorization enabled. (HIVE-3225)

  • Improves mapjoin filtering in the ON condition. (HIVE-2101)

  • Preserves the filter on an OUTER JOIN condition while merging the join tree. (HIVE-3070)

  • Fixes ConcurrentModificationException on a lateral view used with explode. (HIVE-2540)

  • Fixes an issue where an insert into a table overwrites the existing table, if the table name contains an uppercase character. (HIVE-3062)

  • Fixes an issue where jobs fail when there are multiple aggregates in a query. (HIVE-3732)

  • Fixes a NullPointerException error in nested user-defined aggregation functions (UDAFs). (HIVE-1399)

  • Provides an error message when using a user-defined aggregation function (UDAF) in place of a user-defined function (UDF). (HIVE-2956)

  • Fixes an issue where Timestamp values without a nano-second part break the following columns in a row. (HIVE-3090)

  • Fixes an issue where the move task is not picking up changes to hive.exec.max.dynamic.partitions set in the Hive CLI. (HIVE-2918)

  • Adds the ability to atomically add and drop partitions from the metastore. (HIVE-2777)

  • Adds partition pruning pushdown to the database for non-string partitions. (HIVE-2702)

  • Adds support for merging small files in Amazon S3 at the end of a map-only job using the hive.merge.mapfiles parameter. If the output path is in Amazon S3, the hive.merge.smallfiles.avgsize setting is ignored. For more information, see Hive File Merge Behavior with Amazon S3 and Hive Configuration Variables.

  • Improves clean-up of junk files after an INSERT OVERWRITE command.

0.8.1.6

1.0.3

0.8.1.5

1.0.3

  • Adds support for the new DynamoDB binary data type.

  • Adds the patch HIVE-2955, which fixes an issue where queries consisting only of metadata always return an empty value.

  • Adds the patch HIVE-1376, which fixes an issue where Hive would crash on an empty result set generated by "where false" clause queries.

  • Fixes the RCFile interaction with Amazon Simple Storage Service (Amazon S3).

  • Replaces JetS3t with the AWS SDK for Java.

  • Uses BatchWriteItem for puts to DynamoDB.

  • Adds schemaless mapping of DynamoDB tables into a Hive table using a Hive map<string, string> column.

0.8.1.4

1.0.3

Updates the HBase client on Hive clusters to version 0.92.0 to match the version of HBase used on HBase clusters. This fixes issues that occurred when connecting to an HBase cluster from a Hive cluster.

0.8.1.3

1.0.3

Adds support for Hadoop 1.0.3.

0.8.1.2

1.0.3, 0.20.205

Fixes an issue with duplicate data in large clusters.

0.8.1.1

1.0.3, 0.20.205

Adds support for MapR and HBase.

0.8.1

1.0.3, 0.20.205

Introduces new features and improvements. The most significant of these are as follows. For more information about the changes in Hive 0.8.1, go to Apache Hive 0.8.1 Release Notes.

0.7.1.4

0.20.205

Prevents the "SET" command in Hive from changing the current database of the current session.

0.7.1.3

0.20.205

Adds the dynamodb.retry.duration option, which you can use to configure the timeout duration for retrying Hive queries against tables in Amazon DynamoDB. This version of Hive also supports the dynamodb.endpoint option, which you can use to specify the Amazon DynamoDB endpoint to use for a Hive table. For more information about these options, see Hive Options.

0.7.1.2

0.20.205

Modifies the way files are named in Amazon S3 for dynamic partitions by prepending a unique identifier to the file names. With Hive 0.7.1.2, you can run queries in parallel by using set hive.exec.parallel=true. It also fixes an issue with filter pushdown when accessing DynamoDB with sparse data sets.

0.7.1.1

0.20.205

Introduces support for accessing DynamoDB, as detailed in Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR. It is a minor version of 0.7.1 developed by the Amazon EMR team. When specified as the Hive version, Hive 0.7.1.1 overwrites the Hive 0.7.1 directory structure and configuration with its own values. Specifically, Hive 0.7.1.1 matches Apache Hive 0.7.1 and uses the Hive server port, database, and log location of 0.7.1 on the cluster.

0.7.1

0.20.205, 0.20, 0.18

Improves Hive query performance for a large number of partitions and for Amazon S3 queries. Changes Hive to skip commented lines.

0.7

0.20, 0.18

Improves Recover Partitions to use less memory, fixes the hashCode method, and introduces the ability to use the HAVING clause to filter on groups by expressions.

0.5

0.20, 0.18

Fixes issues with FileSinkOperator and modifies UDAFPercentile to tolerate null percentiles.

0.4

0.20, 0.18

Introduces the ability to write to Amazon S3, run Hive scripts from Amazon S3, and recover partitions from table data stored in Amazon S3. Also creates a separate namespace for managing Hive variables.

For more information about the changes in a version of Hive, see Supported Hive Versions. For information about Hive patches and functionality developed by the Amazon EMR team, see Additional Features of Hive in Amazon EMR.

With the Amazon EMR CLI, you can install a particular version of Hive using the --hive-versions option, or you can install the latest version.

The AWS CLI does not support installing specific Hive versions. When using the AWS CLI, the latest version of Hive included on the AMI is installed by default.
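
For comparison, the following sketch prints an AWS CLI command that launches a cluster with Hive installed. It passes no Hive version because, as noted above, the AWS CLI installs the version included on the AMI. The AMI version (3.3.1), the key pair name (myKey), and the function name are placeholders chosen for illustration, and the command is printed rather than run so that you can adjust it first.

```shell
#!/bin/sh
# Prints (does not run) an AWS CLI command that launches a cluster with Hive.
# Assumptions: the AMI version and key pair name are supplied by the caller;
# the instance type and count mirror the Amazon EMR CLI examples in this guide.
aws_hive_cluster_command() {
  ami_version="$1"
  key_name="$2"
  printf 'aws emr create-cluster --name "Test Hive" --ami-version %s --applications Name=Hive --use-default-roles --ec2-attributes KeyName=%s --instance-type m1.large --instance-count 5\n' \
    "$ami_version" "$key_name"
}

aws_hive_cluster_command 3.3.1 myKey
```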

To specify the Hive version when creating the cluster using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • Use the --hive-versions option. The --hive-versions option must come after any reference to the options --hive-interactive, --hive-script, or --hive-site.

    The following command-line example creates an interactive Hive cluster running Hadoop 0.20 and Hive 0.7.1. In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions 0.7.1
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1
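
The ordering rule for --hive-versions is easy to get wrong in longer command lines. The following sketch checks an argument list before you pass it to the Amazon EMR CLI. It implements one reading of the rule, namely that no --hive-interactive, --hive-script, or --hive-site option may appear after the last --hive-versions option. The function name is hypothetical and is not part of the CLI.

```shell
#!/bin/sh
# Returns 0 when no --hive-interactive, --hive-script, or --hive-site option
# appears after the last --hive-versions option in the argument list.
hive_versions_order_ok() {
  seen_versions=0
  misplaced=0
  for arg in "$@"; do
    case "$arg" in
      --hive-versions)
        seen_versions=1
        misplaced=0
        ;;
      --hive-interactive|--hive-script|--hive-site)
        # A Hive option that follows --hive-versions is flagged until a
        # later --hive-versions option appears.
        if [ "$seen_versions" -eq 1 ]; then
          misplaced=1
        fi
        ;;
    esac
  done
  [ "$misplaced" -eq 0 ]
}

if hive_versions_order_ok --create --alive --hive-interactive --hive-versions 0.7.1; then
  echo "option order looks valid"
fi
```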

To specify the latest Hive version when creating the cluster using the Amazon EMR CLI

  • Use the --hive-versions option with the latest keyword. The following command-line example creates an interactive Hive cluster running the latest version of Hive.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions latest 
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions latest 

To specify the Hive version using the Amazon EMR CLI for a cluster that is interactive and uses a Hive script

  • If you have a cluster that uses Hive both interactively and from a script, you must set the Hive version for each type of use. The following command-line example sets both the interactive and the script version of Hive to 0.7.1.2.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --debug --log-uri s3://mybucket/logs/ \
      --name "Testing m1.large AMI 1" \
      --ami-version latest \
      --instance-type m1.large --num-instances 5 \
      --hive-interactive  --hive-versions 0.7.1.2 \
      --hive-script s3://mybucket/hive-script.hql --hive-versions 0.7.1.2
    • Windows users:

      ruby elastic-mapreduce --create --debug --log-uri s3://mybucket/logs/ --name "Testing m1.large AMI" --ami-version latest --instance-type m1.large --num-instances 5 --hive-interactive  --hive-versions 0.7.1.2 --hive-script s3://mybucket/hive-script.hql --hive-versions 0.7.1.2 

To load multiple versions of Hive for a cluster using the Amazon EMR CLI

  • Use the --hive-versions option and separate the version numbers with commas. The following command-line example creates an interactive cluster running Hadoop 0.20 and multiple versions of Hive. With this configuration, you can use any of the installed versions of Hive on the cluster.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Test Hive" \
      --num-instances 5 --instance-type m1.large \
      --hive-interactive \
      --hive-versions 0.5,0.7.1
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Test Hive" --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.5,0.7.1

To call a specific version of Hive using the Amazon EMR CLI

  • Add the version number to the call. For example, hive-0.5 or hive-0.7.1.
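
The following sketch prints the commands that would run the same query with each installed version. It assumes the cluster was created with --hive-versions 0.5,0.7.1, that the versioned launchers follow the hive-version naming shown above, and that each launcher accepts the -e option of the Hive command line for an inline query. The function name and the query are illustrative only.

```shell
#!/bin/sh
# Prints one versioned Hive invocation per version argument.
# The first argument is the query; the remaining arguments are versions.
print_versioned_queries() {
  query="$1"
  shift
  for version in "$@"; do
    printf 'hive-%s -e "%s"\n' "$version" "$query"
  done
}

print_versioned_queries 'SHOW TABLES;' 0.5 0.7.1
```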

Note

If multiple versions of Hive are loaded on a cluster, calling hive runs the default version of Hive or, if multiple --hive-versions options were specified in the cluster creation call, the version loaded last. When the comma-separated syntax is used with --hive-versions to load multiple versions, hive runs the default version of Hive.

Note

When running multiple versions of Hive concurrently, all versions of Hive can read the same data. They cannot, however, share metadata. Use an external metastore if you want multiple versions of Hive to read and write to the same location.

Display the Hive Version

You can view the version of Hive installed on your cluster using the console or the Amazon EMR CLI. In the console, the Hive version is displayed on the Cluster Details page. In the Configuration Details column, the Applications field displays the Hive version.

Using the Amazon EMR CLI, use the --print-hive-version option to display the version of Hive currently in use for a given cluster. This is useful after you upgrade to a new version of Hive to confirm that the upgrade succeeded, or when you are using multiple versions of Hive and need to confirm which version is currently running. In the syntax that follows, JobFlowID is the identifier of the cluster.

To display the Hive version using the CLI, in the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR.

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --jobflow JobFlowID --print-hive-version
  • Windows users:

    ruby elastic-mapreduce --jobflow JobFlowID --print-hive-version
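
If you manage several clusters, you can generate the same command for each one. The following sketch prints one --print-hive-version command per cluster identifier. The identifiers j-EXAMPLE1 and j-EXAMPLE2 and the function name are hypothetical. To run the commands instead of printing them, remove echo and run the script from the directory where you installed the Amazon EMR CLI.

```shell
#!/bin/sh
# Prints one --print-hive-version command per cluster identifier.
print_hive_version_commands() {
  for jobflow_id in "$@"; do
    echo ./elastic-mapreduce --jobflow "$jobflow_id" --print-hive-version
  done
}

print_hive_version_commands j-EXAMPLE1 j-EXAMPLE2
```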