Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Additional Features of Hive in Amazon EMR

We have extended Hive with new features that integrate Hive with Amazon Web Services (AWS), such as the ability to read and write from Amazon Simple Storage Service (Amazon S3). For information about which versions of Hive support these additional features, see Hive Patches.

Write Data Directly to Amazon S3

The Hadoop Distributed File System (HDFS) and Amazon S3 are handled differently within Amazon EMR and Hive. The version of Hive installed with Amazon EMR is extended with the ability to write directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement, but it means that HDFS and Amazon S3 behave differently within Hive.

A consequence of Hive writing directly to Amazon S3 is that you cannot read and write to the same table within the same Hive statement if that table is located in Amazon S3. The following example shows how to use multiple Hive statements to update a table in Amazon S3.

To update a table in Amazon S3 using Hive

  1. From a Hive prompt or script, create a temporary table in the cluster's local HDFS filesystem.

  2. Write the results of a Hive query to the temporary table.

  3. Copy the contents of the temporary table to Amazon S3. This is shown in the following example.

    create table tmp like my_s3_table ;
    insert overwrite table tmp select ... ;
    insert overwrite table my_s3_table select * from tmp ;
                    

Use Hive to Access Resources in Amazon S3

The version of Hive installed in Amazon EMR enables you to reference resources, such as JAR files, located in Amazon S3.

add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar 
         

You can also reference scripts located in Amazon S3 to execute custom map and reduce operations. This is shown in the following example.

from logs select transform (line) 
using 's3://mybucket/scripts/parse-logs.pl' as 
(time string, exception_type string, exception_details string)
         

The ability to initialize Hive from a file stored in Amazon S3 was introduced with Hive 0.8.1. Versions of Hive prior to 0.8.1 do not support initializing Hive from Amazon S3. For example, in the following Hive command, -i s3n://myawsbucket/hive_init.sql succeeds if run with Hive 0.8.1 or later, and fails if run with an earlier version of Hive.

hive -i s3n://myawsbucket/hive_init.sql -f s3n://myawsbucket/hive_example.sql 
        

Use Hive to Recover Partitions

We added a statement to the Hive query language that recovers the partitions of a table from table data located in Amazon S3. The following example shows this.

create external table raw_impression (json string)
partitioned by (dt string)
location 's3://elasticmapreduce/samples/hive-ads/tables/impressions' ;
alter table raw_impression recover partitions ;

The partition directories and data must be at the location specified in the table definition and must be named according to the Hive convention: for example, dt=2009-01-01.
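
For example, given the table definition above, the data files for the dt=2009-01-01 partition would reside under a path like the following (a hypothetical listing):

s3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-01-01/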

Variables in Hive

You can include variables in your scripts by using the dollar sign and curly braces.

add jar ${LIB}/jsonserde.jar
        

You pass the values of these variables to Hive on the command line using the -d parameter, as the following example shows.

-d LIB=s3://elasticmapreduce/samples/hive-ads/lib
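
For example, you can supply the variable and run a script stored in Amazon S3 in a single command. The following is a minimal sketch; the bucket and script name are hypothetical:

hive -d LIB=s3://elasticmapreduce/samples/hive-ads/lib -f s3://mybucket/script.q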
        

You can also pass the values into steps that execute Hive scripts.

To pass variable values into steps with the CLI

  • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --hive-script --arg s3://mybucket/script.q \
      --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib
    • Windows users:

      ruby elastic-mapreduce --hive-script --arg s3://mybucket/script.q --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib

To pass variable values into steps with the SDK

  • The following example demonstrates how to pass variables into steps using the SDK. For more information, see Class StepFactory in the AWS SDK for Java API Reference.

    import com.amazonaws.services.elasticmapreduce.model.StepConfig;
    import com.amazonaws.services.elasticmapreduce.util.StepFactory;

    StepFactory stepFactory = new StepFactory();

    StepConfig runHive = new StepConfig()
      .withName("Run Hive Script")
      .withActionOnFailure("TERMINATE_JOB_FLOW")
      .withHadoopJarStep(stepFactory.newRunHiveScriptStep("s3://mybucket/script.q",
          "-d", "LIB=s3://elasticmapreduce/samples/hive-ads/lib"));

Make JDBC Connections in Hive

When you start an interactive Hive session using either the Amazon EMR console or the CLI, Hive is installed on the cluster and a Hive server starts on the master node. The Hive server accepts JDBC connections from the Hive JDBC driver on port 10000.

To establish a connection from a remote machine

  1. Start an SSH tunnel.

    ssh -i my_private_key.pem hadoop@<master_node> -N -L 1234:localhost:10000 
                    

    Replace <master_node> with the DNS name of the master node of your cluster. Alternatively, you can establish an SSH tunnel using Java secure channel (JSch).

  2. Connect to the Hive server using the JDBC connection string.

    jdbc:hive://localhost:1234/default
                    

    Alternatively, you can connect from a machine running in Amazon EC2 that is either in the ElasticMapReduce-master or ElasticMapReduce-slave security group.

Persist Hive Schema

By default, Hive keeps its schema information on the master node, and that information ceases to exist when the cluster terminates. You can use a hive-site.xml configuration file to override the default location of the metadata store with a location that persists. In the following example, the default location is replaced by a MySQL instance that is already running in Amazon EC2.

To override the default location of the metadata store

  1. Create a Hive site configuration file and store it in Amazon S3 so that it can override the location of the metadata store.

      <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://ec2-72-44-33-189.compute-1.amazonaws.com:3306/hive?user=user12&amp;password=abababa7&amp;create=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
      </property>
      
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
      </property>
      </configuration>
            

    In this example, the Hive site configuration file is in the following Amazon S3 location:

    s3://mybucket/conf/hive-site.xml
            

  2. In the directory where you installed the Amazon EMR CLI, use the --hive-site option to install the configuration file on the cluster by running the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --jobflow $JOBFLOW \
      --hive-site=s3://mybucket/conf/hive-site.xml
    • Windows users:

      ruby elastic-mapreduce --jobflow $JOBFLOW --hive-site=s3://mybucket/conf/hive-site.xml

Amazon EMR Hive Steps

The Amazon EMR CLI provides a convenient way to access Hive steps. You can also access Hive steps from programs that call directly into the Amazon EMR web service.

To access Hive steps from the CLI

  • Run the --describe command on a cluster that has Hive installed, and look at the step that installed Hive. The output is similar to the following.

      "StepConfig": {
        "ActionOnFailure": "TERMINATE_JOB_FLOW",
        "Name": "Setup Hive",
        "HadoopJarStep": {
           "MainClass": null,
           "Jar": "s3:\/\/elasticmapreduce\/libs\/script-runner\/script-runner.jar",
           "Args": [
             "s3:\/\/elasticmapreduce\/libs\/hive\/0.4\/install-hive"
            ],
           "Properties": []
        }
      }
            

In this example, you can see that a custom JAR called script-runner executes a script called install-hive, which resides in Amazon S3.

Notice that the install scripts are region-specific. If you launch a cluster in eu-west-1, for example, you should reference the install script in the bucket eu-west-1.elasticmapreduce rather than the bucket us-east-1.elasticmapreduce.

Amazon EMR Hive queries to accommodate partial DynamoDB schemas

Amazon EMR Hive provides maximum flexibility when querying DynamoDB tables by allowing you to specify a subset of columns on which you can filter data, rather than requiring your query to include all columns. This partial schema query technique is effective when you have a sparse database schema and want to filter records based on a few columns, such as filtering on time stamps.

The following example shows how to use a Hive query to:

  • Create a DynamoDB table.

  • Select a subset of items (rows) in DynamoDB and further narrow the data to certain columns.

  • Copy the resulting data to Amazon S3.

DROP TABLE dynamodb; 
DROP TABLE s3;

CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, fullColumn map<string, string>)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
    TBLPROPERTIES ( 
     "dynamodb.table.name" = "myTable",
     "dynamodb.throughput.read.percent" = "0.1", 
     "dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey"); 

CREATE EXTERNAL TABLE s3(fullColumn map<string, string>)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
     LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3 SELECT fullColumn FROM dynamodb WHERE recordTimeStamp < "2012-01-01";

The following table shows the query syntax for selecting any combination of items from DynamoDB.

SELECT * FROM table_name;
    Selects all items (rows) from a given table and includes data from all columns available for those items.

SELECT * FROM table_name WHERE field_name = value;
    Selects some items (rows) from a given table and includes data from all columns available for those items.

SELECT column1_name, column2_name, column3_name FROM table_name;
    Selects all items (rows) from a given table and includes data from some columns available for those items.

SELECT column1_name, column2_name, column3_name FROM table_name WHERE field_name = value;
    Selects some items (rows) from a given table and includes data from some columns available for those items.

Copy data between DynamoDB tables in different AWS regions

Amazon EMR Hive provides a dynamodb.region property that you can set per DynamoDB table. When dynamodb.region is set differently on two tables, any data you copy between the tables is automatically transferred between the specified regions.

The following example shows you how to create a DynamoDB table with a Hive script that sets the dynamodb.region property:

Note

Per-table region properties override the global Hive properties. For more information, see Hive Options.

CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, fullColumn map<string, string>)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
    TBLPROPERTIES ( 
     "dynamodb.table.name" = "myTable",
     "dynamodb.region" = "eu-west-1", 
     "dynamodb.throughput.read.percent" = "0.1", 
     "dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");

Set DynamoDB throughput values per table

Amazon EMR Hive enables you to set the DynamoDB readThroughputPercent and writeThroughputPercent settings on a per table basis in the table definition. The following Amazon EMR Hive script shows how to set the throughput values. For more information about DynamoDB throughput values, see Specifying Read and Write Requirements for Tables.

Note

These per-table throughput properties override the global Hive properties, as described in Hive Options.

CREATE EXTERNAL TABLE dynamodb(hashKey STRING, recordTimeStamp BIGINT, fullColumn map<string, string>)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
    TBLPROPERTIES ( 
     "dynamodb.table.name" = "myTable",
     "dynamodb.throughput.read.percent" = ".4",
     "dynamodb.throughput.write.percent" = "1.0",
     "dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");

Hive Patches

The Amazon EMR team has created the following patches for Hive.

Write to Amazon S3

Supports moving data between different file systems, such as HDFS and Amazon S3. Adds support for file systems (such as Amazon S3) that do not provide a “move” operation. Removes redundant operations like moving data to and from the same location.

Status: Submitted

Fixed in AWS Hive Version: 0.4

Fixed in Apache Hive Version: n/a (HIVE-2318)

Scripts in Amazon S3

Enables Hive to download and run Hive scripts located in Amazon S3 buckets. Saves you the step of copying scripts to HDFS before running them.

Status: Committed

Fixed in AWS Hive Version: 0.4

Fixed in Apache Hive Version: 0.7.0 (HIVE-1624)

Recover partitions

Allows you to recover partitions from table data located in Amazon S3 and Hive table data in HDFS.

Status: Not Submitted

Fixed in AWS Hive Version: 0.4

Fixed in Apache Hive Version: n/a

Variables in Hive

Creates a separate namespace (aside from HiveConf) for managing Hive variables. Adds support for setting variables on the command line using either '-define x=y' or 'set hivevar:x=y'. Adds support for referencing variables in statements using '${var_name}'. Provides a means for differentiating between hiveconf, hivevar, system, and environment properties in the output of 'set -v'.
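
For example, with this patch applied you can set a variable in the hivevar namespace and reference it later in the session; a minimal sketch reusing the LIB variable from the earlier example:

set hivevar:LIB=s3://elasticmapreduce/samples/hive-ads/lib;
add jar ${LIB}/jsonserde.jar;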

Status: Committed

Fixed in AWS Hive Version: 0.4

Fixed in Apache Hive Version: 0.8.0 (HIVE-2020)

Report progress while writing to Amazon S3

FileSinkOperator reports progress to Hadoop while writing large files, so that the task is not killed.

Status: Not Submitted

Fixed in AWS Hive Version: 0.5

Fixed in Apache Hive Version: n/a

Fix compression arguments

Corrects an issue where compression values were not set correctly in FileSinkOperator, which resulted in uncompressed files.

Status: Submitted

Fixed in AWS Hive Version: 0.5

Fixed in Apache Hive Version: n/a (HIVE-2266)

Fix UDAFPercentile to tolerate null percentiles

Fixes an issue where UDAFPercentile would throw a null pointer exception when passed a null percentile list.

Status: Committed

Fixed in AWS Hive Version: 0.5

Fixed in Apache Hive Version: 0.8.0 (HIVE-2298)

Fix hashCode method in DoubleWritable class

Fixes the hashCode() method of Hive's DoubleWritable class, preventing a HashMap keyed on DoubleWritable from degenerating into a linked list.

Status: Committed

Fixed in AWS Hive Version: 0.7

Fixed in Apache Hive Version: 0.7.0 (HIVE-1629)

Recover partitions, version 2

Improved version of Recover Partitions that uses less memory.

Status: Not Submitted

Fixed in AWS Hive Version: 0.7

Fixed in Apache Hive Version: n/a

HAVING clause

Use the HAVING clause to filter directly on group-by expressions (instead of using nested queries). Integrates Hive with other data analysis tools that rely on the HAVING expression.
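
For example, the following sketch filters aggregated groups directly, without a nested query; it reuses the raw_impression table from the earlier example, and the threshold is arbitrary:

select dt, count(*) as impressions
from raw_impression
group by dt
having count(*) > 1000 ;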

Status: Committed

Fixed in AWS Hive Version: 0.7

Fixed in Apache Hive Version: 0.7.0 (HIVE-1790)

Improve Hive query performance

Reduces startup time for queries spanning a large number of partitions.

Status: Committed

Fixed in AWS Hive Version: 0.7.1

Fixed in Apache Hive Version: 0.8.0 (HIVE-2299)

Improve Hive query performance for Amazon S3 queries

Reduces startup time for Amazon S3 queries. Set hive.optimize.s3.query=true to enable the optimization.

The optimization flag assumes that the partitions are stored in the standard Hive partitioning format: “HIVE_TABLE_ROOT/partition1=value1/partition2=value2”. This is the format Hive uses to create partitions when you do not specify a custom location.

All partitions in an optimized query must share HIVE_TABLE_ROOT as their common prefix.
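
For example, a session might enable the flag before querying a table partitioned in the standard format; a minimal sketch reusing the raw_impression table defined earlier:

set hive.optimize.s3.query=true;
select count(*) from raw_impression where dt = '2009-01-01';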

Status: Not Submitted

Fixed in AWS Hive Version: 0.7.1

Fixed in Apache Hive Version: n/a

Skip comments in Hive scripts

Fixes an issue where Hive scripts would fail on a comment line; now Hive scripts skip commented lines.

Status: Committed

Fixed in AWS Hive Version: 0.7.1

Fixed in Apache Hive Version: 0.8.0 (HIVE-2259)

Limit Recover Partitions

Improves the performance of recovering partitions from Amazon S3 when there are many partitions to recover.

Status: Not Submitted

Fixed in AWS Hive Version: 0.8.1

Fixed in Apache Hive Version: n/a