Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Additional Features of Hive in Amazon EMR

Amazon EMR extends Hive with new features that support Hive integration with other AWS services, such as the ability to read from and write to Amazon Simple Storage Service (Amazon S3) and DynamoDB. For information about which versions of Hive support these additional features, see Hive Patches.

Write Data Directly to Amazon S3

The Hadoop Distributed File System (HDFS) and Amazon S3 are handled differently within Amazon EMR and Hive. The version of Hive installed with Amazon EMR is extended with the ability to write directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement, but it means that HDFS and Amazon S3 behave differently within Hive.

A consequence of Hive writing directly to Amazon S3 is that you cannot read and write to the same table within the same Hive statement if that table is located in Amazon S3. The following example shows how to use multiple Hive statements to update a table in Amazon S3.

To update a table in Amazon S3 using Hive

  1. From a Hive prompt or script, create a temporary table in the cluster's local HDFS filesystem.

  2. Write the results of a Hive query to the temporary table.

  3. Copy the contents of the temporary table to Amazon S3. This is shown in the following example.

    create temporary table tmp like my_s3_table ;
    insert overwrite table tmp select .... ;
    insert overwrite table my_s3_table select * from tmp ;

Use Hive to Access Resources in Amazon S3

The version of Hive installed in Amazon EMR enables you to reference resources, such as JAR files, located in Amazon S3.

add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar ;

You can also reference scripts located in Amazon S3 to execute custom map and reduce operations. This is shown in the following example.

from logs select transform (line)
using 's3://mybucket/scripts/parse-logs.pl' as
(time string, exception_type string, exception_details string) ;

The ability to initialize Hive from a file stored in Amazon S3 was introduced with Hive 0.8.1; earlier versions of Hive do not support it. For example, in the following Hive command, -i s3n://myawsbucket/hive_init.sql succeeds if run with Hive 0.8.1 or later, and fails if run with an earlier version of Hive.

hive -i s3n://myawsbucket/hive_init.sql -f s3n://myawsbucket/hive_example.sql 
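
An initialization file contains Hive statements that run before your main script. The following is a minimal sketch of what a file such as hive_init.sql might contain; the jar path is taken from the samples used elsewhere in this guide, and the setting shown is only an illustrative example:

-- Statements run before the main script (illustrative contents)
add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
set hive.cli.print.header=true;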
        

Use Hive to Recover Partitions

We added a statement to the Hive query language that recovers the partitions of a table from table data located in Amazon S3. The following example shows this.

create external table raw_impression (json string)
partitioned by (dt string)
location 's3://elasticmapreduce/samples/hive-ads/tables/impressions'
;
alter table raw_impression recover partitions ;

The partition directories and data must be at the location specified in the table definition and must be named according to the Hive convention: for example, dt=2009-01-01.

Note

As of Hive 0.13.1, this capability is supported natively using msck repair table. For more information, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL.
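
For example, on a Hive version that supports it natively, the partitions of the table created in the previous example could be recovered with the equivalent statement:

-- Partitions stored as .../impressions/dt=2009-01-01/, .../impressions/dt=2009-01-02/, and so on
msck repair table raw_impression;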

Variables in Hive

You can include variables in your scripts by using the dollar sign and curly braces.

add jar ${LIB}/jsonserde.jar ;
        

You pass the values of these variables to Hive on the command line using the -d parameter, as in the following example:

-d LIB=s3://elasticmapreduce/samples/hive-ads/lib
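
Putting these together, the following minimal sketch shows a script that references ${LIB} and the command line that supplies its value; the script name my_script.q and the final query are illustrative assumptions:

-- my_script.q (illustrative); invoke with:
--   hive -d LIB=s3://elasticmapreduce/samples/hive-ads/lib -f my_script.q
add jar ${LIB}/jsonserde.jar;
show tables;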
        

You can also pass the values into steps that execute Hive scripts.

To pass variable values into Hive steps using the console

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Steps section, for Add Step, choose Hive Program from the list and click Configure and add.

  4. In the Add Step dialog, specify the parameters using the following table as a guide, and then click Add.

    Script S3 location*

    Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName. For example: s3://elasticmapreduce/samples/hive-ads/libs/response-time-stats.q.

    Input S3 location

    Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form BucketName/path/. If specified, this value is passed to the Hive script as a parameter named INPUT. For example: s3://elasticmapreduce/samples/hive-ads/tables/.

    Output S3 location

    Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. If specified, this value is passed to the Hive script as a parameter named OUTPUT. For example: s3://mybucket/hive-ads/output/.

    Arguments

    Optionally, enter a list of arguments (space-separated strings) to pass to Hive. If you defined a path variable in your Hive script named ${SAMPLE}, for example:

    create external table logs (requestBeginTime string, requestEndTime string, hostname string)
    partitioned by (dt string)
    row format serde 'com.amazon.elasticmapreduce.JsonSerde'
    with serdeproperties ( 'paths'='requestBeginTime, requestEndTime, hostname' )
    location '${SAMPLE}/tables/impressions';

    To pass a value for the variable, type the following in the Arguments window:

    -d SAMPLE=s3://elasticmapreduce/samples/hive-ads/

    Action on Failure

    This determines what the cluster does in response to any errors. The possible values for this setting are:

    • Terminate cluster: If the step fails, terminate the cluster. If the cluster has termination protection enabled AND keep alive enabled, it will not terminate.

    • Cancel and wait: If the step fails, cancel the remaining steps. If the cluster has keep alive enabled, the cluster will not terminate.

    • Continue: If the step fails, continue to the next step.

  5. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster.

To pass variable values into Hive steps using the AWS CLI

  • Type the following command to pass a variable into a Hive step using the AWS CLI:

    aws emr create-cluster --applications Name=string --ami-version string \
    --instance-groups InstanceGroupType=string,InstanceCount=integer,InstanceType=string InstanceGroupType=string,InstanceCount=integer,InstanceType=string \
    --steps Type=string,Name="string",ActionOnFailure=string,Args=[-f,pathtoscript,-d,INPUT=pathtoinputdata,-d,OUTPUT=pathtooutputdata,-d,variable=string] \
    --no-auto-terminate

    For example:

    aws emr create-cluster --applications Name=Hive --ami-version 3.2.1 \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
    --steps Type=Hive,Name="Hive Program",ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/response-time-stats.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/,-d,SAMPLE=s3://elasticmapreduce/samples/hive-ads/] \
    --no-auto-terminate

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To pass variable values into Hive steps using the Amazon EMR CLI

To pass a Hive variable value into a step using the Amazon EMR CLI, type the --args parameter with the -d flag. In this example, you pass a value for the ${LIB} variable into the step.

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --hive-script --arg s3://mybucket/script.q \
      --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib
    • Windows users:

      ruby elastic-mapreduce --hive-script --arg s3://mybucket/script.q --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib

To pass variable values into Hive steps using the SDK

  • The following example demonstrates how to pass variables into steps using the SDK. For more information, see Class StepFactory in the AWS SDK for Java API Reference.

    StepFactory stepFactory = new StepFactory();

    StepConfig runHive = new StepConfig()
        .withName("Run Hive Script")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newRunHiveScriptStep("s3://mybucket/script.q",
            "-d", "LIB=s3://elasticmapreduce/samples/hive-ads/lib"));
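
    To run the step, you would then attach runHive to a new or existing cluster, for example by submitting it in an AddJobFlowStepsRequest with the Amazon EMR client; that submission code is not shown here.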
    

Amazon EMR Hive queries to accommodate partial DynamoDB schemas

Amazon EMR Hive gives you flexibility when querying DynamoDB tables: you can specify a subset of columns on which to filter data, rather than including all columns in your query. This partial schema query technique is effective when you have a sparse database schema and want to filter records on a few columns, such as time stamps.

The following example shows how to use a Hive query to:

  • Create a DynamoDB table.

  • Select a subset of items (rows) in DynamoDB and further narrow the data to certain columns.

  • Copy the resulting data to Amazon S3.

DROP TABLE dynamodb;
DROP TABLE s3;

CREATE EXTERNAL TABLE dynamodb (hashKey STRING, recordTimeStamp BIGINT, fullColumn map<string, string>)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
     "dynamodb.table.name" = "myTable",
     "dynamodb.throughput.read.percent" = ".1000",
     "dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");

CREATE EXTERNAL TABLE s3 (fullColumn map<string, string>)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3 SELECT fullColumn FROM dynamodb WHERE recordTimeStamp < "2012-01-01";

The following table shows the query syntax for selecting any combination of items from DynamoDB.

SELECT * FROM table_name;

    Selects all items (rows) from a given table and includes data from all columns available for those items.

SELECT * FROM table_name WHERE field_name = value;

    Selects some items (rows) from a given table and includes data from all columns available for those items.

SELECT column1_name, column2_name, column3_name FROM table_name;

    Selects all items (rows) from a given table and includes data from some columns available for those items.

SELECT column1_name, column2_name, column3_name FROM table_name WHERE field_name = value;

    Selects some items (rows) from a given table and includes data from some columns available for those items.
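
For instance, using the dynamodb table defined in the previous example, a partial schema query that filters on the range key and returns only two of the columns might look like the following (the date cutoff is illustrative):

SELECT hashKey, recordTimeStamp
FROM dynamodb
WHERE recordTimeStamp < "2012-01-01";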

Copy data between DynamoDB tables in different AWS regions

Amazon EMR Hive provides a dynamodb.region property that you can set per DynamoDB table. When dynamodb.region is set differently on two tables, copying data between those tables automatically occurs across the specified regions.

The following example shows you how to create a DynamoDB table with a Hive script that sets the dynamodb.region property:

Note

Per-table region properties override the global Hive properties. For more information, see Hive Options.

CREATE EXTERNAL TABLE dynamodb (hashKey STRING, recordTimeStamp BIGINT, fullColumn map<string, string>)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
     "dynamodb.table.name" = "myTable",
     "dynamodb.region" = "eu-west-1",
     "dynamodb.throughput.read.percent" = ".1000",
     "dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");

Set DynamoDB throughput values per table

Amazon EMR Hive enables you to set the DynamoDB readThroughputPercent and writeThroughputPercent settings on a per table basis in the table definition. The following Amazon EMR Hive script shows how to set the throughput values. For more information about DynamoDB throughput values, see Specifying Read and Write Requirements for Tables.

Note

These per-table throughput properties override the global Hive properties. For more information, see Hive Options.

CREATE EXTERNAL TABLE dynamodb (hashKey STRING, recordTimeStamp BIGINT, fullColumn map<string, string>)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
     "dynamodb.table.name" = "myTable",
     "dynamodb.throughput.read.percent" = ".4",
     "dynamodb.throughput.write.percent" = "1.0",
     "dynamodb.column.mapping" = "hashKey:HashKey,recordTimeStamp:RangeKey");

Hive Patches

The Amazon EMR team has created the following patches for Hive.

Write to Amazon S3

Supports moving data between different file systems, such as HDFS and Amazon S3. Adds support for file systems (such as Amazon S3) that do not provide a “move” operation. Removes redundant operations like moving data to and from the same location.

Status: Submitted

Fixed in AWS Hive Version: 0.4

Fixed in Apache Hive Version: n/a (HIVE-2318)

Scripts in Amazon S3

Enables Hive to download Hive scripts from Amazon S3 buckets and run them. Saves you the step of copying scripts to HDFS before running them.

Status: Committed

Fixed in AWS Hive Version: 0.4

Fixed in Apache Hive Version: 0.7.0 (HIVE-1624)

Recover partitions

Allows you to recover partitions from table data located in Amazon S3 and Hive table data in HDFS.

Status: Not Submitted

Fixed in AWS Hive Version: 0.4

Fixed in Apache Hive Version: n/a

Variables in Hive

Create a separate namespace (aside from HiveConf) for managing Hive variables. Adds support for setting variables on the command line using either '-define x=y' or 'set hivevar:x=y'. Adds support for referencing variables in statements using '${var_name}'. Provides a means for differentiating between hiveconf, hivevar, system, and environment properties in the output of 'set -v'.

Status: Committed

Fixed in AWS Hive Version: 0.4

Fixed in Apache Hive Version: 0.8.0 (HIVE-2020)

Report progress while writing to Amazon S3

FileSinkOperator reports progress to Hadoop while writing large files, so that the task is not killed.

Status: Not Submitted

Fixed in AWS Hive Version: 0.5

Fixed in Apache Hive Version: n/a

Fix compression arguments

Corrects an issue where compression values were not set correctly in FileSinkOperator, which resulted in uncompressed files.

Status: Submitted

Fixed in AWS Hive Version: 0.5

Fixed in Apache Hive Version: n/a (HIVE-2266)

Fix UDAFPercentile to tolerate null percentiles

Fixes an issue where UDAFPercentile would throw a null pointer exception when passed a null percentile list.

Status: Committed

Fixed in AWS Hive Version: 0.5

Fixed in Apache Hive Version: 0.8.0 (HIVE-2298)

Fix hashCode method in DoubleWritable class

Fixes the hashCode() method of Hive's DoubleWritable class, preventing a HashMap of DoubleWritable keys from degenerating into a linked list.

Status: Committed

Fixed in AWS Hive Version: 0.7

Fixed in Apache Hive Version: 0.7.0 (HIVE-1629)

Recover partitions, version 2

Improved version of Recover Partitions that uses less memory.

Status: Not Submitted

Fixed in AWS Hive Version: 0.7

Fixed in Apache Hive Version: n/a

HAVING clause

Use the HAVING clause to filter directly on group-by results (instead of using nested queries). Integrates Hive with other data analysis tools that rely on the HAVING expression.

Status: Committed

Fixed in AWS Hive Version: 0.7

Fixed in Apache Hive Version: 0.7.0 (HIVE-1790)

Improve Hive query performance

Reduces startup time for queries spanning a large number of partitions.

Status: Committed

Fixed in AWS Hive Version: 0.7.1

Fixed in Apache Hive Version: 0.8.0 (HIVE-2299)

Improve Hive query performance for Amazon S3 queries

Reduces startup time for Amazon S3 queries. Set hive.optimize.s3.query=true to enable the optimization.

The optimization flag assumes that the partitions are stored in the standard Hive partitioning format: "HIVE_TABLE_ROOT/partition1=value1/partition2=value2". This is the format Hive uses to create partitions when you do not specify a custom location.

The partitions in an optimized query must share HIVE_TABLE_ROOT as a common prefix.

Status: Not Submitted

Fixed in AWS Hive Version: 0.7.1

Fixed in Apache Hive Version: n/a

Skip comments in Hive scripts

Fixes an issue where Hive scripts would fail on a comment line; now Hive scripts skip commented lines.

Status: Committed

Fixed in AWS Hive Version: 0.7.1

Fixed in Apache Hive Version: 0.8.0 (HIVE-2259)

Limit Recover Partitions

Improves the performance of recovering partitions from Amazon S3 when there are many partitions to recover.

Status: Not Submitted

Fixed in AWS Hive Version: 0.8.1

Fixed in Apache Hive Version: n/a