Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Step 4: Set Up a Hive Table to Run Hive Commands

Apache Hive is a data warehouse application you can use to query data contained in Amazon EMR clusters using a SQL-like language. Because you launched the cluster as a Hive application, Amazon EMR installs Hive on the EC2 instances it launches to process the cluster. For more information about Hive, go to http://hive.apache.org/.

If you've followed the previous instructions to set up a cluster and use SSH to connect to the master node, you are ready to use Hive interactively.

To run Hive commands interactively

  1. At the Hadoop command prompt for the current master node, type hive.

    You should see a Hive prompt: hive>

  2. Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.

The following shows the syntax for mapping a Hive table to a DynamoDB table.

    CREATE EXTERNAL TABLE hive_tablename (hive_column1_name column1_datatype, hive_column2_name column2_datatype...)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
    TBLPROPERTIES ("dynamodb.table.name" = "dynamodb_tablename", 
    "dynamodb.column.mapping" = "hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name...");
                    

    When you create a table in Hive from DynamoDB, you must create it as an external table using the keyword EXTERNAL. The difference between external and internal tables is that the data in internal tables is deleted when an internal table is dropped. This is not the desired behavior when connected to Amazon DynamoDB, and thus only external tables are supported.

    For example, the following Hive command creates a table named hivetable1 in Hive that references the DynamoDB table named dynamodbtable1. The DynamoDB table dynamodbtable1 has a hash-and-range primary key schema. The hash key element is name (string type), the range key element is year (numeric type), and each item has an attribute value for holidays (string set type).

    CREATE EXTERNAL TABLE hivetable1 (col1 string, col2 bigint, col3 array<string>)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
    TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1", 
    "dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays"); 
                    

    Line 1 uses the HiveQL CREATE EXTERNAL TABLE statement. For hivetable1, you need to establish a column for each attribute name-value pair in the DynamoDB table, and provide the data type. These values are not case-sensitive, and you can give the columns any name (except reserved words).

    Line 2 uses the STORED BY statement. The value of STORED BY is the name of the class that handles the connection between Hive and DynamoDB. It should be set to 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'.

    Line 3 uses the TBLPROPERTIES statement to associate "hivetable1" with the correct table and schema in DynamoDB. Provide TBLPROPERTIES with values for the dynamodb.table.name parameter and dynamodb.column.mapping parameter. These values are case-sensitive.

    Note

    All DynamoDB attribute names for the table must have corresponding columns in the Hive table. Otherwise, the Hive table won't contain the name-value pair from DynamoDB. If you do not map the DynamoDB primary key attributes, Hive generates an error. If you do not map a non-primary key attribute, no error is generated, but you won't see the data in the Hive table. If the data types do not match, the value is null.
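
    For example, the following is a minimal sketch (hivetable2 is a hypothetical name) that maps only the primary key attributes of dynamodbtable1. The table is valid, but the unmapped holidays attribute will not appear in Hive.

    CREATE EXTERNAL TABLE hivetable2 (col1 string, col2 bigint)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
    "dynamodb.column.mapping" = "col1:name,col2:year");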

Then you can start running Hive operations on hivetable1. Queries run against hivetable1 are internally run against the DynamoDB table dynamodbtable1 of your DynamoDB account, consuming read or write units with each execution.
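
For example, the following query is a minimal sketch based on the hivetable1 definition above; the year value is hypothetical. It scans dynamodbtable1 and consumes read capacity units while it runs.

SELECT col1, col3 FROM hivetable1
WHERE col2 = 2007;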

When you run Hive queries against a DynamoDB table, you need to ensure that you have provisioned sufficient read capacity units.

For example, suppose that you have provisioned 100 read capacity units for your DynamoDB table. This lets you perform 100 reads of up to 4 KB each, or 409,600 bytes, per second. If that table contains 20 GB of data (21,474,836,480 bytes) and your Hive query performs a full table scan, you can estimate how long the query will take to run:

21,474,836,480 / 409,600 = 52,429 seconds = 14.56 hours

The only way to decrease the time required would be to adjust the read capacity units on the source DynamoDB table. Adding more Amazon EMR nodes will not help.

In the Hive output, the completion percentage is updated when one or more mapper processes are finished. For a large DynamoDB table with a low provisioned read capacity setting, the completion percentage output might not be updated for a long time; in the case above, the job will appear to be 0% complete for several hours. For more detailed status on your job's progress, go to the Amazon EMR console, where you can view individual mapper task status and data read statistics.

You can also log on to the Hadoop interface on the master node and see the Hadoop statistics. This shows you the individual map task status and some data read statistics. For more information, see the following topics:

For more information about sample HiveQL statements to perform tasks such as exporting or importing data from DynamoDB and joining tables, see Hive Command Examples for Exporting, Importing, and Querying Data in Amazon DynamoDB in the Amazon EMR Developer Guide.
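
As a brief illustration of the export case covered in that topic, the following sketch writes the contents of hivetable1 to Amazon S3. The bucket name and path are hypothetical.

INSERT OVERWRITE DIRECTORY 's3://mybucket/hive-export/'
SELECT * FROM hivetable1;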

You can also create a file that contains a series of commands, launch a cluster, and reference that file to perform the operations. For more information, see Interactive and Batch Modes in the Amazon EMR Developer Guide.
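
For example, a command file might contain the same statements you would otherwise type at the Hive prompt. The following is a minimal sketch; the file contents are illustrative, and the table definition is repeated because a newly launched cluster starts without it.

-- hypothetical contents of a Hive script file
CREATE EXTERNAL TABLE hivetable1 (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");
SELECT col1, col3 FROM hivetable1;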

To cancel a Hive request

When you execute a Hive query, the initial response from the server includes the command to cancel the request. To cancel the request at any time in the process, use the Kill Command from the server response.

  1. Enter Ctrl+C to exit the command line client.

  2. At the shell prompt, enter the Kill Command from the initial server response to your request.

    Alternatively, you can run the following command from the command line of the master node to kill the Hadoop job, where job-id is the identifier of the Hadoop job and can be retrieved from the Hadoop user interface. For more information about the Hadoop user interface, see How to Use the Hadoop User Interface in the Amazon EMR Developer Guide.

    hadoop job -kill job-id

Data Types for Hive and DynamoDB

The following table shows the available Hive data types and how they map to the corresponding DynamoDB data types.

Hive type           DynamoDB type
------------------  ----------------------------------------------------
string              string (S)
bigint or double    number (N)
binary              binary (B)
array               number set (NS), string set (SS), or binary set (BS)

The bigint type in Hive is the same as the Java long type, and the Hive double type is the same as the Java double type in terms of precision. This means that if you have numeric data stored in DynamoDB that has precision higher than is available in the Hive data types, using Hive to export, import, or reference the DynamoDB data could lead to a loss in precision or a failure of the Hive query.

Exports of the binary type from DynamoDB to Amazon Simple Storage Service (Amazon S3) or HDFS are stored as a Base64-encoded string. If you are importing data from Amazon S3 or HDFS into the DynamoDB binary type, it should be encoded as a Base64 string.

Hive Options

You can set the following Hive options to manage the transfer of data out of Amazon DynamoDB. These options persist only for the current Hive session. If you close the Hive command prompt and reopen it later on the cluster, the settings return to their default values.

dynamodb.throughput.read.percent

Set the rate of read operations to keep your DynamoDB provisioned throughput rate in the allocated range for your table. The value must be between 0.1 and 1.5, inclusive.

The value of 0.5 is the default read rate, which means that Hive attempts to consume half of the provisioned read throughput of the table. Increasing this value above 0.5 increases the read request rate. Decreasing it below 0.5 decreases the read request rate. This read rate is approximate. The actual read rate depends on factors such as whether there is a uniform distribution of keys in DynamoDB.

If you find your provisioned throughput is frequently exceeded by the Hive operation, or if live read traffic is being throttled too much, then reduce this value below 0.5. If you have enough capacity and want a faster Hive operation, set this value above 0.5. You can also oversubscribe by setting it up to 1.5 if you believe there are unused input/output operations available.

dynamodb.throughput.write.percent

Set the rate of write operations to keep your DynamoDB provisioned throughput rate in the allocated range for your table. The value must be between 0.1 and 1.5, inclusive.

The value of 0.5 is the default write rate, which means that Hive attempts to consume half of the provisioned write throughput of the table. Increasing this value above 0.5 increases the write request rate. Decreasing it below 0.5 decreases the write request rate. This write rate is approximate. The actual write rate depends on factors such as whether there is a uniform distribution of keys in DynamoDB.

If you find your provisioned throughput is frequently exceeded by the Hive operation, or if live write traffic is being throttled too much, then reduce this value below 0.5. If you have enough capacity and want a faster Hive operation, set this value above 0.5. You can also oversubscribe by setting it up to 1.5 if you believe there are unused input/output operations available or this is the initial data upload to the table and there is no live traffic yet.

dynamodb.endpoint

Specify the endpoint in case you have tables in different regions. For more information about the available DynamoDB endpoints, see Regions and Endpoints.

dynamodb.max.map.tasks

Specify the maximum number of map tasks when reading data from DynamoDB. This value must be equal to or greater than 1.

dynamodb.retry.duration

Specify the number of minutes to use as the timeout duration for retrying Hive commands. This value must be an integer equal to or greater than 0. The default timeout duration is two minutes.

These options are set using the SET command as shown in the following example.

SET dynamodb.throughput.read.percent=1.0;

INSERT OVERWRITE TABLE s3_export SELECT *
FROM hiveTableName;
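
Similarly, you can raise the write rate before a large import job. The following is a minimal sketch; the staging table s3_import is hypothetical and stands in for an external table over data in Amazon S3.

SET dynamodb.throughput.write.percent=1.5;

INSERT OVERWRITE TABLE hiveTableName SELECT *
FROM s3_import;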

If you are using the AWS SDK for Java, you can use the -e option of Hive to pass in the command directly, as shown in the last line of the following example.

import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

// Add a cluster step that invokes the Hive script runner and passes
// the SET command directly using Hive's -e option.
steps.add(new StepConfig()
    .withName("Run Hive Script")
    .withHadoopJarStep(new HadoopJarStepConfig()
        .withJar("s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar")
        .withArgs("s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
            "--base-path", "s3://us-east-1.elasticmapreduce/libs/hive/", "--run-hive-script",
            "--args", "-e", "SET dynamodb.throughput.read.percent=1.0;")));