| « PreviousNext » | |
![]() ![]() ![]() | Did this page help you? Yes | No | Tell us about it... |
Apache Hive is a data warehouse application you can use to query data contained in Amazon EMR clusters using a SQL-like language. Because we launched the cluster as a Hive application, Amazon EMR installs Hive on the EC2 instances it launches to process the cluster. For more information about Hive, go to http://hive.apache.org/.
If you've followed the previous instructions to set up a cluster and use SSH to connect to the master node, you are ready to use Hive interactively.
To run Hive commands interactively
At the hadoop command prompt for the current master node, type
hive.
You should see a hive prompt: hive>
Enter a Hive command that maps a table in the Hive application to the data in Amazon DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in Amazon DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
The following shows the syntax for mapping a Hive table to an Amazon DynamoDB table.
CREATE EXTERNAL TABLEhive_tablename(hive_column1_namecolumn1_datatype,hive_column2_namecolumn2_datatype...) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ("dynamodb.table.name" ="dynamodb_tablename", "dynamodb.column.mapping" = "hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name...");
When you create a table in Hive from Amazon DynamoDB, you must create it as an
external table using the keyword EXTERNAL. The difference between
external and internal tables is that the data in internal tables is deleted when an
internal table is dropped. This is not the desired behavior when connected to Amazon
DynamoDB, and thus only external tables are supported.
For example, the following Hive command creates a table named "hivetable1" in Hive that references the Amazon DynamoDB table named "dynamodbtable1". The Amazon DynamoDB table "dynamodbtable1" has a hash-and-range primary key schema. The hash key element is "name" (string type), the range key element is "year" (numeric type), and each item has an attribute value for "holidays" (string set type).
CREATE EXTERNAL TABLEhiveTableName(col1 string, col2 bigint, col3 array<string>) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1", "dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");
Line 1 uses the HiveQL CREATE EXTERNAL TABLE statement. For
"hivetable1", you need to establish a column for each attribute name-value pair in
the Amazon DynamoDB table, and provide the data type. These values are
not case-sensitive, and you can give the columns any name (except
reserved words).
Line 2 uses the STORED BY statement. The value of STORED
BY is the name of the class that handles the connection between Hive and
Amazon DynamoDB. It should be set to
'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'.
Line 3 uses the TBLPROPERTIES statement to associate "hivetable1"
with the correct table and schema in Amazon DynamoDB. Provide
TBLPROPERTIES with values for the
dynamodb.table.name parameter and
dynamodb.column.mapping parameter. These values
are case-sensitive.
Note
All Amazon DynamoDB attribute names for the table must have corresponding columns in the Hive table. Otherwise, the Hive table won't contain the name-value pair from Amazon DynamoDB. If you do not map the Amazon DynamoDB primary key attributes, Hive generates an error. If you do not map a non-primary key attribute, no error is generated, but you won't see the data in the Hive table. If the data types do not match, the value is null.
Then you can start running Hive operations on "hivetable1". Queries run against hivetable1" are internally run against the Amazon DynamoDB table "dynamodbtable1" of your Amazon DynamoDB account, consuming read or write units with each execution.
When you run Hive queries against an Amazon DynamoDB table, you need to ensure that you have provisioned a sufficient amount of read capacity units.
For example, suppose that you have provisioned 100 units of Read Capacity for your DynamoDB table. This will let you perform 100 reads, or 102,400 bytes, per second. If that table contains 20GB of data (21,474,836,480 bytes), and your Hive query performs a full table scan, you can estimate how long the query will take to run:
21,474,836,480 / 102,400 = 209,715 seconds = 58.25 hours
The only way to decrease the time required would be to adjust the read capacity units on the source DynamoDB table. Adding more Amazon EMR nodes will not help.
In the Hive output, the completion percentage is updated when one or more mapper processes are finished. For a large DynamoDB table with a low provisioned Read Capacity setting, the completion percentage output might not be updated for a long time; in the case above, the job will appear to be 0% complete for several hours. For more detailed status on your job's progress, go to the Amazon EMR console; you will be able to view the individual mapper task status, and statistics for data reads.
You can also log on to Hadoop interface on the master node and see the Hadoop statistics. This will show you the individual map task status and some data read statistics. For more information, see the following topics:
For more information about sample HiveQL statements to perform tasks such as exporting or importing data from Amazon DynamoDB and joining tables, see Hive Command Examples for Exporting, Importing, and Querying Data in Amazon DynamoDB.
You can also create a file that contains a series of commands, launch a cluster, and reference that file to perform the operations. For more information, see Interactive and Batch Modes.
To cancel a Hive request
When you execute a Hive query, the initial response from the server includes the command to cancel the request. To cancel the request at any time in the process, use the Kill Command from the server response.
Enter Ctrl+C to exit the command line client.
At the shell prompt, enter the Kill Command from the initial server response to your request.
Alternatively, you can run the following command from the command line of the master node to kill the Hadoop job, where
job-id is the identifier of the Hadoop job and can be retrieved from the Hadoop user interface.
For more information about the Hadoop user interface, see
How to Use the Hadoop User Interface.
hadoop job -kill job-id
The following table shows the available Hive data types and how they map to the corresponding Amazon DynamoDB data types.
| Hive type | Amazon DynamoDB type |
|---|---|
| string | string (S) |
| bigint or double | number (N) |
| binary | binary (B) |
| array | number set (NS), string set (SS), or binary set (BS) |
The bigint type in Hive is the same as the Java long type, and the Hive double type is the same as the Java double type in terms of precision. This means that if you have numeric data stored in Amazon DynamoDB that has precision higher than is available in the Hive datatypes, using Hive to export, import, or reference the Amazon DynamoDB data could lead to a loss in precision or a failure of the Hive query.
Exports of the binary type from Amazon DynamoDB to Amazon Simple Storage Service (Amazon S3) or HDFS are stored as a Base64-encoded string. If you are importing data from Amazon S3 or HDFS into the Amazon DynamoDB binary type, it should be encoded as a Base64 string.
You can set the following Hive options to manage the transfer of data out of Amazon DynamoDB. These options only persist for the current Hive session. If you close the Hive command prompt and reopen it later on the job flow, these settings will have returned to command prompt and reopen it later on the cluster, these settings will have returned to command prompt and reopen it later on the cluster, these settings will have returned to the default values.
| Hive Options | Description |
|---|---|
dynamodb.throughput.read.percent |
Set the rate of read operations to keep your Amazon DynamoDB
provisioned throughput rate in the allocated range for your table.
The value is between The value of 0.5 is the default read rate, which means that Hive will attempt to consume half of the read provisioned throughout resources in the table. Increasing this value above 0.5 increases the read request rate. Decreasing it below 0.5 decreases the read request rate. This read rate is approximate. The actual read rate will depend on factors such as whether there is a uniform distribution of keys in Amazon DynamoDB. If you find your provisioned throughput is frequently exceeded by
the Hive operation, or if live read traffic is being throttled too
much, then reduce this value below |
dynamodb.throughput.write.percent |
Set the rate of write operations to keep your Amazon DynamoDB
provisioned throughput rate in the allocated range for your table.
The value is between The value of 0.5 is the default write rate, which means that Hive will attempt to consume half of the write provisioned throughout resources in the table. Increasing this value above 0.5 increases the write request rate. Decreasing it below 0.5 decreases the write request rate. This write rate is approximate. The actual write rate will depend on factors such as whether there is a uniform distribution of keys in Amazon DynamoDB If you find your provisioned throughput is frequently exceeded by
the Hive operation, or if live write traffic is being throttled too
much, then reduce this value below |
dynamodb.endpoint | Specify the endpoint in case you have tables in different regions. For more information about the available Amazon DynamoDB endpoints, see Regions and Endpoints. |
dynamodb.max.map.tasks |
Specify the maximum number of map tasks when reading data from Amazon DynamoDB. This value must be equal to or greater than 1. |
dynamodb.retry.duration |
Specify the number of minutes to use as the timeout duration for retrying Hive commands. This value must be an integer equal to or greater than 0. The default timeout duration is two minutes. |
These options are set using the SET command as shown in the following
example.
SET dynamodb.throughput.read.percent=1.0; INSERT OVERWRITE TABLEs3_exportSELECT * FROMhiveTableName;
If you are using the AWS SDK for Java, you can use the -e option of Hive to pass in the command directly, as shown in the last line of the following example.
steps.add(new StepConfig()
.withName("Run Hive Script")
.withHadoopJarStep(new HadoopJarStepConfig()
.withJar("s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar")
.withArgs("s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
"--base-path","s3://us-east-1.elasticmapreduce/libs/hive/","--run-hive-script",
"--args","-e","SET dynamodb.throughput.read.percent=1.0;")));