To use Hive with HBase you'll typically want to launch two clusters, one to run HBase and the other to run Hive. Running HBase and Hive separately can improve performance because this allows HBase to fully utilize the cluster resources.
Although it is not recommended for most use cases, you can also run Hive and HBase on the same cluster.
A copy of HBase is installed on the AMI with Hive to provide connection infrastructure to access your HBase cluster. The following sections show how to use the client portion of the copy of HBase on your Hive cluster to connect to HBase on another cluster.
The connection between the Hive and HBase clusters is structured as shown in the following diagram.
You can use Hive to connect to HBase and manipulate data, performing such actions as exporting data to Amazon S3, importing data from Amazon S3, and querying HBase data.
You can only connect your Hive cluster to a single HBase cluster.
To connect Hive to HBase
Create an interactive Hive cluster. Use Hive version 0.7 or later and AMI version 2.0.4 or later. The following example shows how to launch such a cluster using the Amazon EMR CLI.
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --alive --instance-type m2.4xlarge --num-instances 4 \
--hive-interactive --hive-versions latest
Windows users:
ruby elastic-mapreduce --create --alive --instance-type m2.4xlarge --num-instances 4 --hive-interactive --hive-versions latest
Use SSH to connect to the master node. For more information, see Connect to the Master Node Using SSH.
Launch the Hive shell with the following command.
hive
Connect the HBase client on your Hive cluster to the HBase cluster that contains your data. In the following example, replace public-DNS-name with the public DNS name of the master node of the HBase cluster, for example, ec2-50-19-76-67.compute-1.amazonaws.com. For more information, see To locate the public DNS name of the master node using the Amazon EMR console.
set hbase.zookeeper.quorum=public-DNS-name;
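If you reconnect often, it may help to keep the connection setting in a small file instead of retyping it at the Hive prompt. The following is a minimal sketch, assuming the connection is made by setting the Hive property hbase.zookeeper.quorum. The file name hbase-connect.hql is hypothetical, and the DNS name is the sample value from this page, so replace both with your own.

```shell
#!/bin/sh
# Public DNS name of the HBase master node. This is the sample value from
# this page; replace it with the name of your own master node.
HBASE_MASTER_DNS="ec2-50-19-76-67.compute-1.amazonaws.com"

# Hypothetical name for the Hive initialization file.
INIT_FILE="hbase-connect.hql"

# Write the same statement you would otherwise type at the Hive prompt.
printf 'set hbase.zookeeper.quorum=%s;\n' "$HBASE_MASTER_DNS" > "$INIT_FILE"

# Print the generated file so you can check it before using it.
cat "$INIT_FILE"
```

You could then start the shell with hive -i hbase-connect.hql, assuming your Hive version supports the -i (initialization file) option.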
To access HBase data from Hive
After the connection between the Hive and HBase clusters has been made (as shown in the previous procedure), you can access the data stored on the HBase cluster by creating an external table in Hive.
The following example, when run from the Hive prompt, creates an external table that references data stored in an HBase table called inputTable. You can then reference inputTable in Hive statements to query and modify data stored in the HBase cluster.
The following example uses protobuf-java-2.4.0a.jar in AMI 2.3.3, but you should modify the example to match your version. To check which version of the Protocol Buffers JAR you have, run the following command at the Hive command prompt:
! ls /home/hadoop/lib;
add jar lib/emr-metrics-1.0.jar ;
add jar lib/protobuf-java-2.4.0a.jar ;
set hbase.zookeeper.quorum=ec2-107-21-163-157.compute-1.amazonaws.com ;
create external table inputTable (key string, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,fam1:col1")
tblproperties ("hbase.table.name" = "inputTable");
select count(*) from inputTable ;
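Exporting HBase data to Amazon S3, mentioned earlier as one use of this connection, could be scripted along the following lines. This is a sketch under stated assumptions, not a tested procedure. The bucket path s3://mybucket/hbase-export/ and the script name export-inputTable.q are hypothetical. The JAR names and quorum value are copied from the example above. The external table inputTable is assumed to exist already.

```shell
#!/bin/sh
# Values copied from the example above; replace them with your own.
HBASE_MASTER_DNS="ec2-107-21-163-157.compute-1.amazonaws.com"

# Hypothetical Amazon S3 location and script file name.
EXPORT_PATH="s3://mybucket/hbase-export/"
SCRIPT="export-inputTable.q"

# Generate a Hive script that repeats the session setup (JARs and quorum)
# and then writes the rows of inputTable to the S3 location.
cat > "$SCRIPT" <<EOF
add jar lib/emr-metrics-1.0.jar;
add jar lib/protobuf-java-2.4.0a.jar;
set hbase.zookeeper.quorum=${HBASE_MASTER_DNS};
insert overwrite directory '${EXPORT_PATH}'
select key, value from inputTable;
EOF

# Print the generated script for review before running it.
cat "$SCRIPT"
```

To run the generated script on the master node of the Hive cluster, you could use hive -f export-inputTable.q from /home/hadoop, because the JAR paths are relative to that directory. The setup statements are repeated in the script because add jar and set apply only to the current Hive session.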