Access HBase Data with Hive
To use Hive with HBase, you'll typically want to launch two clusters: one to run HBase and the other to run Hive. Running HBase and Hive separately can improve performance because it allows HBase to fully utilize its cluster's resources.
Although it is not recommended for most use cases, you can also run Hive and HBase on the same cluster.
A copy of HBase is installed on the AMI with Hive to provide the connection infrastructure for accessing your HBase cluster. The following sections show how to use the HBase client on your Hive cluster to connect to HBase on another cluster.
You can use Hive to connect to HBase and manipulate data, performing such actions as exporting data to Amazon S3, importing data from Amazon S3, and querying HBase data.
You can only connect your Hive cluster to a single HBase cluster.
To connect Hive to HBase
Use SSH to connect to the master node. For more information, see Connect to the Master Node Using SSH.
Launch the Hive shell with the following command.
hive
Connect the HBase client on your Hive cluster to the HBase cluster that contains your data. In the following example, public-DNS-name is replaced by the public DNS name of the master node of the HBase cluster, for example: ec2-50-19-76-67.compute-1.amazonaws.com. For more information, see To retrieve the public DNS name of the master node using the Amazon EMR console.
set hbase.zookeeper.quorum=public-DNS-name;
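If you would rather set the connection when Hive starts than at the Hive prompt, the Hive CLI's --hiveconf option accepts a property=value pair at launch. The following is a minimal sketch, not the documented procedure. It assumes the hive command is on the PATH of the master node; the variable names HBASE_MASTER_DNS and HIVE_QUORUM_OPT are illustrative, and the DNS name is the sample value from the text, which you must replace with your own.

```shell
#!/bin/sh
# Sample value from the text above; replace it with the public DNS name
# of your own HBase cluster's master node.
HBASE_MASTER_DNS="ec2-50-19-76-67.compute-1.amazonaws.com"

# --hiveconf passes a property=value pair to the Hive session at startup.
# Variable names here are illustrative, not part of any Amazon EMR API.
HIVE_QUORUM_OPT="hbase.zookeeper.quorum=${HBASE_MASTER_DNS}"

if command -v hive >/dev/null 2>&1; then
    # Start an interactive Hive shell that already points at the HBase cluster.
    hive --hiveconf "${HIVE_QUORUM_OPT}"
else
    echo "hive not found; run this on the master node of the Hive cluster" >&2
fi
```

Started this way, the session should behave as if the hbase.zookeeper.quorum property had been set at the Hive prompt.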
To access HBase data from Hive
After the connection between the Hive and HBase clusters has been made (as shown in the previous procedure), you can access the data stored on the HBase cluster by creating an external table in Hive.
The following example, when run from the Hive prompt, creates an external table that references data stored in an HBase table called inputTable. You can then reference inputTable in Hive statements to query and modify data stored in the HBase cluster.
The following example uses protobuf-java-2.4.0a.jar, the version in AMI 2.3.3, but you should modify the example to match your version. To check which version of the Protocol Buffers JAR you have, run the following command at the Hive command prompt:
! ls /home/hadoop/lib;
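To narrow that listing to the Protocol Buffers JAR alone, a small shell helper can filter the directory contents. This is a sketch meant for the master node's regular shell rather than the Hive prompt. The function name find_protobuf_jar is hypothetical, and /home/hadoop/lib is the directory used in the command above.

```shell
#!/bin/sh
# find_protobuf_jar DIR
# Prints the file name of each protobuf-java JAR found in DIR, one per line.
# (Hypothetical helper name, shown for illustration only.)
find_protobuf_jar() {
    ls "$1" 2>/dev/null | grep '^protobuf-java-.*\.jar$'
}

# Directory taken from the text above. If nothing matches, report it on stderr.
find_protobuf_jar /home/hadoop/lib ||
    echo "no protobuf-java JAR found in /home/hadoop/lib" >&2
```

The file name that is printed is the one to use in the add jar statement of the example.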
add jar lib/emr-metrics-1.0.jar ;
add jar lib/protobuf-java-2.4.0a.jar ;
set hbase.zookeeper.quorum=ec2-107-21-163-157.compute-1.amazonaws.com ;
create external table inputTable (key string, value string)
     stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
     with serdeproperties ("hbase.columns.mapping" = ":key,fam1:col1")
     tblproperties ("hbase.table.name" = "inputTable");
select count(*) from inputTable ;
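Exporting to Amazon S3, mentioned at the start of this topic, can follow the same pattern. The sketch below writes the statements to a script file and runs it with hive -f. It assumes the external table inputTable already exists as created in the example. The bucket path s3://mybucket/hbase-export/ and the file name export-to-s3.q are hypothetical, and the JAR version and DNS name are the sample values from the example, so adjust all of them to your setup. The relative lib/ paths assume the working directory is /home/hadoop.

```shell
#!/bin/sh
# Sample values copied from the example above; replace with your own.
HBASE_MASTER_DNS="ec2-107-21-163-157.compute-1.amazonaws.com"
PROTOBUF_JAR="lib/protobuf-java-2.4.0a.jar"
# Hypothetical S3 location and script name, for illustration only.
S3_EXPORT_PATH="s3://mybucket/hbase-export/"
SCRIPT_FILE="export-to-s3.q"

# Write the Hive statements to a script file. The add jar and set lines are
# repeated because they apply only to the session that runs them.
cat > "${SCRIPT_FILE}" <<EOF
add jar lib/emr-metrics-1.0.jar;
add jar ${PROTOBUF_JAR};
set hbase.zookeeper.quorum=${HBASE_MASTER_DNS};
insert overwrite directory '${S3_EXPORT_PATH}'
select key, value from inputTable;
EOF

if command -v hive >/dev/null 2>&1; then
    # Run the generated script non-interactively.
    hive -f "${SCRIPT_FILE}"
else
    echo "hive not found; wrote ${SCRIPT_FILE} without running it" >&2
fi
```

Note that insert overwrite directory replaces whatever is already at the target path, so point it at a location reserved for this export.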