Access HBase Tables with Hive
HBase, Hive, and Amazon EMR (EMR 3.x releases) are tightly integrated, allowing you to run massively parallel processing workloads directly on data stored in HBase. To use Hive with HBase, you can usually launch them on the same cluster. You can, however, launch Hive and HBase on separate clusters. Running them on separate clusters can improve performance because each application can use its own cluster's resources more efficiently.
The following procedures show how to connect to HBase on a cluster using Hive.
You can only connect a Hive cluster to a single HBase cluster.
To connect Hive to HBase
Create separate clusters with Hive and HBase installed or create a single cluster with both HBase and Hive installed.
If you are using separate clusters, modify your security groups so that HBase and Hive ports are open between these two master nodes.
Use SSH to connect to the master node for the cluster with Hive installed. For more information, see Connect to the Master Node Using SSH.
Launch the Hive shell with the following command.
hive
(Optional) Skip this step if HBase and Hive are located on the same cluster. Connect the HBase client on your Hive cluster to the HBase cluster that contains your data. In the following example, replace public-DNS-name with the public DNS name of the master node of the HBase cluster, for example ec2-107-21-163-157.compute-1.amazonaws.com:
set hbase.zookeeper.quorum=public-DNS-name;
Proceed to run Hive queries on your HBase data as desired or see the next procedure.
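Before running queries, it can help to confirm that the quorum setting took effect in the current session. The following is a minimal sketch, assuming you are still at the Hive prompt from the previous steps. Issuing set with a property name and no value prints the value currently held by the session.

```sql
-- Print the ZooKeeper quorum the Hive session currently points at.
-- The output should show the public DNS name of the HBase master node
-- that you set in the optional step above.
set hbase.zookeeper.quorum;
```

If the printed value is not the HBase master node you expect, repeat the set command from the optional step before continuing.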
To access HBase data from Hive
After the connection between the Hive and HBase clusters has been made (as shown in the previous procedure), you can access the data stored on the HBase cluster by creating an external table in Hive.
The following example, when run from the Hive prompt, creates an external Hive table called inputTable that references data stored in an HBase table called t1. You can then reference inputTable in Hive statements to query and modify data stored in the HBase cluster.
The following example uses protobuf-java-2.4.0a.jar in AMI 2.3.3, but you should modify the example to match your version. To check which version of the Protocol Buffers JAR you have, run the command ! ls /home/hadoop/lib; at the Hive command prompt.
add jar lib/emr-metrics-1.0.jar;
add jar lib/protobuf-java-2.4.0a.jar;
set hbase.zookeeper.quorum=ec2-107-21-163-157.compute-1.amazonaws.com;
create external table inputTable (key string, value string)
  stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  with serdeproperties ("hbase.columns.mapping" = ":key,f1:col1")
  tblproperties ("hbase.table.name" = "t1");
select count(*) from inputTable;
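The hbase.columns.mapping property pairs Hive columns with HBase columns by position: the first entry, :key, maps to the HBase row key, and each later entry names a column family and qualifier. As a minimal sketch, suppose the HBase table t1 also held a second qualifier, f1:col2. That qualifier is a hypothetical addition and is not part of the example above. A wider Hive table could then be declared as follows. The table name inputTableWide, the Hive column names, and the row key 'row1' are illustrative assumptions only.

```sql
-- Assumes the add jar and set hbase.zookeeper.quorum statements from the
-- example above have already been run in this Hive session.

-- Hypothetical wider mapping: three Hive columns, three mapping entries.
--   key    <- HBase row key   (:key)
--   value1 <- f1:col1         (as in the example above)
--   value2 <- f1:col2         (assumed to exist in t1)
create external table inputTableWide (key string, value1 string, value2 string)
  stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  with serdeproperties ("hbase.columns.mapping" = ":key,f1:col1,f1:col2")
  tblproperties ("hbase.table.name" = "t1");

-- Look up a single row by its HBase row key ('row1' is a placeholder).
select key, value1, value2 from inputTableWide where key = 'row1';
```

The number of entries in hbase.columns.mapping must match the number of columns declared for the Hive table. Because the table is external, dropping it in Hive removes only the Hive metadata and leaves the HBase table t1 in place.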
For a more advanced use case and example combining HBase and Hive, see the AWS Big Data Blog post, Combine NoSQL and Massively Parallel Analytics Using Apache HBase and Apache Hive on Amazon EMR.