Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Access HBase Data with Hive

To use Hive with HBase, you typically launch two clusters: one to run HBase and the other to run Hive. Running HBase and Hive separately can improve performance because HBase can then fully use the resources of its own cluster.

Although it is not recommended for most use cases, you can also run Hive and HBase on the same cluster.

A copy of HBase is installed on the AMI with Hive to provide connection infrastructure to access your HBase cluster. The following sections show how to use the client portion of the copy of HBase on your Hive cluster to connect to HBase on another cluster.

You can use Hive to connect to HBase and manipulate data, performing such actions as exporting data to Amazon S3, importing data from Amazon S3, and querying HBase data.

Note

You can connect your Hive cluster to only one HBase cluster.

To connect Hive to HBase

  1. Create a cluster with Hive installed using the steps in Launch a Cluster and Submit Hive Work. Create a second cluster with HBase installed using the steps in Install HBase on an Amazon EMR Cluster.

  2. Use SSH to connect to the master node of the Hive cluster. For more information, see Connect to the Master Node Using SSH.

  3. Launch the Hive shell with the following command.

    hive
  4. Connect the HBase client on your Hive cluster to the HBase cluster that contains your data. In the following example, replace public-DNS-name with the public DNS name of the master node of the HBase cluster, for example, ec2-50-19-76-67.compute-1.amazonaws.com. For more information, see To retrieve the public DNS name of the master node using the Amazon EMR console.

    set hbase.zookeeper.quorum=public-DNS-name;
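
As an alternative to typing the set command in every session, the same property can be passed when the Hive shell starts. The following is a minimal sketch, not part of the original procedure: it assumes the standard Hive command-line option --hiveconf is available in your Hive version. The function name hbase_quorum_arg and the variable HBASE_MASTER_DNS are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: build the property assignment for the HBase ZooKeeper quorum.
# hbase_quorum_arg is a hypothetical helper, not an Amazon EMR or Hive command.

# Print "hbase.zookeeper.quorum=<host>" for a single host name.
# Fails if the argument is empty or contains whitespace.
hbase_quorum_arg() {
  local host="$1"
  case "$host" in
    ''|*[[:space:]]*)
      echo "expected a single host name, got: '$host'" >&2
      return 1
      ;;
  esac
  printf 'hbase.zookeeper.quorum=%s\n' "$host"
}

# Usage on the Hive cluster master node (not run here), where
# HBASE_MASTER_DNS holds the public DNS name of the HBase master node:
#   hive --hiveconf "$(hbase_quorum_arg "$HBASE_MASTER_DNS")"
```

Starting Hive this way should have the same effect as step 4 for that session. If you are unsure whether your Hive version accepts the option, use the set command shown in the procedure.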
    				

To access HBase data from Hive

  • After the connection between the Hive and HBase clusters has been made (as shown in the previous procedure), you can access the data stored on the HBase cluster by creating an external table in Hive.

    The following example, when run from the Hive prompt, creates an external table that references data stored in an HBase table called inputTable. You can then reference inputTable in Hive statements to query and modify data stored in the HBase cluster.

    Note

    The following example uses protobuf-java-2.4.0a.jar, the version included in AMI 2.3.3. Modify the example to match the version on your cluster. To check which version of the Protocol Buffers JAR you have, run the following command at the Hive command prompt: ! ls /home/hadoop/lib;

    add jar lib/emr-metrics-1.0.jar ;
    add jar lib/protobuf-java-2.4.0a.jar ;
    
    set hbase.zookeeper.quorum=ec2-107-21-163-157.compute-1.amazonaws.com ;
    
    create external table inputTable (key string, value string)
        stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        with serdeproperties ("hbase.columns.mapping" = ":key,fam1:col1")
        tblproperties ("hbase.table.name" = "inputTable");
    
    select count(*) from inputTable ;
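
Because the Protocol Buffers JAR version differs between AMIs, it can help to detect the file name rather than hard-code it. The following sketch, run from the shell on the Hive cluster master node, generates the statements from the example above with whichever protobuf-java JAR it finds. The function names find_protobuf_jar and emit_hive_script and the file name hbase_example.q are hypothetical. The sketch assumes that the JAR is in /home/hadoop/lib, as in the note above, and that Hive is started from /home/hadoop so that the relative lib/ paths resolve.

```shell
#!/usr/bin/env bash
# Sketch: emit the example Hive statements with the detected protobuf JAR.
# All function names here are illustrative, not part of Amazon EMR or Hive.

# Print the file name of the first protobuf-java JAR found in a directory.
find_protobuf_jar() {
  local lib_dir="$1" jar
  for jar in "$lib_dir"/protobuf-java-*.jar; do
    # If the glob matched nothing, $jar is the literal pattern and -e fails.
    if [ -e "$jar" ]; then
      basename "$jar"
      return 0
    fi
  done
  echo "no protobuf-java JAR found in $lib_dir" >&2
  return 1
}

# Print the Hive statements from the example, using the detected JAR.
# Arguments: lib directory, HBase master public DNS name, HBase table name.
emit_hive_script() {
  local lib_dir="$1" quorum="$2" table="$3" jar
  jar="$(find_protobuf_jar "$lib_dir")" || return 1
  cat <<EOF
add jar lib/emr-metrics-1.0.jar ;
add jar lib/${jar} ;

set hbase.zookeeper.quorum=${quorum} ;

create external table ${table} (key string, value string)
    stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    with serdeproperties ("hbase.columns.mapping" = ":key,fam1:col1")
    tblproperties ("hbase.table.name" = "${table}");

select count(*) from ${table} ;
EOF
}

# Usage (not run here):
#   emit_hive_script /home/hadoop/lib "$HBASE_MASTER_DNS" inputTable > hbase_example.q
#   hive -f hbase_example.q
```

The generated file keeps the column mapping from the example: the Hive column key maps to the HBase row key (:key) and value maps to fam1:col1. The mapping needs one entry per Hive column, in the same order, so adjust both together if your table differs.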