Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Store Data with HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It was developed as part of the Apache Software Foundation's Hadoop project and runs on top of the Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop. HBase gives you a fault-tolerant, efficient way to store large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookups because it caches recently accessed data in memory, while the data itself is persisted to HDFS. HBase is optimized for sequential write operations and is highly efficient for batch inserts, updates, and deletes.

HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).

Additionally, HBase on Amazon EMR provides the ability to back up your HBase data directly to Amazon Simple Storage Service (Amazon S3). You can also restore from a previously created backup when launching an HBase cluster.

What Can I Do with HBase?

You can use HBase for random, repeated access to and modification of large volumes of data. HBase provides low-latency lookups and range scans, along with efficient updates and deletions of individual records.
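
To make lookups and range scans concrete, here is a minimal sketch of a table kept sorted by row key, which is how HBase orders its rows. It is a toy in-memory model, not the HBase client API: the class name `SortedRowStore`, its methods, and the sample row keys and columns are all hypothetical and chosen only for illustration.

```python
import bisect


class SortedRowStore:
    """Toy in-memory model of a table sorted by row key (NOT the HBase API)."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {column: value}; unset columns take no space

    def put(self, row_key, column, value):
        """Insert or update one cell of one row."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point lookup of a single row; returns None if the row is absent."""
        return self._rows.get(row_key)

    def delete(self, row_key):
        """Remove one row if it exists."""
        if row_key in self._rows:
            del self._rows[row_key]
            del self._keys[bisect.bisect_left(self._keys, row_key)]

    def scan(self, start_key, stop_key):
        """Yield (row_key, row) pairs with start_key <= row_key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


# Hypothetical sample data: rows arrive out of order but are stored sorted.
store = SortedRowStore()
store.put("user#0002", "info:country", "DE")
store.put("user#0001", "info:country", "US")
store.put("visit#0001", "info:page", "/home")

# "$" sorts immediately after "#", so this range covers every "user#" key.
user_rows = [key for key, _ in store.scan("user#", "user$")]
```

Because the keys are sorted, a range scan only has to locate its start position and then read forward, which is why the choice of row key matters for workloads that scan related records together.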

Here are several HBase use cases for you to consider:

  • Reference data for Hadoop analytics. With its direct integration with Hadoop and Hive and rapid access to stored data, HBase can be used to store reference data used by multiple Hadoop tasks or across multiple Hadoop clusters. This data can be stored directly on the cluster running Hadoop tasks or on a separate cluster. Types of analytics include analytics requiring fast access to demographic data, IP address geolocation lookup tables, and product dimensional data.

  • Real-time log ingestion and batch log analytics. HBase's high write throughput, optimization for sequential data, and efficient storage of sparse data make it a great solution for real-time ingestion of log data. At the same time, its integration with Hadoop and optimization for sequential reads and scans make it equally suited for batch analysis of that log data after ingestion. Common use cases include ingestion and analysis of application logs, clickstream data, and in-game usage data.

  • Store for high-frequency counters and summary data. Counter increments aren't just database writes; they're read-modify-writes, which makes them a very expensive operation for a relational database. However, because HBase is a non-relational, distributed database, it supports very high update rates and, given its consistent reads and writes, provides immediate access to the updated data. In addition, if you want to run more complex aggregations on the data (such as maximums, minimums, averages, and group-bys), you can run Hadoop jobs directly and feed the aggregated results back into HBase.
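
The counter use case can be sketched as follows. This is a conceptual model only, assuming an in-process lock as a stand-in for the atomic increment that a store must provide so that concurrent read-modify-writes do not lose updates. The names `CounterTable` and `record_hits`, and the sample row keys, are hypothetical and are not part of HBase or Amazon EMR.

```python
import threading
from collections import defaultdict


class CounterTable:
    """Toy counter store (NOT the HBase API); a lock makes each increment atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def increment(self, row_key, amount=1):
        """Atomically read, add, and write back; returns the new value."""
        with self._lock:
            self._counts[row_key] += amount
            return self._counts[row_key]

    def get(self, row_key):
        """Read the current value; every completed increment is visible."""
        with self._lock:
            return self._counts.get(row_key, 0)

    def snapshot(self):
        """Return a copy of all counters for batch aggregation."""
        with self._lock:
            return dict(self._counts)


def record_hits(table, row_key, n):
    """Simulate one writer applying n increments to the same counter."""
    for _ in range(n):
        table.increment(row_key)


table = CounterTable()

# Four concurrent writers update the same hypothetical page-view counter.
threads = [
    threading.Thread(target=record_hits, args=(table, "page#/home", 1000))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
table.increment("page#/about", 250)

# A batch step (a Hadoop job in the text above) derives summary values.
counts = table.snapshot()
summary = {
    "max": max(counts.values()),
    "min": min(counts.values()),
    "avg": sum(counts.values()) / len(counts),
}
```

In this sketch the summary stays in a local dictionary; in the workflow described above, the aggregated results would be written back into HBase so that later lookups can read them directly.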