Configure HBase

Although the default settings should work for most applications, you can modify your HBase configuration settings. To do this, use the configuration API when you create the cluster.

The following example creates a cluster with an alternate HBase root directory:

aws emr create-cluster --release-label emr-5.2.1 --applications Name=HBase \
--instance-type m3.xlarge --instance-count 2 --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

Note

For Windows, replace the above Linux line continuation character (\) with the caret (^).

myConfig.json:

[
  {
    "Classification":"hbase-site",
    "Properties": {
       "hbase.rootdir": "hdfs://ip-XXX-XX-XX-XXX.ec2.internal:8020/user/myCustomHBaseDir"
    }
  }
]
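If you prefer to generate myConfig.json rather than write it by hand, the following is a minimal Python sketch that builds the same structure. The helper name hbase_site_config is hypothetical and is not part of any AWS tool; the host name in hbase.rootdir is the placeholder from the example above and must be replaced with your own.

```python
import json


def hbase_site_config(properties):
    """Wrap HBase properties in an hbase-site classification (hypothetical helper)."""
    # The examples in this guide give every property value as a string,
    # so each value is converted with str() here.
    return [
        {
            "Classification": "hbase-site",
            "Properties": {key: str(value) for key, value in properties.items()},
        }
    ]


# Placeholder host name copied from the example above; replace it with your own.
root_dir = "hdfs://ip-XXX-XX-XX-XXX.ec2.internal:8020/user/myCustomHBaseDir"

config = hbase_site_config({"hbase.rootdir": root_dir})
config_json = json.dumps(config, indent=2)
print(config_json)
```

You could redirect the printed output to myConfig.json and then reference that file with the --configurations option.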

Note

If you plan to store your configuration in Amazon S3, you must specify the URL location of the object. For example:

aws emr create-cluster --release-label emr-5.2.1 --applications Name=HBase \
--instance-type m3.xlarge --instance-count 3 --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

Changes to Memory Allocation in YARN

HBase does not run as a YARN application, so when HBase is installed, the memory allocated to YARN and its applications is recalculated, which reduces the overall memory available to YARN. Take this into account when you plan to co-locate YARN applications and HBase on the same cluster. On instance types with less than 64 GB of memory, half of the memory available to NodeManager is allocated to the HBase RegionServer instead. On instance types with more than 64 GB of memory, HBase RegionServer memory is capped at 32 GB. As a general rule, the YARN memory setting is some multiple of the MapReduce reducer task memory.

The tables in Task JVM Memory Settings show changes to YARN settings based on the memory needed for HBase.
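As a rough illustration of the rule described in this section, the following Python sketch estimates how memory might be split between the HBase RegionServer and YARN. It is a simplification under stated assumptions: the input is the memory that NodeManager would otherwise receive, the RegionServer takes half of it up to the 32 GB cap, and the function names are hypothetical. The actual values that Amazon EMR applies are the ones in the Task JVM Memory Settings tables.

```python
# Cap on HBase RegionServer memory stated in this section.
HBASE_REGIONSERVER_CAP_GB = 32


def regionserver_memory_gb(node_manager_gb):
    """Estimate RegionServer memory (assumed rule: half of NodeManager memory, capped)."""
    return min(node_manager_gb / 2, HBASE_REGIONSERVER_CAP_GB)


def yarn_memory_after_hbase_gb(node_manager_gb):
    """Estimate what remains for YARN once the RegionServer share is removed."""
    return node_manager_gb - regionserver_memory_gb(node_manager_gb)


# Illustrative inputs only; these are not values taken from the EMR tables.
for node_manager_gb in (24, 200):
    hbase_gb = regionserver_memory_gb(node_manager_gb)
    yarn_gb = yarn_memory_after_hbase_gb(node_manager_gb)
    print(f"NodeManager {node_manager_gb} GB: HBase {hbase_gb} GB, YARN {yarn_gb} GB")
```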

HBase Port Numbers

Some port numbers chosen for HBase are different from the default. The following are interfaces and ports for HBase on Amazon EMR.

HBase Ports

Interface            Port     Protocol
HMaster              16000    TCP
HMaster UI           16010    HTTP
RegionServer         16020    TCP
RegionServer Info    16030    HTTP
REST server          8070     HTTP
REST UI              8085     HTTP
Thrift server        9090     TCP
Thrift server UI     9095     HTTP
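The port list above can be kept in a small lookup table when you script connectivity checks. The following Python sketch is one way to do that, using only the standard library. The names HBASE_PORTS, port_is_open, and ui_url are hypothetical, and the checks assume that your security groups allow access to the ports from where the script runs.

```python
import socket

# Interfaces and ports for HBase on Amazon EMR, as listed in the table above.
HBASE_PORTS = {
    "HMaster": (16000, "TCP"),
    "HMaster UI": (16010, "HTTP"),
    "RegionServer": (16020, "TCP"),
    "RegionServer Info": (16030, "HTTP"),
    "REST server": (8070, "HTTP"),
    "REST UI": (8085, "HTTP"),
    "Thrift server": (9090, "TCP"),
    "Thrift server UI": (9095, "HTTP"),
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ui_url(host, interface):
    """Build the URL for one of the HTTP interfaces in HBASE_PORTS."""
    port, protocol = HBASE_PORTS[interface]
    if protocol != "HTTP":
        raise ValueError(f"{interface} is not an HTTP interface")
    return f"http://{host}:{port}/"
```

For example, ui_url(host, "HMaster UI") returns the address of the HMaster web UI for a given master node host name, and port_is_open(host, 16000) reports whether the HMaster port is reachable.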

Important

The Hadoop KMS port is changed in Amazon EMR release 4.6 or later. kms-http-port is now 9700 and kms-admin-port is 9701.

HBase Site Settings to Optimize

You can set any or all of the HBase site settings to optimize the HBase cluster for your application's workload. We recommend the following settings as a starting point in your investigation.

zookeeper.session.timeout

The default timeout is three minutes (180000 ms). If a region server crashes, this is how long it takes the master server to notice the absence of the region server and start recovery. To help the master server recover faster, you can reduce this value to a shorter time period. The following example uses one minute, or 60000 ms:

[
  {
    "Classification":"hbase-site",
    "Properties": {
       "zookeeper.session.timeout": "60000"
    }
  }
]

hbase.regionserver.handler.count

This defines the number of threads the region server keeps open to serve requests to tables. The default of 10 is deliberately low, to prevent users from overloading their region servers when they use large write buffers with a high number of concurrent clients. The rule of thumb is to keep this number low when the payload per request approaches the MB range (big puts, scans using a large cache) and high when the payload is small (gets, small puts, ICVs, deletes). The following example raises the number of open threads to 30:

[
  {
    "Classification":"hbase-site",
    "Properties": {
       "hbase.regionserver.handler.count": "30"
    }
  }
]

hbase.hregion.max.filesize

This parameter governs the maximum size, in bytes, of the individual regions. By default, it is set to 256 MB. If you are writing a lot of data into your HBase cluster and it's causing frequent splitting, you can increase this size to make individual regions bigger. Larger regions reduce splitting but take more time to load balance from one server to another. The following example sets the maximum region size to 1 GB (1073741824 bytes):

[
  {
    "Classification":"hbase-site",
    "Properties": {
       "hbase.hregion.max.filesize": "1073741824"
    }
  }
]

hbase.hregion.memstore.flush.size

This parameter governs the maximum size of the memstore, in bytes, before it is flushed to disk. By default, it is 64 MB. If your workload consists of short bursts of write operations, you might want to increase this limit so that all writes stay in memory during the burst and are flushed to disk later. This can boost performance during bursts. The following example raises the limit to 128 MB (134217728 bytes):

[
  {
    "Classification":"hbase-site",
    "Properties": {
       "hbase.hregion.memstore.flush.size": "134217728"
    }
  }
]
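Because all four settings above belong to the same hbase-site classification, they can be supplied together in one configuration object. The following Python sketch builds such an object from the example values in this section and derives the byte counts by arithmetic rather than hard-coding them. The variable names are illustrative, and the values are the examples from this page, not tuning recommendations for any particular workload.

```python
import json

MB = 1024 * 1024  # bytes in one megabyte

# Example values from this section, expressed in readable units.
settings = {
    "zookeeper.session.timeout": 60 * 1000,         # one minute, in ms
    "hbase.regionserver.handler.count": 30,         # open handler threads
    "hbase.hregion.max.filesize": 1024 * MB,        # 1 GB, in bytes
    "hbase.hregion.memstore.flush.size": 128 * MB,  # 128 MB, in bytes
}

# Property values are given as strings, matching the JSON examples above.
config = [
    {
        "Classification": "hbase-site",
        "Properties": {name: str(value) for name, value in settings.items()},
    }
]

print(json.dumps(config, indent=2))
```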