Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Configure HBase

Although the default settings should work for most applications, you have the flexibility to modify your HBase configuration settings. To do this, you run one of two bootstrap action scripts:

  • configure-hbase-daemons: Configures properties of the master, regionserver, and zookeeper daemons. These properties include heap size and options to pass to the Java Virtual Machine (JVM) when the HBase daemon starts. You set these properties as arguments in the bootstrap action. This bootstrap action modifies the /home/hadoop/conf/hbase-user-env.sh configuration file on the HBase cluster.

  • configure-hbase: Configures HBase site-specific settings such as the port the HBase master should bind to and the maximum number of times the CLI client should retry an action. You can set these one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3. This bootstrap action modifies the /home/hadoop/conf/hbase-site.xml configuration file on the HBase cluster.

Note

These scripts, like other bootstrap actions, can only be run when the cluster is created; you cannot use them to change the configuration of an HBase cluster that is already running.

When you run the configure-hbase or configure-hbase-daemons bootstrap actions, the values you specify override the default values. Any values you don't explicitly set receive the default values.

Configuring HBase with these bootstrap actions is analogous to using bootstrap actions in Amazon EMR to configure Hadoop settings and Hadoop daemon properties. The difference is that HBase does not have per-process memory options. Instead, memory options are set using the --daemon-opts argument, where daemon is replaced by the name of the daemon to configure.

Configure HBase Daemons

Amazon EMR provides a bootstrap action, s3://region.elasticmapreduce/bootstrap-actions/configure-hbase-daemons, that you can use to change the configuration of HBase daemons, where region is the region into which you're launching your HBase cluster.

For a list of regions supported by Amazon EMR, see Choose an AWS Region. The bootstrap action can only be run when the HBase cluster is launched.
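
As a minimal sketch of how the region placeholder resolves, the following shell function substitutes a region name into the bucket path described above. The function name daemon_action_path and the us-west-2 value are illustrative assumptions, not part of Amazon EMR.

```shell
# Hypothetical helper: build the region-specific path to the
# configure-hbase-daemons bootstrap action.
daemon_action_path() {
  region="$1"
  printf 's3://%s.elasticmapreduce/bootstrap-actions/configure-hbase-daemons\n' "$region"
}

# Example region chosen only for illustration.
daemon_action_path us-west-2
```

The printed value is what you would supply as the bootstrap action path when launching a cluster in that region.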

You can configure a bootstrap action using the console, the AWS CLI, the Amazon EMR CLI, or the API. For more information on configuring bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional).

To configure HBase daemons using the AWS CLI

Add the bootstrap action, configure-hbase-daemons, when you launch the cluster to configure one or more HBase daemons.

You can set the following properties with the configure-hbase-daemons bootstrap action.

Variable             Description
hbase-master-opts    Options that control how the JVM runs the master daemon. If set, these settings override the default HBASE_MASTER_OPTS variables.
regionserver-opts    Options that control how the JVM runs the region server daemon. If set, these settings override the default HBASE_REGIONSERVER_OPTS variables.
zookeeper-opts       Options that control how the JVM runs the zookeeper daemon. If set, these settings override the default HBASE_ZOOKEEPER_OPTS variables.

For more information about these options, go to http://hbase.apache.org/configuration.html#hbase.env.sh.

  • To create a new cluster with HBase installed and to use a bootstrap action to configure HBase daemons, type the following command:

    aws emr create-cluster --ami-version string --applications Name=string --name string --instance-count integer --instance-type string \
    --bootstrap-action Path=string,Args=[arg1,arg2]

    For example, to create a new cluster with HBase installed and to use a bootstrap action to set values for --hbase-zookeeper-opts, --hbase-master-opts, and --hbase-regionserver-opts (the JVM options used by the zookeeper, master, and region server daemons of the cluster), type:

    aws emr create-cluster --ami-version 3.1.1 --applications Name=HBase --name "My HBase Cluster" --instance-count 5 --instance-type c1.xlarge \
    --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons,Args=["--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19","--hbase-master-opts=-Xmx2048m","--hbase-regionserver-opts=-Xmx4096m"]

    Note

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To configure HBase daemons using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • Add a bootstrap action, configure-hbase-daemons, when you launch the HBase cluster. You can use this bootstrap action to configure one or more daemons.

    The following example creates a new HBase cluster and uses the configure-hbase-daemons bootstrap action to set values for --hbase-zookeeper-opts, --hbase-master-opts, and --hbase-regionserver-opts, which configure the JVM options used by the zookeeper, master, and region server daemons of the HBase cluster.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons --args "--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19,--hbase-master-opts=-Xmx2048m,--hbase-regionserver-opts=-Xmx4096m"
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase-daemons --args "--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19,--hbase-master-opts=-Xmx2048m,--hbase-regionserver-opts=-Xmx4096m"

Note

When you specify the arguments for this bootstrap action, you must put quotes around the --args parameter value to keep the shell from breaking the arguments up. You must also include a space character between JVM arguments; in the example above there is a space between -Xmx1024m and -XX:GCTimeRatio=19.
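
The effect of the quotes can be checked locally without launching a cluster. The following is a minimal sketch; count_args is a hypothetical helper that only reports how many arguments the shell hands to a command.

```shell
# Hypothetical helper: report how many separate arguments the shell passes.
count_args() {
  echo "$#"
}

# Unquoted: the shell splits at the space, so two arguments arrive.
count_args --hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19

# Quoted: the whole value, including the space, arrives as one argument.
count_args "--hbase-zookeeper-opts=-Xmx1024m -XX:GCTimeRatio=19"
```

The first call prints 2 and the second prints 1, which is why the --args value needs to be quoted when it contains a space.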

Configure HBase Site Settings

Amazon EMR provides a bootstrap action, s3://elasticmapreduce/bootstrap-actions/configure-hbase, that you can use to change the configuration of HBase. You can set configuration values one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3. Setting configuration values one-by-one is useful if you only need to set a few configuration settings. Setting them using an XML file is useful if you have many changes to make, or if you want to save your configuration settings for reuse.

Note

You can prefix the Amazon S3 bucket name with a region prefix, such as s3://region.elasticmapreduce/bootstrap-actions/configure-hbase, where region is the region into which you're launching your HBase cluster. For a list of all the regions supported by Amazon EMR, see Choose an AWS Region.

This bootstrap action modifies the /home/hadoop/conf/hbase-site.xml configuration file on the HBase cluster. The bootstrap action can only be run when the HBase cluster is launched. For more information on configuring bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional).

For a complete list of the HBase site settings that you can configure, go to http://hbase.apache.org/configuration.html#hbase.site.

To specify individual HBase site settings using the AWS CLI

Set the configure-hbase bootstrap action when you launch the HBase cluster, and specify the values within hbase-site.xml to change.

  • Type the following command to configure HBase using a bootstrap action:

    aws emr create-cluster --ami-version string --applications Name=string --name string --instance-count integer --instance-type string \
    --bootstrap-action Path=string,Args=[arg1,arg2]

    For example, to change the hbase.hregion.max.filesize setting, type:

    aws emr create-cluster --ami-version 3.1.1 --applications Name=HBase --name "My HBase Cluster" --instance-count 5 --instance-type c1.xlarge \
    --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.hregion.max.filesize=52428800"]

    Note

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To specify individual HBase site settings using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • Set the configure-hbase bootstrap action when you launch the HBase cluster, and specify the values within hbase-site.xml to change. The following example illustrates how to change the hbase.hregion.max.filesize setting.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase \
      --args -s,hbase.hregion.max.filesize=52428800
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase --args -s,hbase.hregion.max.filesize=52428800

To specify HBase site settings with an XML file using the AWS CLI

  1. Create a custom version of hbase-site.xml. Your custom file must be valid XML. To reduce the chance of introducing errors, start with the default copy of hbase-site.xml, located on the Amazon EMR HBase master node at /home/hadoop/conf/hbase-site.xml, and edit a copy of that file instead of creating a file from scratch. You can give your new file a new name, or leave it as hbase-site.xml.

  2. Upload your custom hbase-site.xml file to an Amazon S3 bucket. It should have permissions set so the AWS account that launches the cluster can access the file. If the AWS account launching the cluster also owns the Amazon S3 bucket, it will have access.

  3. Set the configure-hbase bootstrap action when you launch the HBase cluster, and pass in the location of your custom hbase-site.xml file.

    The following example sets the HBase site configuration values to those specified in the file s3://mybucket/config.xml:

    aws emr create-cluster --ami-version 3.1.1 --applications Name=HBase --name "My HBase Cluster" --instance-count 5 --instance-type c1.xlarge \
    --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["--site-config-file","s3://mybucket/config.xml"]

    Note

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
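
For illustration only, a custom site file might look like the following sketch. The two properties and their values are taken from examples elsewhere in this section; the local file name config.xml and the bucket name are placeholders, and the upload line is left as a comment because it needs AWS credentials and a bucket you own.

```shell
# Write a minimal hbase-site.xml-style file. In practice, start from a copy
# of /home/hadoop/conf/hbase-site.xml on the master node and edit that copy.
cat > config.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1073741824</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
</configuration>
EOF

# Upload step (placeholder bucket; requires AWS credentials):
# aws s3 cp config.xml s3://mybucket/config.xml
```

Each setting is a property element with a name and a value inside a single configuration element, which is the layout the default hbase-site.xml uses.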

To specify HBase site settings with an XML file using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  1. Create a custom version of hbase-site.xml. Your custom file must be valid XML. To reduce the chance of introducing errors, start with the default copy of hbase-site.xml, located on the Amazon EMR HBase master node at /home/hadoop/conf/hbase-site.xml, and edit a copy of that file instead of creating a file from scratch. You can give your new file a new name, or leave it as hbase-site.xml.

  2. Upload your custom hbase-site.xml file to an Amazon S3 bucket. It should have permissions set so the AWS account that launches the cluster can access the file. If the AWS account launching the cluster also owns the Amazon S3 bucket, it will have access.

  3. Set the configure-hbase bootstrap action when you launch the HBase cluster, and pass in the location of your custom hbase-site.xml file.

    The following example sets the HBase site configuration values to those specified in the file s3://bucket/config.xml.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "My HBase Cluster" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase \
      --args --site-config-file s3://bucket/config.xml
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "My HBase Cluster" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase --args --site-config-file s3://bucket/config.xml

HBase Site Settings to Optimize

You can set any or all of the HBase site settings to optimize the HBase cluster for your application's workload. We recommend the following settings as a starting point in your investigation. If you specify more than one option, you must prepend each key-value pair with a -s option switch. All options below are for the AWS CLI. For more information on using the parameters in the Amazon EMR CLI, see the Command Line Interface Reference for Amazon EMR.

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.
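
As a minimal sketch of the rule that every key-value pair gets its own -s switch, the following shell function assembles an Args list from any number of settings. The helper name build_site_args is hypothetical; the two settings are the ones discussed below.

```shell
# Hypothetical helper: turn key=value pairs into an Args list in which
# every pair is preceded by its own -s switch.
build_site_args() (
  args=""
  for kv in "$@"; do
    # Append a comma only when the list already has entries.
    args="${args:+$args,}\"-s\",\"$kv\""
  done
  printf 'Args=[%s]\n' "$args"
)

build_site_args "zookeeper.session.timeout=60000" "hbase.regionserver.handler.count=30"
```

The printed list has the same shape as the Args values used in the examples in this section, with one -s entry in front of each setting.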

zookeeper.session.timeout

The default timeout is three minutes (180000 ms). If a region server crashes, this is how long it takes the master server to notice the absence of the region server and start recovery. If you want the master server to recover faster, you can reduce this value to a shorter time period. The following example uses one minute, or 60000 ms.

--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","zookeeper.session.timeout=60000"]

hbase.regionserver.handler.count

This defines the number of threads the region server keeps open to serve requests to tables. The default of 10 is low, in order to prevent users from killing their region servers when using large write buffers with a high number of concurrent clients. The rule of thumb is to keep this number low when the payload per request approaches the MB range (big puts, scans using a large cache) and high when the payload is small (gets, small puts, ICVs, deletes). The following example raises the number of open threads to 30.

--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.regionserver.handler.count=30"]

hbase.hregion.max.filesize

This parameter governs the size, in bytes, of the individual regions. By default, it is set to 256 MB. If you are writing a lot of data into your HBase cluster and it's causing frequent splitting, you can increase this size to make individual regions bigger. Doing so reduces splitting, but it takes more time to load balance regions from one server to another. The following example sets the maximum region size to 1 GB (1073741824 bytes).

--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.hregion.max.filesize=1073741824"]

hbase.hregion.memstore.flush.size

This parameter governs the maximum size of the memstore, in bytes, before it is flushed to disk. By default, it is 64 MB. If your workload consists of short bursts of write operations, you might want to increase this limit so all writes stay in memory during the burst and get flushed to disk later. This can boost performance during bursts. The following example raises the limit to 128 MB (134217728 bytes).

--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hbase,Args=["-s","hbase.hregion.memstore.flush.size=134217728"]
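
Because these size settings are given in bytes, it can help to compute the value rather than type it by hand. The following is a minimal sketch, assuming 1 MB means 1024 * 1024 bytes; mb_to_bytes is a hypothetical helper.

```shell
# Hypothetical helper: convert a size in MB (1 MB = 1024 * 1024 bytes) to bytes.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 1024   # hbase.hregion.max.filesize example above: 1073741824
mb_to_bytes 128    # hbase.hregion.memstore.flush.size example above: 134217728
mb_to_bytes 50     # earlier hbase.hregion.max.filesize example: 52428800
```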