Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Launch an HBase Cluster on Amazon EMR

When you launch HBase on Amazon EMR, you get the benefits of running in the Amazon Web Services (AWS) cloud: easy scaling, low cost, paying only for what you use, and ease of use. The EMR team has tuned HBase to run optimally on AWS. For more information about HBase and running it on Amazon EMR, see Store Data with HBase.

The following procedure shows how to launch an HBase cluster with the default settings. If your application needs custom settings, you can configure HBase as described in Configure HBase.

Note

You can configure HBase only at launch time.

For production environments, we recommend that you launch HBase on one cluster and launch any analysis tools, such as Hive, on a separate cluster. This ensures that HBase has ready access to the CPU and memory resources it requires.

To launch an HBase cluster using the console

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create New Job Flow.

  3. In the DEFINE JOB FLOW page, enter the following:

    1. Enter a name in the Job Flow Name field.

      We recommend that you use a descriptive name. It does not need to be unique.

    2. Select a version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Using the MapR Distribution for Hadoop.

    3. Select Run your own application.

    4. Select HBase in the drop-down list.

    5. Click Continue.

  4. In the SPECIFY PARAMETERS page, indicate whether you want to preload the HBase cluster with data stored in Amazon S3 and whether you want to schedule regular backups of your HBase cluster. Use the following table for guidance on making your selections. For more information about backing up and restoring HBase data, see Back Up and Restore HBase. When you are finished making selections, click Continue.

    Field | Action
    Restore from Backup | Specify whether to preload the HBase cluster with data stored in Amazon S3.
    Backup Location* | Specify the URI where the backup to restore from resides in Amazon S3.
    Backup Version | Optionally, specify the version name of the backup at Backup Location to use. If you leave this field blank, Amazon EMR uses the latest backup at Backup Location to populate the new HBase cluster.
    Schedule Regular Backups | Specify whether to schedule automatic incremental backups. The first backup will be a full backup to create a baseline for future incremental backups.
    Consistent Backup* | Specify whether the backups should be consistent. A consistent backup pauses write operations during the initial backup stage, which synchronizes data across nodes. Paused write operations are placed in a queue and resume when synchronization completes.
    Backup Frequency* | The number of Days/Hours/Minutes between scheduled backups.
    Backup Location* | The Amazon S3 URI where backups will be stored. The backup location for each HBase cluster should be different to ensure that differential backups stay correct.
    Backup Start Time* | Specify when the first backup should occur. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2012-06-15T20:00Z would set the start time to June 15, 2012 at 8pm UTC.
    Install Additional Packages | Optionally, add Hive or Pig to the HBase cluster. Because of performance considerations, best practice is to run HBase on one cluster and Hive or Pig on a different cluster. For testing purposes, however, you may wish to run Hive or Pig on the same cluster as HBase.

    * Required parameter

  5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following table as a guide, and then click Continue.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Field | Action
    Instance Count | Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each cluster. You can specify the number of core and task nodes.
    Instance Type | Specify the EC2 instance types to use as master, core, and task nodes. Valid types are m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hi1.4xlarge, hs1.8xlarge, or cc2.8xlarge. The cc2.8xlarge instance type is only available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) region. The hi1.4xlarge instance type is only supported in the US East (Northern Virginia) and EU (Ireland) regions.
    Request Spot Instances | Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    * Required parameter

  6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table as a guide, and then click Continue.

    Field | Action
    Amazon EC2 Key Pair | Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File. If you do not enter a value in this field, you cannot use SSH to connect to the master node.
    Amazon VPC Subnet Id | Optionally, specify a VPC subnet identifier to launch the cluster in an Amazon VPC. Set this only if you need to launch the cluster into a specific VPC subnet; otherwise, leave it set to the default, No preference. For more information about how Amazon VPC integrates with Amazon EMR, see Select a Amazon VPC Subnet for the Cluster (Optional).
    Amazon S3 Log Path | Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.
    Enable debugging | Select Yes to store the log files generated by Amazon EMR. You must enable debugging at this level if you want to store these log files. If you select Yes, you must supply an Amazon S3 bucket name where Amazon EMR can upload your log files. For more information, see Troubleshoot a Cluster. Important: You can enable debugging for a cluster only when you initially create the cluster.
    Keep Alive | Select Yes to cause the cluster to continue running when all processing is completed.
    Termination Protection | Select Yes to ensure the cluster is not shut down due to accident or error. For more information, see Protect a Cluster from Termination.
    Visible To All IAM Users | Select Yes to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions.

  7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue.

    For more information about bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional).

  8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct.

    After you click Create Job Flow, your request is processed; when it succeeds, a message appears.

  9. Click Close.

    The Amazon EMR console shows the new cluster starting. Starting a new cluster may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the cluster's progress.
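
Several constraints from the console procedure can be checked before you launch: the ISO format for Backup Start Time, the per-region instance type restrictions, and the default limit of 20 nodes per account. The following is a minimal Python sketch of such checks, not part of Amazon EMR. The function names are hypothetical, and the region codes (us-east-1, us-west-2, eu-west-1 for Northern Virginia, Oregon, and Ireland) are our mapping of the region names used above.

```python
from datetime import datetime, timezone

# Instance types listed in this guide as valid for HBase clusters.
VALID_INSTANCE_TYPES = {
    "m1.large", "m1.xlarge", "c1.xlarge", "m2.2xlarge", "m2.4xlarge",
    "cc1.4xlarge", "hi1.4xlarge", "hs1.8xlarge", "cc2.8xlarge",
}

# Region restrictions from this guide. The region codes are an assumed
# mapping of the region names; unlisted types are treated as unrestricted.
REGION_RESTRICTIONS = {
    "cc2.8xlarge": {"us-east-1", "us-west-2", "eu-west-1"},
    "cc1.4xlarge": {"us-east-1"},
    "hs1.8xlarge": {"us-east-1"},
    "hi1.4xlarge": {"us-east-1", "eu-west-1"},
}

DEFAULT_NODE_LIMIT = 20  # default maximum number of nodes per AWS account


def parse_backup_start(value):
    """Return None for "now", else a UTC datetime for e.g. 2012-06-15T20:00Z."""
    if value == "now":
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%MZ")
    return parsed.replace(tzinfo=timezone.utc)


def instance_type_allowed(instance_type, region):
    """Check an instance type against the valid list and region restrictions."""
    if instance_type not in VALID_INSTANCE_TYPES:
        return False
    allowed_regions = REGION_RESTRICTIONS.get(instance_type)
    return allowed_regions is None or region in allowed_regions


def fits_node_limit(running_nodes, requested_nodes, limit=DEFAULT_NODE_LIMIT):
    """True if the new cluster keeps the account at or under the node limit."""
    return running_nodes + requested_nodes <= limit


if __name__ == "__main__":
    print(parse_backup_start("2012-06-15T20:00Z"))
    print(instance_type_allowed("cc1.4xlarge", "eu-west-1"))
    print(fits_node_limit(running_nodes=12, requested_nodes=8))
```

If your account limit has been raised, pass the new value as limit instead of relying on the default.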


To launch an HBase cluster using the CLI

  • Specify --hbase when you launch a cluster using the CLI.

    The following example shows how to launch a cluster running HBase from the CLI. We recommend that you run at least two instances in the HBase cluster. The --instance-type parameter must be one of the following: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hi1.4xlarge, hs1.8xlarge, or cc2.8xlarge. The cc2.8xlarge instance type is only available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) region. The hi1.4xlarge instance type is only supported in the US East (Northern Virginia) and EU (Ireland) regions.

    The CLI implicitly launches the HBase cluster with keep alive and termination protection set.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "$USER HBase Cluster" \
      --num-instances 2 \
      --instance-type cc1.4xlarge
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "%USERNAME% HBase Cluster" --num-instances 2 --instance-type cc1.4xlarge
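
If you launch clusters from a script, it can help to build the CLI arguments programmatically. The following Python sketch assembles the same command shown above, using only the flags from this guide (--create, --hbase, --name, --num-instances, --instance-type). The function name and its defaults are hypothetical, and the sketch only builds the command; it does not run it.

```python
import shlex
import warnings


def hbase_create_command(name, num_instances=2, instance_type="m1.large",
                         cli="./elastic-mapreduce"):
    """Build the argument list for launching an HBase cluster with the EMR CLI."""
    if num_instances < 2:
        # The guide recommends, but does not require, at least two instances.
        warnings.warn("This guide recommends at least two instances for HBase.")
    return [
        cli, "--create", "--hbase",
        "--name", name,
        "--num-instances", str(num_instances),
        "--instance-type", instance_type,
    ]


if __name__ == "__main__":
    argv = hbase_create_command("My HBase Cluster", instance_type="cc1.4xlarge")
    # shlex.join quotes the cluster name so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Passing the list form to subprocess.run avoids shell quoting problems with cluster names that contain spaces.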

To launch an HBase cluster using the API

  • You need to run the setup-hbase bootstrap action when you launch HBase using the API in order to install and configure HBase on the cluster. You also need to add a step to start the HBase master. Both are shown in the following example. Replace the region, us-east-1, with the region in which to launch the cluster. For a list of regions supported by Amazon EMR, see Choose an AWS Region.

    https://us-east-1.elasticmapreduce.amazonaws.com?
    Operation=RunJobFlow&
    Name=HBase Cluster&
    LogUri=s3://myawsbucket/subdir&
    Instances.MasterInstanceType=m1.xlarge&
    Instances.SlaveInstanceType=m1.xlarge&
    Instances.InstanceCount=4&
    Instances.Ec2KeyName=myec2keyname&
    Instances.Placement.AvailabilityZone=us-east-1a&
    Instances.KeepJobFlowAliveWhenNoSteps=true&
    BootstrapActions.member.1.Name=InstallHBase&
    BootstrapActions.member.1.ScriptBootstrapAction.Path=s3://us-east-1.elasticmapreduce/bootstrap-actions/setup-hbase&
    Steps.member.1.Name=StartHBase&
    Steps.member.1.ActionOnFailure=CANCEL_AND_WAIT&
    Steps.member.1.HadoopJarStep.Jar=/home/hadoop/lib/hbase-0.92.0.jar&
    Steps.member.1.HadoopJarStep.Args.member.1=emr.hbase.backup.Main&
    Steps.member.1.HadoopJarStep.Args.member.2=--start-master&
    AWSAccessKeyId=AccessKeyID&
    SignatureVersion=2&
    SignatureMethod=HmacSHA256&
    Timestamp=2009-01-28T21%3A48%3A32.000Z&  
    Signature=calculated value
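
The Signature value in the request above is calculated from the other parameters. The following Python sketch shows one way to compute it, assuming the standard AWS Signature Version 2 procedure with HmacSHA256: sort the parameters by name, percent-encode names and values, build the string to sign from the HTTP method, host, path, and query string, and Base64-encode the HMAC. The function names, the "/" path, the sample secret key, and the endpoint host are assumptions for illustration; only a few of the request parameters are included to keep the example short.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def canonical_query(params):
    """Sort parameters by name and percent-encode names and values (RFC 3986)."""
    pairs = sorted(params.items())
    return "&".join(
        f"{quote(key, safe='-_.~')}={quote(str(value), safe='-_.~')}"
        for key, value in pairs
    )


def sign_v2(params, secret_key, host, path="/", method="GET"):
    """Return the Base64-encoded HMAC-SHA256 signature for the request."""
    string_to_sign = "\n".join(
        [method, host.lower(), path, canonical_query(params)]
    )
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def signed_url(params, secret_key, host, path="/"):
    """Build the full request URL with the Signature parameter appended."""
    signature = quote(sign_v2(params, secret_key, host, path), safe="-_.~")
    return f"https://{host}{path}?{canonical_query(params)}&Signature={signature}"


if __name__ == "__main__":
    request = {
        "Operation": "RunJobFlow",
        "Name": "HBase Cluster",
        "Instances.InstanceCount": 4,
        "AWSAccessKeyId": "AccessKeyID",   # placeholder, as in the example above
        "SignatureVersion": 2,
        "SignatureMethod": "HmacSHA256",
        "Timestamp": "2009-01-28T21:48:32.000Z",
    }
    # Assumed endpoint host and a dummy secret key for demonstration only.
    print(signed_url(request, "example-secret-key",
                     "us-east-1.elasticmapreduce.amazonaws.com"))
```

Encoding the timestamp this way yields 2009-01-28T21%3A48%3A32.000Z, the form used in the request above. In practice an AWS SDK handles signing for you, so hand-rolled code like this is mainly useful for understanding or debugging raw query requests.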