Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Launch an HBase Cluster on Amazon EMR

When you launch HBase on Amazon EMR, you get the benefits of running in the Amazon Web Services (AWS) cloud—easy scaling, low cost, pay only for what you use, and ease of use. The EMR team has tuned HBase to run optimally on AWS. For more information about HBase and running it on Amazon EMR, see Store Data with HBase.

The following procedure shows how to launch an HBase cluster with the default settings. If your application needs custom settings, you can configure HBase as described in Configure HBase.

Note

HBase configuration can only be done at launch time.
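
Because HBase can be configured only when the cluster launches, custom settings are typically supplied through a bootstrap action at creation time. The following CLI sketch assumes the configure-hbase bootstrap action and the -s,key=value argument form covered in Configure HBase; the script path, setting, and value are illustrative, so confirm them against that section before use.

    ./elastic-mapreduce --create --hbase --name "HBase cluster with custom settings" \
    --num-instances 2 --instance-type m1.large \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase \
    --args "-s,hbase.hregion.max.filesize=52428800"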

For production environments, we recommend that you launch HBase on one cluster and launch any analysis tools, such as Hive, on a separate cluster. This ensures that HBase has ready access to the CPU and memory resources it requires.
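
For example, from the CLI you might create the two clusters separately: one running HBase and one running Hive for analysis. The instance counts and types below are illustrative only, and connecting the Hive cluster to HBase requires additional configuration that is not shown here.

    ./elastic-mapreduce --create --hbase --name "Production HBase cluster" \
    --num-instances 3 --instance-type m1.xlarge

    ./elastic-mapreduce --create --alive --hive-interactive --name "Hive analysis cluster" \
    --num-instances 2 --instance-type m1.large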

To launch an HBase cluster using the console

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/vnext/.

  2. Click Create Cluster.

  3. In the Create Cluster page, in the Cluster Configuration section, verify the fields according to the following table.

    Cluster Configuration
    Field | Action
    Cluster name

    Enter a descriptive name for your cluster.

    The name is optional, and does not need to be unique.

    Termination protection

    Enabling termination protection ensures that the cluster does not shut down due to accident or error. For more information, see Protect a Cluster from Termination. Typically, set this value to Yes when you are developing an application (so that you can debug errors that would otherwise have terminated the cluster) and when you want to protect long-running clusters or clusters that contain data.

    Logging

    This determines whether Amazon EMR captures detailed log data to Amazon S3.

    For more information, see View Log Files.

    Log folder S3 location

    Enter an Amazon S3 path to store your debug logs if you enabled logging in the previous field. If the log folder does not exist, the Amazon EMR console will create it for you.

    When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files.

    Debugging

    This option creates a debug log index in SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.
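
    If you later script these same settings with the CLI instead of the console, the logging, debugging, and termination protection options map to command line flags. The following is a minimal sketch; the flag names should be confirmed against the Command Line Interface Reference for Amazon EMR, and the bucket name and job flow ID are placeholders.

      ./elastic-mapreduce --create --hbase --name "HBase cluster with logging" \
      --log-uri s3://mybucket/logs --enable-debugging \
      --num-instances 2 --instance-type m1.large

      # Turn on termination protection for a running cluster (job flow ID is a placeholder).
      ./elastic-mapreduce --set-termination-protection true --jobflow j-XXXXXXXXXXXXX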

  4. In the Software Configuration section, verify the fields according to the following table.

    Software Configuration
    Field | Action
    Hadoop distribution

    Choose Amazon.

    This determines which distribution of Hadoop to run on your cluster. You can choose to run the Amazon distribution of Hadoop or one of several MapR distributions. For more information, see Using the MapR Distribution for Hadoop.

    AMI version

    Choose 2.4.2 (Hadoop 1.0.3).

    This determines the version of Hadoop and other applications such as Hive or Pig to run on your cluster. For more information, see Choose a Machine Image.

  5. Under the Additional Applications list, choose HBase and click Configure and add.

  6. In the Add Application section, indicate whether you want to pre-load the HBase cluster with data stored in Amazon S3 and whether you want to schedule regular backups of your HBase cluster. Use the following table for guidance on making your selections. For more information about backing up and restoring HBase data, see Back Up and Restore HBase.

    Field | Action
    Restore from backup

    Specify whether to pre-load the HBase cluster with data stored in Amazon S3.

    Backup location

    Specify the Amazon S3 URI of the backup to restore from.

    Backup version

    Optionally, specify the version name of the backup at Backup location to use. If you leave this field blank, Amazon EMR uses the latest backup at Backup location to populate the new HBase cluster.

    Schedule Regular Backups

    Specify whether to schedule automatic incremental backups. The first backup is a full backup, which creates a baseline for future incremental backups.

    Consistent backup

    Specify whether the backups should be consistent. A consistent backup pauses write operations during the initial backup stage, while data is synchronized across nodes. Any write operations thus paused are placed in a queue and resume when synchronization completes.

    Backup frequency

    The number of days, hours, or minutes between scheduled backups.

    Backup location

    The Amazon S3 URI where backups are stored. The backup location for each HBase cluster should be different to ensure that differential backups remain correct.

    Backup start time

    Specify when the first backup should occur. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2012-06-15T20:00Z sets the start time to June 15, 2012 at 8 PM UTC.
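
    These backup options can also be set from the CLI when you create the cluster. The following is a minimal sketch, assuming the scheduled-backup and restore flags described in Back Up and Restore HBase; the bucket paths and start time are placeholders, and the exact flag names should be confirmed against that section.

      # Launch an HBase cluster with a weekly, consistent, scheduled backup.
      ./elastic-mapreduce --create --hbase --name "HBase cluster with backups" \
      --num-instances 2 --instance-type m1.large \
      --hbase-schedule-backup --full-backup-time-interval 7 --full-backup-time-unit days \
      --backup-dir s3://mybucket/backups/hbase \
      --start-time 2012-06-15T20:00Z --consistent

      # Launch a new HBase cluster pre-loaded from an existing backup.
      ./elastic-mapreduce --create --hbase --name "Restored HBase cluster" \
      --num-instances 2 --instance-type m1.large \
      --hbase-restore --backup-dir s3://mybucket/backups/hbase
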
  7. In the Hardware Configuration section, verify the fields according to the following table.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or less. Exceeding this limit will result in cluster failures. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. Ensure that your requested limit increase includes sufficient capacity for any temporary, unplanned increases in your needs. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Hardware Configuration
    Field | Action
    Network

    Choose the default VPC. For more information about the default VPC, see Your Default VPC and Subnets.

    Optionally, if you have created additional VPCs, you can choose your preferred VPC subnet identifier from the list to launch the cluster in that Amazon VPC. For more information, see Select an Amazon VPC Subnet for the Cluster (Optional).

    EC2 Availability Zone

    Choose No preference.

    Optionally, you can launch the cluster in a specific EC2 Availability Zone.

    For more information, see Regions and Availability Zones in the Amazon EC2 User Guide.

    Master

    Choose m1.small.

    The master node assigns Hadoop tasks to core and task nodes, and monitors their status. There is always one master node in each cluster.

    This specifies the EC2 instance types to use as master nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

    This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

    For more information, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run master nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    Core

    Choose m1.small.

    A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

    This specifies the EC2 instance types to use as core nodes. Valid types: m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

    This tutorial uses small instances for all nodes due to the light workload and to keep your costs low.

    For more information, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 2.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run core nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).

    Task

    Choose m1.small.

    Task nodes only process Hadoop tasks and don't store data. You can add and remove them from a cluster to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

    This specifies the EC2 instance types to use as task nodes. Valid types: m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.

    For more information, see Instance Groups. For information about mapping legacy clusters to instance groups, see Mapping Legacy Clusters to Instance Groups.

    Count

    Choose 0.

    Request Spot Instances

    Leave this box unchecked.

    This specifies whether to run task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional).
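
    If you later script this hardware configuration with the CLI, you can declare each instance group explicitly. The following is a minimal sketch, assuming the CLI's --instance-group option; it uses m1.large rather than m1.small because the CLI procedure later in this topic notes that HBase clusters require m1.large or larger, and the task group is omitted because its count in this tutorial is 0.

      ./elastic-mapreduce --create --hbase --name "$USER HBase Cluster" \
      --instance-group master --instance-type m1.large --instance-count 1 \
      --instance-group core --instance-type m1.large --instance-count 2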

  8. In the Security and Access section, complete the fields according to the following table.

    Security and Access
    Field | Action
    EC2 key pair

    Choose an Amazon EC2 key pair from the list.

    For more information, see Create an Amazon EC2 Key Pair and PEM File.

    Optionally, choose Proceed without an EC2 key pair. If you do not enter a value in this field, you cannot use SSH to connect to the master node. For more information, see Connect to the Cluster.

    IAM user access

    Choose No other IAM users.

    Optionally, choose All other IAM users to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions.

    IAM role

    Choose Proceed without role.

    This controls application access to the EC2 instances in the cluster.

    For more information, see Configure IAM Roles for Amazon EMR.

  9. In the Bootstrap Actions section, there are no bootstrap actions necessary for this sample configuration.

    Optionally, you can use bootstrap actions, which are scripts that can install additional software and change the configuration of applications on the cluster before Hadoop starts. For more information, see Create Bootstrap Actions to Install Additional Software (Optional).

  10. In the Steps section, you do not need to change any of these settings.

  11. Review your configuration and if you are satisfied with the settings, click Create Cluster.

  12. When the cluster starts, you see the Cluster Details page.

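    If you chose an EC2 key pair in the Security and Access section, you can connect to the master node with SSH once the cluster is running. A minimal sketch; the key file and the master public DNS name (shown on the Cluster Details page) are placeholders, and hadoop is the login user on Amazon EMR AMIs. For more information, see Connect to the Cluster.

      ssh -i mykeypair.pem hadoop@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com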

To launch an HBase cluster using the CLI

  • Specify --hbase when you launch a cluster using the CLI.

    The following example shows how to launch a cluster running HBase from the CLI. We recommend that you run at least two instances in the HBase cluster. The --instance-type parameter must be one of the following: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hi1.4xlarge, hs1.8xlarge, or cc2.8xlarge. The cc2.8xlarge instance type is only available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) region. The hi1.4xlarge instance type is only supported in the US East (Northern Virginia) and EU (Ireland) regions.

    The CLI implicitly launches the HBase cluster with keep alive and termination protection set.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --hbase --name "$USER HBase Cluster" \
      --num-instances 2 \
      --instance-type cc1.4xlarge
    • Windows users:

      ruby elastic-mapreduce --create --hbase --name "$USER HBase Cluster" --num-instances 2 --instance-type cc1.4xlarge
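
    After the create command returns a job flow ID, you can confirm that the cluster is starting from the same directory. A minimal sketch, assuming the CLI's --list and --active options:

      ./elastic-mapreduce --list --active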

To launch an HBase cluster using the API

  • You need to run the hbase-setup bootstrap action when you launch HBase using the API in order to install and configure HBase on the cluster. You also need to add a step to start the HBase master. These are shown in the following example.

    Replace the region, us-east-1, with the region in which you want to launch the cluster. For a list of regions supported by Amazon EMR, see Choose an AWS Region.

    Note

    If you are using Hadoop 2.2.0 or newer, in the following example, use hbase.jar instead of hbase-0.92.0.jar.

    https://us-east-1.elasticmapreduce.amazonaws.com?
    Operation=RunJobFlow&
    Name=HBase Cluster&
    LogUri=s3://myawsbucket/subdir&
    Instances.MasterInstanceType=m1.xlarge&
    Instances.SlaveInstanceType=m1.xlarge&
    Instances.InstanceCount=4&
    Instances.Ec2KeyName=myec2keyname&
    Instances.Placement.AvailabilityZone=us-east-1a&
    Instances.KeepJobFlowAliveWhenNoSteps=true&
    BootstrapActions.member.1.Name=InstallHBase&
    BootstrapActions.member.1.ScriptBootstrapAction.Path=s3://us-east-1.elasticmapreduce/bootstrap-actions&
    Steps.member.1.Name=StartHBase&
    Steps.member.1.ActionOnFailure=CANCEL_AND_WAIT&
    Steps.member.1.HadoopJarStep.Jar=/home/hadoop/lib/hbase-0.92.0.jar&
    Steps.member.1.HadoopJarStep.Args.member.1=emr.hbase.backup.Main&
    Steps.member.1.HadoopJarStep.Args.member.2=--start-master&
    AWSAccessKeyId=AccessKeyID&
    SignatureVersion=2&
    SignatureMethod=HmacSHA256&
    Timestamp=2009-01-28T21%3A48%3A32.000Z&  
    Signature=calculated value