Tutorial: Launch an Amazon EMR Cluster with MapR M7

This tutorial guides you through launching an Amazon EMR cluster featuring the M7 edition of the MapR distribution for Hadoop. The MapR distribution for Hadoop is a complete Hadoop distribution that provides many unique capabilities, such as industry-standard NFS and ODBC interfaces, end-to-end management, high reliability, and automatic compression. You can manage a MapR cluster through the web-based MapR Control System, the command line, or a REST API. M7 provides enterprise-grade capabilities such as high availability, snapshot and mirror volumes, and native MapR table functionality on MapR-FS, enabling responsive HBase-style flat table databases compatible with snapshots and mirroring. It provides a single platform for storing and processing unstructured and structured data, integrated with existing infrastructure, applications, and tools.

Note

To use the commands in this tutorial, download and install the AWS CLI. For more information, see Installing the AWS CLI in the AWS Command Line Interface User Guide.

  1. To launch a cluster with the M7 edition of MapR, type the following command, replacing myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "MapR cluster" --applications Name=Hive Name=Pig \
        Name=MapR,Args=--edition,m7,--version,4.0.2 --ami-version 3.3.2 \
        --use-default-roles --ec2-attributes KeyName=myKey \
        --instance-type m1.xlarge --instance-count 3
    • Windows users:

      aws emr create-cluster --name "MapR cluster" --applications Name=Hive Name=Pig Name=MapR,Args=--edition,m7,--version,3.1.1 --ami-version 2.4 --use-default-roles --ec2-attributes KeyName=myKey --instance-type m1.xlarge --instance-count 3

    Note

    The versions of MapR available in Amazon EMR do not currently support Hadoop 2.x. When specifying the --ami-version, use a Hadoop 1.x AMI.

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.
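
To make the node layout explicit, the same split can be written with --instance-groups. The following is a minimal sketch rather than a definitive recipe: instance_groups is a hypothetical helper name (not part of the AWS CLI), and it assumes the AWS CLI shorthand syntax for instance groups, with one master node and the remaining instances as core nodes.

```shell
# Hypothetical helper (not part of the AWS CLI): print --instance-groups
# values equivalent to "--instance-type TYPE --instance-count TOTAL",
# that is, one master node and TOTAL-1 core nodes of the same type.
instance_groups() {
  total=$1
  itype=$2
  echo "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$itype" \
       "InstanceGroupType=CORE,InstanceCount=$((total - 1)),InstanceType=$itype"
}

# Example usage (the substitution is left unquoted on purpose so that the
# shell passes the two groups as separate arguments):
#   aws emr create-cluster ... --instance-groups $(instance_groups 3 m1.xlarge)
```

Replacing --instance-type and --instance-count with this form should launch the same three-node layout, while leaving room to give the master and core groups different instance types later.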

    Note

    If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    After you run the command, the cluster takes between five and ten minutes to start. The aws emr list-clusters command shows your cluster in the STARTING and BOOTSTRAPPING states before entering the WAITING state.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.
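
Rather than rerunning list-clusters by hand, the wait can be scripted. This is a sketch under stated assumptions: get_state and wait_for_waiting are hypothetical helper names, the state is read with aws emr describe-cluster and a --query filter on Cluster.Status.State, and the default 30-second polling interval is an arbitrary choice.

```shell
# Hypothetical helper: print the current state of a cluster.
# Assumes the AWS CLI is installed and configured.
get_state() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.Status.State' --output text
}

# Hypothetical helper: poll until the cluster reaches WAITING.
# $1 = cluster ID, $2 = seconds between polls (default 30, an assumption).
# Returns 0 on WAITING, 1 if the cluster terminates or the lookup fails.
wait_for_waiting() {
  cluster_id=$1
  interval=${2:-30}
  while :; do
    state=$(get_state "$cluster_id") || return 1
    echo "state: $state"
    case $state in
      WAITING) return 0 ;;
      TERMINATING|TERMINATED|TERMINATED_WITH_ERRORS) return 1 ;;
    esac
    sleep "$interval"
  done
}

# Example usage:
#   wait_for_waiting j-2AL4XXXXXX5T9 && echo "cluster is ready"
```

Keeping the state lookup in its own function means the loop can be exercised without a live cluster by redefining get_state.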

  2. Retrieve the cluster ID and then use SSH to connect to the cluster. In the following commands, replace j-2AL4XXXXXX5T9 with your cluster ID, and replace ~/mykeypair.key with the path to and filename of your key pair private key file.

    aws emr list-clusters
    aws emr ssh --cluster-id j-2AL4XXXXXX5T9 --key-pair-file ~/mykeypair.key

    Note

    For more information about accessing a cluster with SSH, see Connect to the Master Node Using SSH.
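
If several clusters are listed, the ID can be picked out by name instead of being copied by hand. The sketch below is illustrative only: cluster_id_by_name is a hypothetical helper name, and it assumes the cluster was created with the name used earlier ("MapR cluster") and that the AWS CLI's --active and --query options are available in your CLI version.

```shell
# Hypothetical helper: print the IDs of active clusters with the given name.
# The name is embedded in a JMESPath filter, so this simple version assumes
# the name contains no single quotes.
cluster_id_by_name() {
  aws emr list-clusters --active \
    --query "Clusters[?Name=='$1'].Id" --output text
}

# Example usage:
#   id=$(cluster_id_by_name "MapR cluster")
#   aws emr ssh --cluster-id "$id" --key-pair-file ~/mykeypair.key
```

If more than one active cluster shares the name, the helper prints every matching ID, so check the output before passing it to aws emr ssh.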

  3. MapR provides volumes as a way to organize data and manage cluster performance. A volume is a logical unit that allows you to apply policies to a set of files, directories, and sub-volumes. You can use volumes to enforce disk usage limits, set replication levels, establish ownership and accountability, and measure the cost generated by different projects or departments. Create a volume for each user, department, or project. You can mount volumes under other volumes to build a structure that reflects the needs of your organization. The volume structure defines how data is distributed across the nodes in your cluster.

    Run the following command after connecting to your cluster over SSH to create a volume:

    $ maprcli volume create -name tables -replicationtype low_latency -path /tables
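
To follow the advice of creating one volume per user, department, or project, the command above can be repeated in a loop. This is a minimal sketch: create_project_volumes and the RUN variable are hypothetical names, the project names in the usage line are examples, and only the maprcli options already shown in this tutorial are used. By default the function prints the commands so that they can be reviewed first.

```shell
# Hypothetical helper: create one volume per name, mounted at /<name>.
# Prints the commands by default; set RUN=1 to execute them on the cluster.
create_project_volumes() {
  for name in "$@"; do
    if [ "${RUN:-0}" = 1 ]; then
      maprcli volume create -name "$name" -replicationtype low_latency -path "/$name"
    else
      echo "maprcli volume create -name $name -replicationtype low_latency -path /$name"
    fi
  done
}

# Example usage (dry run, then for real):
#   create_project_volumes sales engineering
#   RUN=1 create_project_volumes sales engineering
```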
  4. The M7 Edition of the MapR distribution for Hadoop enables you to create and manipulate tables in many of the same ways that you create and manipulate files in a standard UNIX file system. In the M7 Edition, tables share the same namespace as files and are specified with full path names. You can create mappings between the table names used in Apache HBase and M7 Edition's native tables.

    1. Create a table with the following command:

      $ echo "create '/tables/user_table', 'family' " | hbase shell
    2. List tables with the following command:

      $ hadoop fs -ls /tables
      Found 1 items
      trwxr-xr-x 3 root root 2 2013-04-16 22:49 /tables/user_table
      $ ls /mapr/MapR_EMR.amazonaws.com/tables
      user_table
    3. Move or rename tables using the following command:

      hadoop fs -mv /tables/user_table /tables/usertable
  5. A snapshot is a read-only image of a volume at a specific point in time. Snapshots are useful whenever you need to roll back to a known good data set.

    1. From the HBase shell, add a row to the newly created table:

      $ hbase shell
      hbase(main):001:0> put '/tables/usertable', 'row_1' , 'family:child', 'username'
      output: 0 row(s) in 0.0920 seconds
    2. Create the snapshot:

      $ maprcli volume snapshot create -volume tables -snapshotname mysnap
    3. Change the table:

      hbase(main):002:0> put '/tables/usertable', 'row_1' , 'family:location', 'San Jose'
    4. Snapshots are stored in a .snapshot directory. Scan the table from the snapshot and the current table to see the difference:

      hbase shell > scan '/tables/usertable'
      ROW      COLUMN+CELL
      row_1    column=family:child, timestamp=1366154016042, value=username
      row_1    column=family:location, timestamp=1366154055043, value=San Jose
      1 row(s) in 0.2910 seconds

      scan '/tables/.snapshot/mysnap/usertable'
      ROW      COLUMN+CELL
      row_1    column=family:child, timestamp=1366154016042, value=username
      1 row(s) in 0.2320 seconds
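
The snapshot path in the scan follows a fixed pattern: the volume's mount path, then .snapshot, then the snapshot name, then the table name. As a small sketch of that pattern (snapshot_table_path is a hypothetical helper name, and the layout is assumed from the example in this step):

```shell
# Hypothetical helper: build the path of a table inside a volume snapshot.
# $1 = volume mount path, $2 = snapshot name, $3 = table name in the volume.
snapshot_table_path() {
  # ${1%/} strips one trailing slash so "/tables" and "/tables/" both work.
  printf '%s/.snapshot/%s/%s\n' "${1%/}" "$2" "$3"
}

# Example usage on the cluster:
#   echo "scan '$(snapshot_table_path /tables mysnap usertable)'" | hbase shell
```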
  6. Test high availability:

    1. List the current regions on your system.

      $ maprcli table region list -path /tables/usertable
      secondarynodes  scans  primarynode  puts  startkey  gets  lastheartbeat  endkey
      ip-10-191-5-21.ec2.internal, ip-10-68-37-140.ec2.internal  ...  ip-10-4-74-175.ec2.internal  ...  -INFINITY  ...  0  INFINITY
    2. Restart the primary node for one of the regions. Make sure that the primary node is not the access point to the cluster. Restarting your access point will result in loss of cluster access and terminate your YCSB client.

      Connect to the cluster with SSH and restart the node with the following command:

      $ ssh -i /Users/username/downloads/MyKey_Context.pem hadoop@ec2-23-20-100-174.compute-1.amazonaws.com
      $ sudo /sbin/reboot

      Note

      The restart takes 15 to 30 seconds to complete.

    3. After the restart is complete, list your new regions to see that the former primary node is now listed as secondary.

      $ maprcli table region list -path /tables/usertable
      secondarynodes  scans  primarynode  puts  startkey  gets  lastheartbeat  endkey
      ip-10-191-5-21.ec2.internal, ip-10-68-37-140.ec2.internal  ...  ip-10-4-74-175.ec2.internal  ...  -INFINITY  ...  0  INFINITY
  7. To open the MapR Control System page, navigate to the address https://hostname.compute-1.amazonaws.com:8453. The username and password for the default installation are both hadoop. The URL for your node's hostname appears in the message-of-the-day that displays when you first log in to the node over SSH.

    The Nodes view displays the nodes in the cluster, by rack. The Nodes view contains two panes: the Topology pane and the Nodes pane. The Topology pane shows the racks in the cluster. Selecting a rack displays that rack's nodes in the Nodes pane to the right. Selecting Cluster displays all the nodes in the cluster. Clicking any column name sorts data in ascending or descending order by that column. If your YCSB job is still running, you can see the put streams continuing from the Nodes view.