Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Build Binaries Using Amazon EMR

You can use Amazon Elastic MapReduce (Amazon EMR) as a build environment to compile programs for use in your cluster. Programs that you use with Amazon EMR must be compiled on a system running the same version of Debian used by Amazon EMR. For a 32-bit version (m1.small and m1.medium instances), you should compile on a 32-bit machine or with 32-bit cross-compilation options turned on. For a 64-bit version, you should compile on a 64-bit machine or with 64-bit cross-compilation options turned on. For more information about EC2 instance versions, go to Virtual Server Configurations. Supported programming languages include C++, Cython, and C#.
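
For example, when targeting a 32-bit instance type from a 64-bit build machine, you can pass 32-bit flags to your compiler. The following is only a sketch, assuming GCC and a hypothetical source file hello.cpp; adapt the flags to your own toolchain:

      # Build a 32-bit binary (for m1.small and m1.medium) on a 64-bit host.
      g++ -m32 -O2 -o myprogram hello.cpp

      # Build a 64-bit binary (for 64-bit instance types).
      g++ -m64 -O2 -o myprogram hello.cpp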

The following procedure outlines the steps involved in building and testing your application using Amazon EMR.

Process for Building a Module

1 Create an interactive cluster.
2 Identify the cluster ID and Public DNS name of the master node.
3 SSH as the Hadoop user to the master node of your Hadoop cluster.
4 Copy source files to the master node.
5 Build binaries with any necessary optimizations.
6 Copy binaries from the master node to Amazon S3.
7 Close the SSH connection.
8 Terminate the cluster.

The details for each of these steps are covered in the sections that follow.

To create an interactive cluster

  • Create an interactive, single-node Hadoop cluster using the desired instance type:

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Interactive Cluster" \
      --num-instances=1 --master-instance-type=m1.large --hive-interactive
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Interactive Cluster" --num-instances=1 --master-instance-type=m1.large --hive-interactive

    The output looks similar to:

    Created jobflow JobFlowID	
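
    If you script the launch, you can capture the jobflow ID for use in later commands. A minimal sketch, assuming the Linux CLI and the "Created jobflow" output format shown above:

      # Capture the ID from "Created jobflow j-XXXXXXXXXXXXX" (third field).
      JOBFLOW_ID=$(./elastic-mapreduce --create --alive --name "Interactive Cluster" \
        --num-instances=1 --master-instance-type=m1.large --hive-interactive \
        | awk '{print $3}')
      echo "Launched jobflow: $JOBFLOW_ID"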

To identify the cluster ID and Public DNS name of the master node

  • Identify your cluster:

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --list
      • Windows users:

        ruby elastic-mapreduce --list

    The output looks similar to the following.

    j-SLRI9SCLK7UC          STARTING    ec2-75-101-168-82.compute-1.amazonaws.com
      Interactive Cluster  PENDING     Hive Job

    The response includes the cluster ID and the public DNS name. You use this information to connect to the master node.

    Typically you need to wait one or two minutes after launching the cluster before the public DNS name is assigned.
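
    If you are scripting these steps, you can poll the list until the public DNS name appears. A minimal sketch, assuming the Linux CLI, a jobflow ID in $JOBFLOW_ID, and the column layout shown above:

      # Poll every 30 seconds until a DNS name is reported for the jobflow.
      MASTER_DNS=""
      while [ -z "$MASTER_DNS" ]; do
        sleep 30
        MASTER_DNS=$(./elastic-mapreduce --list | grep "$JOBFLOW_ID" | awk '{print $3}')
      done
      echo "Master node: $MASTER_DNS"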

To SSH as the Hadoop user to the master node

  • Use the credentials you created for your Amazon EC2 key pair to log in to the master node:

    Instructions for creating credentials are located at Configuring Credentials.

    1. In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

      • Linux, UNIX, and Mac OS X users:

        ./elastic-mapreduce --ssh --jobflow JobFlowID
      • Windows users:

        1. Start PuTTY.

        2. Select Session in the Category list. Enter hadoop@DNS in the Host Name field. In this example, the input looks similar to hadoop@ec2-75-101-168-82.compute-1.amazonaws.com.

        3. In the Category list, expand Connection, expand SSH, and then select Auth. The Options controlling SSH authentication pane appears.

        4. Click Browse for Private key file for authentication, and select the private key file you generated earlier. If you are following this guide, the file name is mykeypair.ppk.

        5. Click OK.

        6. Click Open to connect to your master node.

        7. A PuTTY Security Alert pops up. Click Yes.

    When you successfully connect to the master node, the output looks similar to the following:

    Using username "hadoop".
    Authenticating with public key "imported-openssh-key"
    Linux domU-12-31-39-01-5C-F8 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
    --------------------------------------------------------------------------------
    
    Welcome to Amazon EMR running Hadoop and Amazon Linux.
    
    Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
    /mnt/var/log/hadoop/steps for diagnosing step failures.
    
    The Hadoop UI can be accessed via the following commands:
    
      ResourceManager    lynx http://localhost:9026/
      NameNode           lynx http://localhost:9101/
    
    --------------------------------------------------------------------------------

To copy source files to the master node

  • Copy your source files to the master node:

    1. Put your source files on Amazon S3 (see the upload sketch following this list). To learn how to create buckets and move files with Amazon S3, go to the Amazon Simple Storage Service Getting Started Guide.

    2. Create a folder on your Hadoop cluster for your source files by entering a command similar to the following:

      mkdir SourceDestination

      You now have a destination folder for your source files.

    3. Copy your source files from Amazon S3 to the Hadoop cluster by entering a command similar to the following:

      hadoop fs -get s3://myawsbucket/SourceFiles SourceDestination

      Your source files are now located in your destination folder on the master node of your Hadoop cluster.
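
    As a sketch of the upload in step 1, you could use a third-party client such as s3cmd from your local machine (an assumption; any S3 upload tool or the AWS Management Console works). The bucket and folder names are placeholders:

      # Sync a local source tree to Amazon S3 (run on your workstation).
      s3cmd sync SourceFiles/ s3://myawsbucket/SourceFiles/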

Build binaries with any necessary optimizations

How you build your binaries depends on many factors. Follow the instructions for your specific build tools to set up and configure your environment. You can use Hadoop system specification commands to obtain cluster information that helps you determine how to install your build environment.
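
As an illustration, a typical autotools-based project might be built on the master node as follows. This is only a sketch; SourceDestination is the placeholder folder from the previous procedure:

      cd SourceDestination
      ./configure
      # Run one make job per processor listed in /proc/cpuinfo.
      make -j $(grep -c ^processor /proc/cpuinfo)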

To identify system specifications

  • Use the following commands to verify the architecture you are using to build your binaries:

    1. To view the version of Debian, enter the following command:

      master$ cat /etc/issue

      The output looks similar to the following.

      Debian GNU/Linux 5.0
    2. To view the host name and processor architecture, enter the following command:

      master$ uname -a

      The output looks similar to the following.

      Linux domU-12-31-39-17-29-39.compute-1.internal 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 GNU/Linux
    3. To view the processor speed, enter the following command:

      master$ cat /proc/cpuinfo

      The output looks similar to the following.

      processor : 0
      vendor_id : GenuineIntel
      model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
      flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
      ... 
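
    You can then use this information to choose build options. A minimal sketch that selects compiler flags based on the reported architecture:

      # Pick 32- or 64-bit compiler flags based on the machine architecture.
      if [ "$(uname -m)" = "x86_64" ]; then
        CFLAGS="-m64 -O2"
      else
        CFLAGS="-m32 -O2"
      fi
      echo "Building with: $CFLAGS"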

Once your binaries are built, you can copy the files to Amazon S3.

To copy binaries from the master node to Amazon S3

  • Copy the binaries to Amazon S3 by entering the following command:

    hadoop fs -put BinaryFiles s3://myawsbucket/BinaryDestination

    Your binaries are now stored in your Amazon S3 bucket.
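
    To confirm the upload, you can list the destination using the same hadoop fs interface:

      # List the binaries now stored in the S3 bucket.
      hadoop fs -ls s3://myawsbucket/BinaryDestination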

To close the SSH connection

  • Enter the following command from the Hadoop command-line prompt:

    exit

    You are no longer connected to your cluster via SSH.

To terminate the cluster

  • In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --terminate JobFlowID
    • Windows users:

      ruby elastic-mapreduce --terminate JobFlowID

    Your cluster is terminated.

    Important

    Terminating a cluster deletes all files and executables saved to the cluster. Remember to save all required files before you terminate a cluster.