Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Build Binaries Using Amazon EMR

You can use Amazon Elastic MapReduce (Amazon EMR) as a build environment to compile programs for use in your cluster. Programs that you use with Amazon EMR must be compiled on a system running the same version of Linux used by Amazon EMR. For a 32-bit version (m1.small and m1.medium), compile on a 32-bit machine or with 32-bit cross-compilation options turned on. For a 64-bit version, compile on a 64-bit machine or with 64-bit cross-compilation options turned on. For more information about EC2 instance versions, go to Virtual Server Configurations. Supported programming languages include C++, Cython, and C#.
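
As a rough illustration of the 32-bit versus 64-bit rule, the following shell sketch maps an instance type to a compiler word-size flag. The function name `gcc_arch_flag` is hypothetical, the list of 32-bit instance types is limited to the two named above, and `-m32`/`-m64` are GCC options; other toolchains may use different flags.

```shell
#!/bin/sh
# Hypothetical helper: print the GCC word-size flag for an instance type.
# Assumption: only m1.small and m1.medium are 32-bit, as stated in this guide.
gcc_arch_flag() {
    case "$1" in
        m1.small|m1.medium) echo "-m32" ;;
        *)                  echo "-m64" ;;
    esac
}

# Example (not run here): compile a hypothetical mapper.cpp for m1.large.
# g++ $(gcc_arch_flag m1.large) -O2 -o mapper mapper.cpp
```

Building directly on a cluster that uses the target instance type avoids the need for cross-compilation flags altogether, which is the approach the rest of this section follows.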

The following table outlines the steps to build and test your application using Amazon EMR.

Process for Building a Module

1 Create a long-running cluster.
2 Connect to the master node of your cluster.
3 Copy source files to the master node.
4 Build binaries with any necessary optimizations.
5 Copy binaries from the master node to Amazon S3.
6 Terminate the cluster.

The details for each of these steps are covered in the sections that follow.

To create a long-running cluster using the AWS CLI

  • Type the following command to launch a long-running cluster:

    aws emr create-cluster --no-auto-terminate --name string --applications Name=string \
    --instance-groups InstanceGroupType=string,InstanceType=string,InstanceCount=integer

    For example, this command launches a cluster with 1 core node and with Hive installed:

    aws emr create-cluster --no-auto-terminate --name "Long Running Cluster" --applications Name=Hive \
    --instance-groups InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,InstanceType=m3.xlarge,InstanceCount=1

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To create a long-running cluster using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • Create a long-running, single-node cluster.

    In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Interactive Cluster" \
      --num-instances=1 --master-instance-type=m1.large --hive-interactive
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Interactive Cluster" --num-instances=1 --master-instance-type=m1.large --hive-interactive

    The output looks similar to:

    Created jobflow JobFlowID
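
If you script the launch, you may want to capture the job flow ID for the later terminate step. The following is a minimal sketch, assuming the CLI prints a single line of the form shown above with the ID as the third whitespace-separated field. The function name `jobflow_id` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "Created jobflow <ID>" on stdin and print <ID>.
# Lines that do not match that pattern are ignored.
jobflow_id() {
    awk '$1 == "Created" && $2 == "jobflow" { print $3 }'
}

# Example use (not run here):
# JOBFLOW=$(./elastic-mapreduce --create --alive --name "Interactive Cluster" \
#     --num-instances=1 --master-instance-type=m1.large --hive-interactive | jobflow_id)
```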

To connect to the master node of the cluster

To copy source files to the master node

  1. Put your source files in an Amazon S3 bucket. To learn how to create buckets and how to move data into Amazon S3, go to the Amazon Simple Storage Service Getting Started Guide.

  2. Create a folder on your Hadoop cluster for your source files by entering a command similar to the following:

    mkdir SourceFiles
  3. Copy your source files from Amazon S3 to the master node by typing a command similar to the following:

    hadoop fs -get s3://mybucket/SourceFiles SourceFiles

Build binaries with any necessary optimizations

How you build your binaries depends on many factors. Follow the instructions for your specific build tools to set up and configure your environment. You can use Hadoop system specification commands to obtain cluster information to determine how to install your build environment.

To identify system specifications

  • Use the following commands to verify the architecture you are using to build your binaries.

    1. To view the version of Debian, enter the following command:

      master$ cat /etc/issue

      The output looks similar to the following.

      Debian GNU/Linux 5.0
    2. To view the host name, kernel version, and processor architecture (for example, x86_64 for a 64-bit system), enter the following command:

      master$ uname -a

      The output looks similar to the following.

      Linux domU-12-31-39-17-29-39.compute-1.internal 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 GNU/Linux
    3. To view the processor model, speed, and supported features, enter the following command:

      master$ cat /proc/cpuinfo

      The output looks similar to the following.

      processor : 0
      vendor_id : GenuineIntel
      model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
      flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
      ... 
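
The architecture check above can be folded into a small script. The sketch below classifies the machine string reported by `uname -m` as 32-bit or 64-bit. The function name `arch_bits` is hypothetical, and the list of recognized machine strings is an assumption that is not exhaustive.

```shell
#!/bin/sh
# Hypothetical helper: map a `uname -m` machine string to a word size.
arch_bits() {
    case "$1" in
        x86_64|amd64)        echo 64 ;;
        i386|i486|i586|i686) echo 32 ;;
        *)                   echo unknown ;;
    esac
}

# Report the word size of the node this script runs on.
echo "Word size of this node: $(arch_bits "$(uname -m)")"
```

Run on the master node, this gives a quick confirmation that the build host matches the word size of the instances that will run the binaries.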

Once your binaries are built, you can copy the files to Amazon S3.

To copy binaries from the master node to Amazon S3

  • Type the following command to copy the binaries to your Amazon S3 bucket:

    hadoop fs -put BinaryFiles s3://mybucket/BinaryDestination

To terminate the cluster using the AWS CLI

  • Type the following command to terminate the cluster:

    aws emr terminate-clusters --cluster-ids clusterId

    Important

    Terminating a cluster deletes all files and executables saved to the cluster. Remember to save all required files before terminating a cluster.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To terminate the cluster using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --terminate JobFlowID
    • Windows users:

      ruby elastic-mapreduce --terminate JobFlowID

    Important

    Terminating a cluster deletes all files and executables saved to the cluster. Remember to save all required files before terminating a cluster.