This section covers the basics of creating a cluster using a custom JAR file in Amazon Elastic MapReduce (Amazon EMR). You'll step through how to create a cluster using a custom JAR with the Amazon EMR console, the CLI, or the Query API. Before you create your cluster, you'll need to create objects and permissions; for more information, see Prepare Input Data (Optional).
A cluster using a custom JAR file enables you to write a script to process your data using the Java programming language. The example that follows is based on the Amazon EMR sample: CloudBurst.
In this example, the JAR file is located in an Amazon S3 bucket at
s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar. All of the
data processing instructions are located in the JAR file and the script is referenced by
the main class org.myorg.WordCount. The input data is located in the Amazon
S3 bucket s3n://elasticmapreduce/samples/cloudburst/input. The output is
saved to an Amazon S3 bucket you created as part of Prepare an Output Location (Optional).
This example describes how to use the Amazon EMR console to create a cluster using a custom JAR file.
To create a cluster using a custom JAR file
Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
Click Create New Job Flow.

In the DEFINE JOB FLOW page, enter the following in the Define Job Flow section of the Create a New Job Flow dialog box:
Enter a name in the Job Flow Name field.
We recommend that you use a descriptive name. It does not need to be unique.
Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Using the MapR Distribution for Hadoop.
Select Run your own application.
Select Custom JAR in the drop-down list.
Click Continue.

In the SPECIFY PARAMETERS page, enter values in the boxes using the following table as a guide, and then click Continue.
| Field | Action |
|---|---|
| JAR Location* | Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName. |
| JAR Arguments* | Enter a list of arguments (space-separated strings) to pass to the JAR file. |
* Required parameter

In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following table as a guide, and then click Continue.
Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two clusters running, the total number of nodes running for both clusters must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.
| Field | Action |
|---|---|
| Instance Count | Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each cluster. You can specify the number of core and task nodes. |
| Instance Type | Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hi1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) region. The hi1.4xlarge instance type is only supported in the US East (Northern Virginia) and EU (Ireland) regions. |
| Request Spot Instances | Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (Optional). |
* Required parameter

In the ADVANCED OPTIONS page, set additional configuration options, using the following table as a guide, and then click Continue.
| Field | Action |
|---|---|
| Amazon EC2 Key Pair | Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File. If you do not enter a value in this field, you cannot SSH into the master node. |
| Amazon VPC Subnet Id | Optionally, specify a VPC subnet identifier to launch the cluster in an Amazon VPC. Set this only if you need to launch the cluster into a specific VPC subnet; otherwise, leave it set to the default: No preference. For more information about how Amazon VPC integrates with Amazon EMR, see Select a Amazon VPC Subnet for the Cluster (Optional). |
| Amazon S3 Log Path | Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files. |
| Enable debugging | Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshoot a Cluster. Important: You can enable debugging for a cluster only when you initially create the cluster. |
| Keep Alive | Select Yes to cause the cluster to continue running when all processing is completed. |
| Termination Protection | Select Yes to ensure the cluster is not shut down due to accident or error. For more information, see Protect a Cluster from Termination. |
| Visible To All IAM Users | Select Yes to make the cluster visible and accessible to all IAM users on the AWS account. For more information, see Configure IAM User Permissions. |

In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue.
For more information about bootstrap actions, see Create Bootstrap Actions to Install Additional Software (Optional).

In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct.
After you click Create Job Flow your request is processed; when it succeeds, a message appears.

Click Close.
The Amazon EMR console shows the new cluster starting. Starting a new cluster may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the cluster's progress.
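The field constraints described in the tables above (the BucketName/path/ScriptName form, the required space-separated arguments, the default limit of 20 nodes, and the log location that debugging needs) can be checked before you open the console. The following is a minimal sketch and not part of Amazon EMR: the function name, its parameters, and the regular expression are assumptions made for illustration.

```python
import re
import shlex

# Assumption: 20 is the default per-account node limit quoted in the note
# above; an account with a raised limit would use a different value.
DEFAULT_NODE_LIMIT = 20

# Hypothetical pattern for "BucketName/path/ScriptName": a bucket name, then
# at least one more path segment, with no scheme prefix and no empty segments.
_S3_PATH = re.compile(r"^[^/\s]+(/[^/\s]+)+$")


def check_job_flow_inputs(jar_location, jar_arguments, instance_count,
                          running_nodes=0, enable_debugging=False,
                          log_path=""):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not _S3_PATH.match(jar_location):
        problems.append("JAR Location must look like BucketName/path/ScriptName")
    # JAR Arguments is a required, space-separated list of strings.
    if not shlex.split(jar_arguments):
        problems.append("JAR Arguments is required")
    if instance_count < 1:
        problems.append("a cluster needs at least one (master) node")
    # The limit applies to the total across all running clusters.
    if running_nodes + instance_count > DEFAULT_NODE_LIMIT:
        problems.append("total nodes would exceed the default limit of 20")
    if enable_debugging and not log_path:
        problems.append("debugging requires an Amazon S3 log path")
    return problems


print(check_job_flow_inputs(
    "elasticmapreduce/samples/cloudburst/cloudburst.jar", "arg1 arg2", 4))
```

An empty list means the values are consistent with the rules stated on this page; it does not guarantee that Amazon EMR accepts the request.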

This section explains how to run a cluster that uses a custom JAR file.
To create a cluster using a Custom JAR
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --name "Test custom JAR" \
--jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
--arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
--arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
--arg s3n://myawsbucket/cloud \
--arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \
--arg 24 --arg 128 --arg 16
Windows users:
ruby elastic-mapreduce --create --name "Test custom JAR" --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br --arg s3n://myawsbucket/cloud --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16
Note
The individual --arg values above could also be represented as --args followed by a comma-separated list.
The output looks similar to the following.
Created cluster JobFlowID
By default, this command launches a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch clusters to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
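If you launch the CLI from a script, it can help to assemble the command as a list so that each --arg value stays a separate token. The following is a minimal sketch under that assumption. The helper name build_create_command is hypothetical, and the values 4 and m1.large are illustrative only; the flags themselves are the ones used on this page.

```python
import shlex


def build_create_command(name, jar, args, num_instances=None,
                         instance_type=None):
    """Build the argument list for `elastic-mapreduce --create`."""
    command = ["./elastic-mapreduce", "--create", "--name", name, "--jar", jar]
    # Each JAR argument becomes its own --arg pair, as in the example above.
    for value in args:
        command += ["--arg", str(value)]
    # Omit these to keep the CLI default (a single m1.small node).
    if num_instances is not None:
        command += ["--num-instances", str(num_instances)]
    if instance_type is not None:
        command += ["--instance-type", instance_type]
    return command


cloudburst_args = [
    "s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br",
    "s3n://elasticmapreduce/samples/cloudburst/input/100k.br",
    "s3n://myawsbucket/cloud",
    36, 3, 0, 1, 240, 48, 24, 24, 128, 16,
]
command = build_create_command(
    "Test custom JAR",
    "s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar",
    cloudburst_args,
    num_instances=4,           # illustrative value
    instance_type="m1.large",  # illustrative value
)
# shlex.join quotes the cluster name so the line can be pasted into a shell.
print(shlex.join(command))
```

The list can be passed directly to subprocess.run, which avoids shell quoting problems with the cluster name.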
This section describes the Amazon EMR API Query request parameters you
need to create a cluster using a custom JAR file. For an explanation of the
parameters unique to RunJobFlow, see RunJobFlow. The response includes a <JobFlowID>, which you use in
other Amazon EMR operations, such as when describing or terminating a cluster.
For this reason, it is important to store cluster IDs.
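One way to keep the cluster ID is to pull it out of the response body as soon as the request returns. The following is a minimal sketch: the sample XML is made up for illustration and is not a real Amazon EMR response, and the element is matched case-insensitively and without regard to namespace because its exact spelling is an assumption here.

```python
import xml.etree.ElementTree as ET


def extract_job_flow_id(response_xml):
    """Return the text of the first JobFlowId-like element, or None."""
    root = ET.fromstring(response_xml)
    for element in root.iter():
        # Drop any "{namespace}" prefix and compare case-insensitively, since
        # the exact spelling and namespace of the element are assumptions.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name.lower() == "jobflowid":
            return (element.text or "").strip()
    return None


# Hypothetical response body, made up for illustration only.
sample = (
    "<RunJobFlowResponse><RunJobFlowResult>"
    "<JobFlowId>j-EXAMPLE</JobFlowId>"
    "</RunJobFlowResult></RunJobFlowResponse>"
)
print(extract_job_flow_id(sample))
```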
To start a cluster specifying a JAR file, send a RunJobFlow
request similar to the following.
https://elasticmapreduce.amazonaws.com?
Operation=RunJobFlow&
Name=Test custom JAR&
LogUri=s3://myawsbucket/subdir&
Instances.MasterInstanceType=m1.small&
Instances.SlaveInstanceType=m1.small&
Instances.InstanceCount=4&
Instances.Ec2KeyName=myec2keyname&
Instances.Placement.AvailabilityZone=us-east-1a&
Instances.KeepJobFlowAliveWhenNoSteps=true&
Steps.member.1.Name=MyStepName&
Steps.member.1.ActionOnFailure=CONTINUE&
Steps.member.1.HadoopJarStep.Jar=s3://elasticmapreduce/samples/cloudburst/cloudburst.jar&
Steps.member.1.HadoopJarStep.MainClass=MyMainClass&
Steps.member.1.HadoopJarStep.Args.member.1=arg1&
Steps.member.1.HadoopJarStep.Args.member.2=arg2&
AWSAccessKeyId=AccessKeyID&
SignatureVersion=2&
SignatureMethod=HmacSHA256&
Timestamp=2009-01-28T21%3A48%3A32.000Z&
Signature=calculated value
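The Signature value in the request is calculated from the other parameters. The following is a minimal sketch of how that calculation might look for SignatureVersion=2 with HmacSHA256, based on the general AWS Signature Version 2 procedure (sort the parameters by name, percent-encode them, and sign the method, host, path, and query string). The function names and the secret key are hypothetical, only a subset of the request parameters is shown, and the result should be checked against the AWS signing documentation before use.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

HOST = "elasticmapreduce.amazonaws.com"


def _encode(value):
    # Percent-encode everything except the RFC 3986 unreserved characters.
    return quote(str(value), safe="-_.~")


def sign_query(params, secret_key, host=HOST, path="/", method="GET"):
    """Return a query string with a Signature Version 2 signature appended."""
    # Sort parameters by name, then join the encoded pairs with "&".
    canonical = "&".join(
        f"{_encode(k)}={_encode(v)}" for k, v in sorted(params.items())
    )
    # The string to sign is the method, host, path, and canonical query,
    # separated by newlines.
    string_to_sign = "\n".join([method, host.lower(), path, canonical])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return f"{canonical}&Signature={_encode(signature)}"


params = {
    "Operation": "RunJobFlow",
    "Name": "Test custom JAR",
    "Instances.MasterInstanceType": "m1.small",
    "Instances.SlaveInstanceType": "m1.small",
    "Instances.InstanceCount": 4,
    "Steps.member.1.Name": "MyStepName",
    "Steps.member.1.HadoopJarStep.Jar":
        "s3://elasticmapreduce/samples/cloudburst/cloudburst.jar",
    "AWSAccessKeyId": "AccessKeyID",
    "SignatureVersion": 2,
    "SignatureMethod": "HmacSHA256",
    "Timestamp": "2009-01-28T21:48:32.000Z",
}
# The secret key below is a placeholder, not a real credential.
query = sign_query(params, secret_key="hypothetical-secret-key")
print(f"https://{HOST}/?{query}")
```

Encoding happens inside sign_query, so values such as the cluster name and the timestamp are passed in unencoded form.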