
Adding a Spark Step

You can use Amazon EMR steps (see Steps in the Amazon EMR Management Guide) to submit work to the Spark framework installed on an EMR cluster. In the console and CLI, you do this with a Spark application step, which runs the spark-submit script as a step on your behalf. With the API, you use a step to invoke spark-submit using command-runner.jar.

For more information about submitting applications to Spark, see the Submitting Applications topic in the Apache Spark documentation.

Note

If you choose to deploy work to Spark using the client deploy mode, your application files must be in a local path on the EMR cluster. You cannot currently use S3 URIs for this location in client mode. However, you can use S3 URIs with cluster deploy mode.
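
One way to satisfy the client mode requirement is to copy the application from Amazon S3 to the master node before the Spark step that uses it runs, for example with a command-runner.jar step that calls the AWS CLI. The following is a minimal sketch; the cluster ID, bucket, and application name are placeholders:

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
--steps Type=CUSTOM_JAR,Name="Copy application to master",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[aws,s3,cp,s3://mybucket/myapp.jar,/home/hadoop/]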

To submit a Spark step using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. In the Cluster List, choose the name of your cluster.

  3. Scroll to the Steps section and expand it, then choose Add step.

  4. In the Add Step dialog box:

    • For Step type, choose Spark application.

    • For Name, accept the default name (Spark application) or type a new name.

    • For Deploy mode, choose Cluster or Client mode. Cluster mode launches your driver program on the cluster (for JVM-based programs, this is main()), while client mode launches the driver program locally. For more information, see Cluster Mode Overview in the Apache Spark documentation.

      Note

      Cluster mode allows you to submit work using S3 URIs. Client mode requires that you put the application in the local file system on the cluster master node.

    • Specify the desired Spark-submit options. For more information about spark-submit options, see Launching Applications with spark-submit.

    • For Application location, specify the local or S3 URI path of the application.

    • For Arguments, leave the field blank.

    • For Action on failure, accept the default option (Continue).

  5. Choose Add. The step appears in the console with a status of Pending.

  6. The status of the step changes from Pending to Running to Completed as the step runs. To update the status, choose the Refresh icon above the Actions column.

  7. If you have logging configured, the results of the step appear on the Cluster Details page in the Amazon EMR console, next to your step under Log Files. You can also find step information in the log bucket that you configured when you launched the cluster.

To submit work to Spark using the AWS CLI

Submit a step when you create the cluster, or add a step to a running cluster with the aws emr add-steps subcommand.

  1. Use create-cluster as shown in the following example.

    aws emr create-cluster --name "Add Spark Step Cluster" --release-label emr-5.6.0 --applications Name=Spark \
    --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
    --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/lib/spark-examples.jar,10] --use-default-roles

    An alternative using command-runner.jar:

    aws emr create-cluster --name "Add Spark Step Cluster" --release-label emr-5.6.0 \
    --applications Name=Spark --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
    --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-example,SparkPi,10] --use-default-roles

    Note

    Linux line continuation characters (\) are included for readability. They can be removed or left in place in Linux commands. For Windows, remove them or replace them with a caret (^).

  2. Alternatively, add steps to a cluster that is already running by using add-steps.

    aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
    --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/lib/spark-examples.jar,10]

    An alternative using command-runner.jar:

    aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
    --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-example,SparkPi,10]
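
    You can also track a submitted step from the CLI by polling its state with the list-steps and describe-step subcommands. The following is a sketch; the cluster ID and step ID are placeholders:

    aws emr list-steps --cluster-id j-2AXXXXXXGAPLF
    aws emr describe-step --cluster-id j-2AXXXXXXGAPLF --step-id s-3LZC0QUT43AM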

To submit work to Spark using the SDK for Java

  • To submit work to a cluster, use a step to run the spark-submit script on your EMR cluster. Add the step using the addJobFlowSteps method in AmazonElasticMapReduceClient:

    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);

    AddJobFlowStepsRequest req = new AddJobFlowStepsRequest();
    req.withJobFlowId("j-1K48XXXXXXHCB");

    List<StepConfig> stepConfigs = new ArrayList<StepConfig>();

    // Invoke spark-submit through command-runner.jar with the application JAR and its arguments.
    HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        .withArgs("spark-submit", "--executor-memory", "1g", "--class", "org.apache.spark.examples.SparkPi", "/usr/lib/spark/lib/spark-examples.jar", "10");

    StepConfig sparkStep = new StepConfig()
        .withName("Spark Step")
        .withActionOnFailure("CONTINUE")
        .withHadoopJarStep(sparkStepConf);

    stepConfigs.add(sparkStep);
    req.withSteps(stepConfigs);
    AddJobFlowStepsResult result = emr.addJobFlowSteps(req);

View the results of the step by examining its logs. If you have enabled logging, you can do this in the AWS Management Console by choosing Steps, selecting your step, and then, for Log files, choosing either stdout or stderr. To see all of the logs available, choose View Logs.
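
If the cluster writes logs to Amazon S3, you can also retrieve step logs directly from the log bucket. The following sketch assumes the default EMR log layout of <log-uri>/<cluster-id>/steps/<step-id>/; the bucket name, cluster ID, and step ID are placeholders:

aws s3 ls s3://mybucket/logs/j-1K48XXXXXXHCB/steps/s-3LZC0QUT43AM/
aws s3 cp s3://mybucket/logs/j-1K48XXXXXXHCB/steps/s-3LZC0QUT43AM/stderr.gz .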

Overriding Spark Default Configuration Settings

You may want to override Spark default configuration values on a per-application basis. You can do this when you submit applications using a step, which essentially passes options to spark-submit. For example, you may wish to change the memory allocated to an executor process by changing spark.executor.memory. You would supply the --executor-memory switch with an argument like the following:

spark-submit --executor-memory 1g --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 10

Similarly, you can tune --executor-cores and --driver-memory. In a step, you would provide the following arguments:

--executor-memory 1g --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 10
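
For example, a step that also tunes executor cores and driver memory might pass arguments like the following; the values shown are illustrative, not recommendations:

--executor-memory 1g --executor-cores 2 --driver-memory 2g --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 10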

You can also tune settings that may not have a built-in switch using the --conf option. For more information about other settings that are tunable, see the Dynamically Loading Spark Properties topic in the Apache Spark documentation.
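
As a sketch, the same executor memory setting can be expressed with --conf, and properties without a dedicated switch, such as spark.eventLog.enabled, are passed the same way; the values shown are illustrative:

spark-submit --conf spark.executor.memory=1g --conf spark.eventLog.enabled=false --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 10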