Amazon EMR
Amazon EMR Release Guide

Configure Spark

You can configure Spark on Amazon EMR using the spark-defaults configuration classification. For more information about the options, see the Spark Configuration topic in the Apache Spark documentation. It is also possible to configure Spark dynamically at the time of each application submission. For more information, see Enabling Dynamic Allocation of Executors.

Spark Defaults Set By Amazon EMR

The following defaults are set by Amazon EMR regardless of whether other settings, such as spark.dynamicAllocation.enabled or maximizeResourceAllocation, are set to true:

  • spark.executor.memory

  • spark.executor.cores

Note

In Amazon EMR releases 4.4.0 and later, spark.dynamicAllocation.enabled is set to true by default.

The following table shows how Spark defaults that affect applications are set.

Spark defaults set by Amazon EMR

Setting: spark.executor.memory
Description: Amount of memory to use per executor process (for example, 1g, 2g).
Value: Configured based on the slave instance types in the cluster.

Setting: spark.executor.cores
Description: The number of cores to use on each executor.
Value: Configured based on the slave instance types in the cluster.

Setting: spark.dynamicAllocation.enabled
Description: Whether to use dynamic resource allocation, which scales the number of executors registered with an application up and down based on the workload.
Value: true (emr-4.4.0 or later)

Note

Spark Shuffle Service is automatically configured by Amazon EMR.


You can configure your executors to utilize the maximum resources possible on each node in a cluster by enabling the maximizeResourceAllocation option when creating the cluster. This option calculates the maximum compute and memory resources available for an executor on a node in the core node group and sets the corresponding spark-defaults settings with this information.
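As an illustrative sketch only, the spirit of that calculation can be expressed as follows. This is not Amazon EMR's actual algorithm; the overhead fraction and the input values are assumptions made for the example.

```python
# Illustrative sketch of maximizeResourceAllocation-style defaults.
# NOT EMR's actual algorithm; the overhead fraction and inputs are assumptions.
def max_resource_defaults(yarn_memory_mb, yarn_vcores, core_node_count,
                          overhead_fraction=0.10):
    """Derive Spark defaults that use all YARN resources on each core node."""
    # Reserve a fraction of container memory for executor overhead.
    executor_memory_mb = int(yarn_memory_mb * (1 - overhead_fraction))
    return {
        "spark.executor.cores": str(yarn_vcores),
        "spark.executor.memory": f"{executor_memory_mb}M",
        "spark.executor.instances": str(core_node_count),
        # 2X the CPU cores available to YARN containers across the cluster.
        "spark.default.parallelism": str(2 * yarn_vcores * core_node_count),
    }

# Hypothetical core node: 11520 MB and 4 vcores available to YARN, 2 core nodes.
defaults = max_resource_defaults(yarn_memory_mb=11520, yarn_vcores=4,
                                 core_node_count=2)
```

The point of the sketch is only that each executor is sized to the full YARN container capacity of a core node, and parallelism follows from the cluster-wide core count.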

Spark defaults set when maximizeResourceAllocation is enabled

Setting: spark.default.parallelism
Description: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user.
Value: 2X the number of CPU cores available to YARN containers.

Setting: spark.driver.memory
Description: Amount of memory to use for the driver process, that is, where SparkContext is initialized (for example, 1g, 2g).
Value: Configured based on the instance types in the cluster. However, because the Spark driver application may run on either the master or one of the core instances (for example, in YARN client and cluster modes, respectively), this is set based on the smaller of the instance types in these two instance groups.

Setting: spark.executor.memory
Description: Amount of memory to use per executor process (for example, 1g, 2g).
Value: Configured based on the slave instance types in the cluster.

Setting: spark.executor.cores
Description: The number of cores to use on each executor.
Value: Configured based on the slave instance types in the cluster.

Setting: spark.executor.instances
Description: The number of executors.
Value: Configured based on the slave instance types in the cluster. Set unless spark.dynamicAllocation.enabled is explicitly set to true at the same time.


Enabling Dynamic Allocation of Executors

Spark on YARN can dynamically scale the number of executors used for a Spark application. In Amazon EMR releases 4.4.0 and later, this is the default behavior.

To learn more about dynamic allocation, see the Dynamic Allocation topic in the Apache Spark documentation.
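Dynamic allocation can also be tuned per application at submission time. The following is a hedged example; the application JAR name and the executor bounds are placeholders, and the spark.dynamicAllocation.* properties are standard Apache Spark configuration properties.

    # Submit an application with dynamic allocation tuned for this job.
    # Dynamic allocation requires the external shuffle service, which
    # Amazon EMR configures automatically.
    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.maxExecutors=20 \
      my-application.jar

With these bounds, YARN starts the application with at least 2 executors and scales up to 20 as the workload demands.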

Spark ThriftServer Environment Variable

Spark sets the Hive Thrift Server Port environment variable, HIVE_SERVER2_THRIFT_PORT, to 10001.
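As a quick check, assuming the Spark Thrift Server is running and the beeline client is available on the master node, you can connect on that port:

    # Connect to the Spark Thrift Server on the default EMR port
    beeline -u jdbc:hive2://localhost:10001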

Changing Spark Default Settings

When you create a cluster, you can change the defaults in spark-defaults.conf using the spark-defaults configuration classification, or use the maximizeResourceAllocation setting in the spark configuration classification.

The following procedures show how to modify settings using the CLI or console.

To create a cluster with spark.executor.memory set to 2G using the CLI

  • Create a cluster with Spark installed and spark.executor.memory set to 2G, using the following:

    aws emr create-cluster --release-label emr-5.2.0 --applications Name=Spark \
    --instance-type m3.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --configurations file://./myConfig.json

    Note

    For Windows, replace the above Linux line continuation character (\) with the caret (^).

    myConfig.json:

    [
        {
          "Classification": "spark-defaults",
          "Properties": {
            "spark.executor.memory": "2G"
          }
        }
      ]
    

    Note

    If you plan to store your configuration in Amazon S3, you must specify the URL location of the object. For example:

    aws emr create-cluster --release-label emr-5.2.0 --applications Name=Spark \
    --instance-type m3.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

To create a cluster with spark.executor.memory set to 2G using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Create cluster.

  3. Choose Go to advanced options.

  4. For the Software Configuration field, choose Amazon EMR release emr-5.2.0 or later.

  5. Choose either Spark or All Applications from the list, then choose Configure and add.

  6. Choose Edit software settings and enter the following configuration:

    classification=spark-defaults,properties=[spark.executor.memory=2G]
  7. Select other options as necessary and then choose Create cluster.

To set maximizeResourceAllocation

  • Create a cluster with Spark installed and maximizeResourceAllocation set to true using the AWS CLI:

    aws emr create-cluster --release-label emr-5.2.0 --applications Name=Spark \
    --instance-type m3.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --configurations file://./myConfig.json

    Or using Amazon S3:

    aws emr create-cluster --release-label emr-5.2.0 --applications Name=Spark \
    --instance-type m3.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

    Note

    For Windows, replace the above Linux line continuation character (\) with the caret (^).

    myConfig.json:

    [
      {
        "Classification": "spark",
        "Properties": {
          "maximizeResourceAllocation": "true"
        }
      }
    ]