Using shuffle-optimized disks

With Amazon EMR releases 7.1.0 and higher, you can use shuffle-optimized disks when you run Apache Spark or Hive jobs to improve performance for I/O-intensive workloads. Compared to standard disks, shuffle-optimized disks provide higher IOPS (I/O operations per second) for faster data movement and reduced latency during shuffle operations. Shuffle-optimized disks let you attach disk sizes of up to 2 TB per worker, so you can configure the appropriate capacity for your workload requirements.

Key benefits

Shuffle-optimized disks provide the following benefits.

  • High IOPS performance – Shuffle-optimized disks provide higher IOPS than standard disks, for faster, more efficient data shuffling during Spark and Hive jobs and other shuffle-intensive workloads.

  • Larger disk size – Shuffle-optimized disks support disk sizes from 20 GB to 2 TB per worker, so you can choose the appropriate capacity based on your workloads.

Getting started

See the following steps to use shuffle-optimized disks in your workflows.

Spark
  1. Create an EMR Serverless release 7.1.0 application with the following command.

    aws emr-serverless create-application \ --type "SPARK" \ --name my-application-name \ --release-label emr-7.1.0 \ --region <AWS_REGION>
  2. Configure your Spark job to run with shuffle-optimized disks by setting the spark.emr-serverless.driver.disk.type parameter, the spark.emr-serverless.executor.disk.type parameter, or both to shuffle_optimized, depending on your use case.

    aws emr-serverless start-job-run \
      --application-id application-id \
      --execution-role-arn job-role-arn \
      --job-driver '{
        "sparkSubmit": {
          "entryPoint": "/usr/lib/spark/examples/jars/spark-examples.jar",
          "entryPointArguments": ["1"],
          "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1 --conf spark.emr-serverless.executor.disk.type=shuffle_optimized"
        }
      }'

    For more information, see Spark job properties.
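
     If you also want to control the disk size, you can pair the disk type parameters with the corresponding disk size parameters. The following is a minimal sketch, not a definitive configuration; it assumes the spark.emr-serverless.driver.disk and spark.emr-serverless.executor.disk properties described in Spark job properties, with illustrative sizes of 100g and 500g.

    # Sketch only: requests shuffle-optimized disks for the driver and executors
    # and sets their sizes. spark.emr-serverless.driver.disk and
    # spark.emr-serverless.executor.disk are assumed from Spark job properties;
    # adjust the sizes for your workload.
    aws emr-serverless start-job-run \
      --application-id application-id \
      --execution-role-arn job-role-arn \
      --job-driver '{
        "sparkSubmit": {
          "entryPoint": "/usr/lib/spark/examples/jars/spark-examples.jar",
          "entryPointArguments": ["1"],
          "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi --conf spark.emr-serverless.driver.disk.type=shuffle_optimized --conf spark.emr-serverless.driver.disk=100g --conf spark.emr-serverless.executor.disk.type=shuffle_optimized --conf spark.emr-serverless.executor.disk=500g"
        }
      }'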

Hive
  1. Create an EMR Serverless release 7.1.0 application with the following command.

    aws emr-serverless create-application \ --type "HIVE" \ --name my-application-name \ --release-label emr-7.1.0 \ --region <AWS_REGION>
  2. Configure your Hive job to run with shuffle-optimized disks by setting the hive.driver.disk.type parameter, the hive.tez.disk.type parameter, or both to shuffle_optimized, depending on your use case.

    aws emr-serverless start-job-run \
      --application-id application-id \
      --execution-role-arn job-role-arn \
      --job-driver '{
        "hive": {
          "query": "s3://<DOC-EXAMPLE-BUCKET>/emr-serverless-hive/query/hive-query.ql",
          "parameters": "--hiveconf hive.log.explain.output=false"
        }
      }' \
      --configuration-overrides '{
        "applicationConfiguration": [{
          "classification": "hive-site",
          "properties": {
            "hive.exec.scratchdir": "s3://<DOC-EXAMPLE-BUCKET>/emr-serverless-hive/hive/scratch",
            "hive.metastore.warehouse.dir": "s3://<DOC-EXAMPLE-BUCKET>/emr-serverless-hive/hive/warehouse",
            "hive.driver.cores": "2",
            "hive.driver.memory": "4g",
            "hive.tez.container.size": "4096",
            "hive.tez.cpu.vcores": "1",
            "hive.driver.disk.type": "shuffle_optimized",
            "hive.tez.disk.type": "shuffle_optimized"
          }
        }]
      }'

     For more information, see Hive job properties.
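
     To also size the disks, you can pair the disk type properties with the corresponding size properties. The following is a minimal sketch, not a definitive configuration; it assumes the hive.driver.disk.size and hive.tez.disk.size properties from Hive job properties, with illustrative sizes of 100g and 500g.

    # Sketch only: requests shuffle-optimized disks for the Hive driver and Tez
    # workers and sets their sizes. hive.driver.disk.size and hive.tez.disk.size
    # are assumed from Hive job properties; adjust the sizes for your workload.
    aws emr-serverless start-job-run \
      --application-id application-id \
      --execution-role-arn job-role-arn \
      --job-driver '{
        "hive": {
          "query": "s3://<DOC-EXAMPLE-BUCKET>/emr-serverless-hive/query/hive-query.ql"
        }
      }' \
      --configuration-overrides '{
        "applicationConfiguration": [{
          "classification": "hive-site",
          "properties": {
            "hive.driver.disk.type": "shuffle_optimized",
            "hive.driver.disk.size": "100g",
            "hive.tez.disk.type": "shuffle_optimized",
            "hive.tez.disk.size": "500g"
          }
        }]
      }'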

Configuring an application with pre-initialized capacity

See the following examples to create applications based on Amazon EMR release 7.1.0. These applications have the following properties:

  • 5 pre-initialized drivers, each with 2 vCPU, 4 GB of memory, and 50 GB of shuffle-optimized disk.

  • 50 pre-initialized executors, each with 4 vCPU, 8 GB of memory, and 500 GB of shuffle-optimized disk.

When this application runs jobs, it first consumes the pre-initialized workers and then scales the on-demand workers up to the maximum capacity of 400 vCPU and 1024 GB of memory. Optionally, you can omit the initial capacity for either DRIVER or EXECUTOR.

Spark
aws emr-serverless create-application \ --type "SPARK" \ --name <my-application-name> \ --release-label emr-7.1.0 \ --initial-capacity '{ "DRIVER": { "workerCount": 5, "workerConfiguration": { "cpu": "2vCPU", "memory": "4GB", "disk": "50GB", "diskType": "SHUFFLE_OPTIMIZED" } }, "EXECUTOR": { "workerCount": 50, "workerConfiguration": { "cpu": "4vCPU", "memory": "8GB", "disk": "500GB", "diskType": "SHUFFLE_OPTIMIZED" } } }' \ --maximum-capacity '{ "cpu": "400vCPU", "memory": "1024GB" }'
Hive
aws emr-serverless create-application \ --type "HIVE" \ --name <my-application-name> \ --release-label emr-7.1.0 \ --initial-capacity '{ "DRIVER": { "workerCount": 5, "workerConfiguration": { "cpu": "2vCPU", "memory": "4GB", "disk": "50GB", "diskType": "SHUFFLE_OPTIMIZED" } }, "EXECUTOR": { "workerCount": 50, "workerConfiguration": { "cpu": "4vCPU", "memory": "8GB", "disk": "500GB", "diskType": "SHUFFLE_OPTIMIZED" } } }' \ --maximum-capacity '{ "cpu": "400vCPU", "memory": "1024GB" }'