AWS Glue Spark shuffle manager with Amazon S3 - AWS Glue

AWS Glue Spark shuffle manager with Amazon S3

Shuffling is an important step in a Spark job whenever data is rearranged between partitions. This is required because wide transformations such as join, groupByKey, reduceByKey, and repartition require information from other partitions to complete processing. Spark gathers the required data from each partition and combines it into a new partition. During a shuffle, data is written to disk and transferred across the network. As a result, the shuffle operation is bound to local disk capacity. Spark throws a No space left on device or MetadataFetchFailedException error when there is not enough disk space left on the executor and there is no recovery.

Solution

With AWS Glue 2.0, you can now use Amazon S3 to store Spark shuffle and spill data. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This solution disaggregates compute and storage for your Spark jobs, and gives complete elasticity and low-cost shuffle storage, allowing you to run your most shuffle-intensive workloads reliably.

We are introducing a new shuffle manager (called GlueShuffleManager) which will write and read shuffle files to/from Amazon S3. You can turn on Amazon S3 shuffling to run your AWS Glue jobs reliably without failures if they are known to be bound by the local disk capacity for large shuffle operations. In some cases, shuffling to Amazon S3 is marginally slower than local disk (or EBS) if you have a large number of small partitions or shuffle files written out to Amazon S3.

Using AWS Glue Spark shuffle manager from the AWS Console

To set up the AWS Glue Spark shuffle manager using the AWS Glue console or AWS Glue Studio when configuring a job: choose the --write-shuffle-files-to-s3 job parameter to turn on Amazon S3 shuffling for the job.

Using AWS Glue Spark shuffle manager

The following job parameters turn on and tune the AWS Glue shuffle manager.

  • --write-shuffle-files-to-s3 — The main flag, which when true enables the AWS Glue Spark shuffle manager to use Amazon S3 buckets for writing and reading shuffle data. When false, or not specified the shuffle manager is not used.

  • --write-shuffle-spills-to-s3 — An optional flag that when true allows you to offload spill files to Amazon S3 buckets, which provides additional resiliency to your Spark job. This is only required for large workloads that spill a lot of data to disk. When false, no intermediate spill files are written. This flag is disabled by default.

  • --conf spark.shuffle.glue.s3ShuffleBucket=S3://<shuffle-bucket> — Another optional flag that specifies the Amazon S3 bucket where you write the shuffle files. By default, --TempDir/shuffle-data.

AWS Glue supports all other shuffle related configurations provided by Spark.

Notes and limitations

The following are notes or limitations for the AWS Glue shuffle manager:

  • AWS Glue currently supports the shuffle manager only on AWS Glue version 2.0, and Spark 2.4.3.

  • Make sure the location of the shuffle bucket is in the same AWS Region in which the job runs.

  • Set the Amazon S3 storage lifecycle policies on the prefix as the shuffle manager does not clean the files after the job is done.

  • You can use this feature if your data is skewed.