Tuning Spark SQL queries for AWS Glue and Amazon EMR Spark jobs
Phani Alapaty and Ravikiran Rao, Amazon Web Services (AWS)
January 2024
Spark SQL
Joining data is one of the most common and important operations you can perform when extracting, transforming, or loading data into object stores or databases. Joins require careful attention to performance. In several scenarios, such as large network transfers, join, analyze, or aggregate operations can run out of memory and cause the AWS Glue Spark job to fail.
This guide provides best practices that help you tune Spark SQL join queries for AWS Glue or Amazon EMR jobs. Spark provides many configuration options that improve the performance of the Spark SQL workload. These adjustments can be done programmatically, or you can apply them at the global level by using the spark-submit command. This guide explains some of these configurations so that you can improve or fine-tune the performance of your Spark SQL queries and applications. The recommendations in this guide are based on configurations that AWS Professional Services uses to improve the performance of Spark SQL queries and applications.
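The two ways of applying a configuration mentioned above can be sketched as follows. This is a minimal illustration, not a recommended setup. The property names are standard Spark SQL settings, but the values are placeholders, and the script name my_job.py and the helper build_spark_submit_args are hypothetical. The helper assembles a spark-submit command line that passes each setting through the --conf flag. The trailing comment shows the programmatic equivalent, assuming the job already has a SparkSession named spark.

```python
import shlex

# Illustrative Spark SQL settings. The values are placeholders (assumptions),
# not tuning recommendations from this guide.
SPARK_SQL_CONF = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),  # bytes
    "spark.sql.adaptive.enabled": "true",
}


def build_spark_submit_args(conf, app_path):
    """Return an argument list that applies `conf` at the global level
    by repeating spark-submit's --conf key=value flag."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


# "my_job.py" is a hypothetical application script.
command = shlex.join(build_spark_submit_args(SPARK_SQL_CONF, "my_job.py"))
print(command)

# Programmatic alternative, inside a job that already has a SparkSession
# named `spark` (assumed here, so it is left as a comment):
#
#     for key, value in SPARK_SQL_CONF.items():
#         spark.conf.set(key, value)
```

Settings passed with --conf apply to the whole application from startup, whereas spark.conf.set changes them for the running session only, which can be useful when a single query in a job needs different values.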
Intended audience
This guide helps architects, data engineers, data scientists, and developers understand the Spark SQL configuration options that improve the performance of Spark SQL queries.