Strategies for tuning Spark job performance
When preparing to tune parameters, use the following best practices:
- Determine your performance goals before beginning to identify problems.
- Use metrics to identify problems before attempting to change tuning parameters.
For the most consistent results when tuning a job, develop a baseline strategy for your tuning work.
Baseline strategy for performance tuning
Generally, performance tuning is performed in the following workflow:
1. Determine performance goals.
2. Measure metrics.
3. Identify bottlenecks.
4. Reduce the impact of the bottlenecks.
5. Repeat steps 2-4 until you achieve the intended target.
First, determine your performance goals. For example, one goal might be to complete an AWS Glue job run within 3 hours. After you define your goals, measure job performance metrics, then identify trends in those metrics and the bottlenecks that prevent you from meeting your goals. Identifying bottlenecks is especially important for troubleshooting, debugging, and performance tuning. During the run of a Spark application, Spark records the status and statistics of each task in the Spark event log.
In AWS Glue, you can view Spark metrics through the Spark Web UI.
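For example, you can persist the Spark event log to Amazon S3 so the Spark Web UI can read it. The following is a minimal sketch using boto3, assuming the --enable-spark-ui and --spark-event-logs-path job parameters; the job name and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Minimal sketch (job name and S3 path are placeholders): these job parameters
# persist the Spark event log to S3 so it can be inspected in the Spark Web UI.
glue.start_job_run(
    JobName="example-etl-job",
    Arguments={
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": "s3://example-bucket/spark-event-logs/",
    },
)
```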
After you determine your performance goals and identify metrics to assess those goals, you can begin to identify and remediate bottlenecks by using the strategies in the following sections.
Tuning practices for Spark job performance
You can use the following strategies for performance tuning AWS Glue for Spark jobs:
- AWS Glue resources
- Spark applications
Before you use these strategies, you must have access to the metrics and configuration for your Spark job. You can find this information in the AWS Glue documentation.
From the AWS Glue resource perspective, you can achieve performance improvements by adding AWS Glue workers and using the latest AWS Glue version.
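The following is a minimal sketch of configuring these resource-side levers with boto3; the job name, IAM role, script location, and capacity values are placeholders, not a prescribed configuration.

```python
import boto3

glue = boto3.client("glue")

# Minimal sketch (name, role, and paths are placeholders): raising NumberOfWorkers
# scales out the cluster, and GlueVersion pins the job to a recent AWS Glue version.
glue.create_job(
    Name="example-etl-job",
    Role="arn:aws:iam::123456789012:role/ExampleGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/example_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",      # use a recent AWS Glue version
    WorkerType="G.1X",
    NumberOfWorkers=20,     # add workers to increase cluster capacity
)
```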
From an Apache Spark application perspective, you have access to several strategies that can improve performance. If unnecessary data is loaded into the Spark cluster, you can remove it to reduce the amount of loaded data. If you have underused Spark cluster resources and you have low data I/O, you can identify tasks to parallelize. You might also want to optimize heavy data transfer operations such as joins if they are taking substantial time. You can also optimize your job query plan or reduce the computational complexity of individual Spark tasks.
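As a minimal sketch of two of these strategies, the following PySpark example prunes columns and filters early to reduce the data loaded into the cluster, and broadcasts a small table to lighten a heavy join; the S3 paths and column names are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.getOrCreate()

# Load only the columns the job needs and filter as early as possible so that
# less data is pulled into the Spark cluster (paths and columns are placeholders).
events = (
    spark.read.parquet("s3://example-bucket/events/")
    .select("event_id", "user_id", "event_date", "amount")
    .filter(col("event_date") >= "2024-01-01")
)

# Broadcasting a small dimension table avoids shuffling the large table in the join.
users = spark.read.parquet("s3://example-bucket/users/").select("user_id", "country")

joined = events.join(broadcast(users), on="user_id", how="left")
joined.write.mode("overwrite").parquet("s3://example-bucket/output/")
```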
To apply these strategies efficiently, consult your metrics to identify when each one is applicable. For more details, see each of the following sections. These techniques work not only for performance tuning but also for solving typical problems such as out-of-memory (OOM) errors.