Investigate performance issues by using the Spark UI

Before you apply any best practices to tune the performance of your AWS Glue jobs, we highly recommend that you profile the performance and identify the bottlenecks. Profiling first helps you focus your tuning effort where it will have the most impact.

For quick analysis, Amazon CloudWatch metrics provide a basic view of your job metrics. The Spark UI provides a deeper view for performance tuning. To use the Spark UI with AWS Glue, you must enable the Spark UI for your AWS Glue jobs. After you are familiar with the Spark UI, follow the strategies for tuning Spark job performance to identify and reduce the impact of bottlenecks based on your findings.
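You can enable the Spark UI when you create or edit a job, or by setting the job parameters programmatically. The following is a minimal boto3 sketch, assuming a hypothetical job name, IAM role, and S3 bucket; the --enable-spark-ui and --spark-event-logs-path special parameters tell AWS Glue to write Spark event logs to Amazon S3 so that the Spark UI (Spark history server) can read them.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job name, role, and bucket; replace with your own values.
glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/MyGlueJobRole",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/my-glue-job.py",
        },
        "DefaultArguments": {
            # Enable Spark UI event logging for this job
            "--enable-spark-ui": "true",
            # S3 location where the Spark event logs are written
            "--spark-event-logs-path": "s3://my-bucket/spark-event-logs/",
        },
    },
)
```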

Identify bottlenecks by using the Spark UI

When you open the Spark UI, Spark applications are listed in a table. By default, an AWS Glue job's App Name is nativespark-<Job Name>-<Job Run ID>. Choose the target Spark app based on the job run ID to open the Jobs tab. Incomplete job runs, such as streaming job runs, are listed in Show incomplete applications.

The Jobs tab shows a summary of all jobs in the Spark application. To determine whether any stages or tasks failed, compare the number of succeeded tasks with the total number of tasks. To find the bottlenecks, sort the jobs by choosing Duration. Drill down to the details of long-running jobs by choosing the link shown in the Description column.

Spark Jobs tab showing duration, stages succeeded/total, and tasks succeeded/total.

The Details for Job page lists the stages. On this page, you can see overall insights such as the duration, the number of succeeded and total tasks, the amount of input and output data, and the amount of shuffle read and shuffle write.

""

The Executors tab shows the Spark cluster capacity in detail. You can check the total number of cores. The cluster shown in the following screenshot contains 316 active cores and 512 cores in total. By default, each core can process one Spark task at a time.
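If you want to confirm this capacity from inside the job itself, the following is a minimal PySpark sketch. It uses standard Spark properties, not Glue-specific values; the core counts you see in the Executors tab ultimately come from the job's worker type and number of workers.

```python
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# defaultParallelism reflects the total number of cores available to the
# application; each core runs one task at a time unless spark.task.cpus
# is changed.
print("Default parallelism (total cores):", sc.defaultParallelism)
print("Cores per executor:", sc.getConf().get("spark.executor.cores", "not set"))
```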

Executors page summary showing the number of cores for executors.

Based on the value 5/5 (succeeded/total tasks) shown on the Details for Job page, stage 5 is the longest stage, but it runs only 5 tasks, so it uses only 5 cores out of 512. Because the parallelism of this stage is so low, yet it takes a significant amount of time, you can identify it as a bottleneck. To improve performance, you want to understand why. To learn more about how to recognize and reduce the impact of common performance bottlenecks, see Strategies for tuning Spark job performance.
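As one illustration, a low task count often means the data behind that stage has too few partitions. The following is a minimal PySpark sketch of one common remedy, repartitioning so the stage produces roughly one task per available core; the input and output paths and the DataFrame are illustrative, and the right fix depends on what actually causes the low parallelism in your job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path; a handful of large input files can yield only a
# few partitions, which limits the stage to a few tasks (and cores).
df = spark.read.parquet("s3://my-bucket/input/")

# Repartition to approximately the total core count so Spark can schedule
# one task per core for the downstream stage.
total_cores = spark.sparkContext.defaultParallelism
df = df.repartition(total_cores)

# Hypothetical output path.
df.write.mode("overwrite").parquet("s3://my-bucket/output/")
```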