Scale cluster capacity

If your job is taking too much time even though the executors are consuming sufficient resources, and Spark is creating a large number of tasks relative to the available cores, consider scaling cluster capacity. To assess whether this is appropriate, use the following metrics.

CloudWatch metrics

  • Check CPU Load and Memory Utilization to determine whether executors are consuming sufficient resources.

  • Check how long the job has run to assess whether the processing time is too long to meet your performance goals.

In the following example, four executors are running at more than 97 percent CPU load, but processing has not completed after about three hours.

""
Note

If CPU load is low, you probably will not benefit from scaling cluster capacity.
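If you prefer to check CPU load programmatically rather than in the CloudWatch console, a sketch such as the following can retrieve the aggregated executor CPU load. The job name is a placeholder, and the metric name and dimensions assume that job metrics are enabled for the job; verify them against the metrics your job actually emits.

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)

    # "my-glue-job" is a placeholder job name. glue.ALL.system.cpuSystemLoad is the
    # aggregated CPU load gauge that AWS Glue publishes when job metrics are enabled.
    response = cloudwatch.get_metric_statistics(
        Namespace="Glue",
        MetricName="glue.ALL.system.cpuSystemLoad",
        Dimensions=[
            {"Name": "JobName", "Value": "my-glue-job"},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "gauge"},
        ],
        StartTime=now - timedelta(hours=3),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )

    # Print the average CPU load for each 5-minute period, oldest first.
    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], f"{point['Average']:.2f}")

Sustained values close to 1.0 over the run, combined with a long run time, indicate the pattern described above.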

Spark UI

On the Jobs tab or the Stages tab, you can see the number of tasks for each job or stage. In the following example, Spark has created 58,100 tasks.

Stages for All Jobs showing one stage and 58,100 tasks.

On the Executors tab, you can see the total number of executors and tasks. In the following screenshot, each Spark executor has four cores and can run four tasks concurrently.

Executors table showing the Cores column.

In this example, the number of Spark tasks (58,100) is much larger than the 16 tasks that the executors can process concurrently (4 executors × 4 cores).
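You can make the same comparison from inside the job itself. The following minimal sketch, which is not taken from the AWS documentation, contrasts the number of partitions (and therefore tasks) of a DataFrame with the number of concurrent task slots that Spark reports; the DataFrame and partition count are placeholders.

    from pyspark.sql import SparkSession

    # Minimal sketch: compare the tasks a stage will create with the task slots
    # (roughly executors x cores per executor) available to the application.
    spark = SparkSession.builder.appName("task-vs-slot-check").getOrCreate()

    # Placeholder DataFrame; in a real job this would be your input data.
    df = spark.range(0, 1_000_000).repartition(58100)

    num_tasks = df.rdd.getNumPartitions()               # tasks for the next stage on df
    task_slots = spark.sparkContext.defaultParallelism  # approximately executors x cores

    print(f"tasks={num_tasks}, concurrent task slots={task_slots}, "
          f"waves of tasks = {num_tasks / max(task_slots, 1):.0f}")

A large number of waves means the executors must work through a long backlog of tasks, which is the situation where adding capacity helps.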

If you observe these symptoms, consider scaling the cluster. You can scale cluster capacity by using the following options; a short code sketch after the list shows how each option can be applied to a job run.

  • Enable AWS Glue Auto Scaling – Auto Scaling is available for your AWS Glue extract, transform, and load (ETL) and streaming jobs in AWS Glue version 3.0 or later. AWS Glue automatically adds and removes workers from the cluster depending on the number of partitions at each stage or the rate at which microbatches are generated on the job run.

    If you observe a situation where the number of workers does not increase even though Auto Scaling is enabled, consider adding workers manually. However, note that scaling manually for one stage might cause many workers to be idle during later stages, costing more for zero performance gain.

    After you enable Auto Scaling, you can see the number of executors in the CloudWatch executor metrics. Use the following metrics to monitor the demand for executors in Spark applications:

    • glue.driver.ExecutorAllocationManager.executors.numberAllExecutors

    • glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors

    For more information about metrics, see Monitoring AWS Glue using Amazon CloudWatch metrics.

  • Scale out: Increase the number of AWS Glue workers – You can manually increase the number of AWS Glue workers. Add workers only until you observe idle workers. At that point, adding more workers will increase costs without improving results. For more information, see Parallelize tasks.

  • Scale up: Use a larger worker type – You can manually change the instance type of your AWS Glue workers to use workers with more cores, memory, and storage. Larger worker types make it possible for you to vertically scale and run intensive data integration jobs, such as memory-intensive data transforms, skewed aggregations, and entity-detection checks involving petabytes of data.

    Scaling up also assists in cases where the Spark driver needs larger capacity—for instance, because the job query plan is quite large. For more information about worker types and performance, see the AWS Big Data Blog post Scale your AWS Glue for Apache Spark jobs with new larger worker types G.4X and G.8X.

    Using larger workers can also reduce the total number of workers needed, which increases performance by reducing shuffle in intensive operations such as join.
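The following boto3 sketch illustrates how these options could be applied as per-run overrides. The job name, worker count, and worker type are placeholder values chosen for illustration, and whether Auto Scaling is enabled as a default job property or as a run argument depends on how your job is defined.

    import boto3

    glue = boto3.client("glue")

    # "my-glue-job" is a placeholder; the values below are illustrative, not recommendations.
    response = glue.start_job_run(
        JobName="my-glue-job",
        # Scale out: request more workers than the job's default.
        NumberOfWorkers=40,
        # Scale up: choose a larger worker type (for example, G.2X instead of G.1X).
        WorkerType="G.2X",
        Arguments={
            # Auto Scaling (AWS Glue 3.0 or later): let AWS Glue add and remove
            # workers, up to NumberOfWorkers, based on demand at each stage.
            "--enable-auto-scaling": "true",
        },
    )

    print("Started job run:", response["JobRunId"])

After the run starts, compare the numberAllExecutors and numberMaxNeededExecutors metrics listed earlier to confirm whether the cluster is still under-provisioned.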