Monitoring AWS Glue using Amazon CloudWatch metrics

You can profile and monitor AWS Glue operations using the AWS Glue job profiler. It collects and processes raw data from AWS Glue jobs into readable, near real-time metrics stored in Amazon CloudWatch. These statistics are retained and aggregated in CloudWatch so that you can access historical information for a better perspective on how your application is performing.

Note

You may incur additional charges when you enable job metrics and CloudWatch custom metrics are created. For more information, see Amazon CloudWatch pricing.

AWS Glue metrics overview

When you interact with AWS Glue, it sends metrics to CloudWatch. You can view these metrics using the AWS Glue console (the preferred method), the CloudWatch console dashboard, or the AWS Command Line Interface (AWS CLI).

To view metrics using the AWS Glue console dashboard

You can view summary or detailed graphs of metrics for a job, or detailed graphs for a job run.

  1. Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/.

  2. In the navigation pane, choose Job run monitoring.

  3. In Job runs, choose Actions to stop a job that is currently running, view a job, or rewind a job bookmark.

  4. Select a job, then choose View run details to view additional information about the job run.

To view metrics using the CloudWatch console dashboard

Metrics are grouped first by the service namespace, and then by the various dimension combinations within each namespace.

  1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. In the navigation pane, choose Metrics.

  3. Choose the Glue namespace.

To view metrics using the AWS CLI

  • At a command prompt, use the following command.

    aws cloudwatch list-metrics --namespace Glue

AWS Glue reports metrics to CloudWatch every 30 seconds, and the CloudWatch metrics dashboards are configured to display them every minute. The AWS Glue metrics represent delta values from the previously reported values. Where appropriate, metrics dashboards aggregate (sum) the 30-second values to obtain a value for the entire last minute.

AWS Glue metrics behavior for Spark jobs

AWS Glue metrics are enabled at initialization of a GlueContext in a script and are generally updated only at the end of an Apache Spark task. They represent the aggregate values across all completed Spark tasks so far.

However, the Spark metrics that AWS Glue passes on to CloudWatch are generally absolute values representing the current state at the time they are reported. AWS Glue reports them to CloudWatch every 30 seconds, and the metrics dashboards generally show the average across the data points received in the last 1 minute.
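
As a rough illustration of the two aggregation rules described above (delta values are summed, absolute values are averaged), the following Python sketch collapses 30-second data points into one value per minute. The function name, the (epoch seconds, value) input shape, and the sample numbers are assumptions made for this example; it is not a description of how the AWS Glue dashboards are implemented.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_minute(datapoints, metric_type):
    """Collapse 30-second data points into one value per minute.

    datapoints:  iterable of (epoch_seconds, value) pairs (assumed shape).
    metric_type: "count" for delta values (summed) or
                 "gauge" for absolute values (averaged).
    """
    if metric_type not in ("count", "gauge"):
        raise ValueError(f"unknown metric type: {metric_type!r}")

    buckets = defaultdict(list)
    for epoch_seconds, value in datapoints:
        # Round the timestamp down to the start of its minute.
        buckets[epoch_seconds // 60 * 60].append(value)

    combine = sum if metric_type == "count" else mean
    return {minute: combine(values) for minute, values in sorted(buckets.items())}


# Hypothetical samples: (seconds since start, reported value).
bytes_read = [(0, 1000), (30, 1500), (60, 700)]
heap_usage = [(0, 0.40), (30, 0.60), (60, 0.50)]

bytes_per_minute = aggregate_per_minute(bytes_read, "count")
heap_per_minute = aggregate_per_minute(heap_usage, "gauge")
```

Here the two delta values in the first minute are added together, while the two heap readings in the same minute are averaged.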

AWS Glue metric names are all preceded by one of the following types of prefix:

  • glue.driver. – Metrics whose names begin with this prefix either represent AWS Glue metrics that are aggregated from all executors at the Spark driver, or Spark metrics corresponding to the Spark driver.

  • glue.executorId. – The executorId is the number of a specific Spark executor. It corresponds with the executors listed in the logs.

  • glue.ALL. – Metrics whose names begin with this prefix aggregate values from all Spark executors.
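
The prefix convention above can be sketched as a small Python parser that splits a metric name into its scope and the remaining metric path. The helper name and the returned tuple layout are hypothetical, and the pattern assumes that an executor ID is written as a plain decimal number.

```python
import re

# Assumption: executor IDs appear as decimal integers in the metric name.
_PREFIX = re.compile(r"^glue\.(driver|ALL|\d+)\.(.+)$")


def split_metric_name(name):
    """Return (scope, executor_id, metric_path) for an AWS Glue metric name."""
    match = _PREFIX.match(name)
    if match is None:
        raise ValueError(f"not a recognized AWS Glue metric name: {name!r}")

    source, metric_path = match.groups()
    if source == "driver":
        return "driver", None, metric_path
    if source == "ALL":
        return "all_executors", None, metric_path
    # Anything else matched \d+, so it is a specific executor.
    return "executor", int(source), metric_path


driver_metric = split_metric_name("glue.driver.aggregate.bytesRead")
executor_metric = split_metric_name("glue.3.jvm.heap.usage")
all_metric = split_metric_name("glue.ALL.s3.filesystem.read_bytes")
```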

AWS Glue metrics

AWS Glue profiles and sends the following metrics to CloudWatch every 30 seconds, and the AWS Glue Metrics Dashboard reports them once a minute:

Metric Description

glue.driver.aggregate.bytesRead

The number of bytes read from all data sources by all completed Spark tasks running in all executors.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor:

  • Bytes read.

  • Job progress.

  • JDBC data sources.

  • Job Bookmark Issues.

  • Variance across Job Runs.

This metric can be used the same way as the glue.ALL.s3.filesystem.read_bytes metric, with the difference that this metric is updated at the end of a Spark task and captures non-S3 data sources as well.

glue.driver.aggregate.elapsedTime

The ETL elapsed time in milliseconds (does not include the job bootstrap times).

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Milliseconds

Can be used to determine how long a job run takes on average.

Some ways to use the data:

  • Set alarms for stragglers.

  • Measure variance across job runs.
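
One possible way to measure variance across job runs is to total the elapsed time for each run and flag runs that sit well above the mean. The sketch below assumes you have already collected one total per job run; the function name, the run IDs, and the two-standard-deviation cutoff are illustrative choices, not AWS recommendations.

```python
from statistics import mean, pstdev


def find_slow_runs(elapsed_ms_by_run, threshold_sigmas=2.0):
    """Return (average_ms, std_dev_ms, slow_run_ids).

    elapsed_ms_by_run maps a job run ID to its total elapsed time in
    milliseconds. A run counts as slow when it exceeds the average by more
    than threshold_sigmas population standard deviations.
    """
    durations = list(elapsed_ms_by_run.values())
    average = mean(durations)
    spread = pstdev(durations)
    limit = average + threshold_sigmas * spread
    slow = sorted(run for run, ms in elapsed_ms_by_run.items() if ms > limit)
    return average, spread, slow


# Hypothetical totals for six runs of the same job.
runs = {
    "jr_1": 60_000,
    "jr_2": 62_000,
    "jr_3": 61_000,
    "jr_4": 58_000,
    "jr_5": 59_000,
    "jr_6": 120_000,
}
average_ms, std_dev_ms, slow_runs = find_slow_runs(runs)
```

With these sample values only the last run is reported as slow. A fixed cutoff like this is sensitive to small samples, so treat it as a starting point.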

glue.driver.aggregate.numCompletedStages

The number of completed stages in the job.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Job progress.

  • Per-stage timeline of job execution, when correlated with other metrics.

Some ways to use the data:

  • Identify demanding stages in the execution of a job.

  • Set alarms for correlated spikes (demanding stages) across job runs.

glue.driver.aggregate.numCompletedTasks

The number of completed tasks in the job.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Job progress.

  • Parallelism within a stage.

glue.driver.aggregate.numFailedTasks

The number of failed tasks.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Data abnormalities that cause job tasks to fail.

  • Cluster abnormalities that cause job tasks to fail.

  • Script abnormalities that cause job tasks to fail.

The data can be used to set alarms for increased failures that might suggest abnormalities in data, cluster, or scripts.
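
A minimal sketch of such an alarm definition follows. It only builds the keyword arguments; the keys match the parameters of the CloudWatch PutMetricAlarm API, but the helper name, the alarm name pattern, the threshold of 5, and the choice to treat missing data as not breaching are all assumptions for illustration.

```python
def failed_tasks_alarm_params(job_name, threshold=5):
    """Build PutMetricAlarm arguments for failed tasks of one AWS Glue job."""
    return {
        "AlarmName": f"{job_name}-failed-tasks",  # hypothetical naming scheme
        "Namespace": "Glue",
        "MetricName": "glue.driver.aggregate.numFailedTasks",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "count"},
        ],
        # The metric is a delta value, so Sum is the matching statistic.
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


params = failed_tasks_alarm_params("my-etl-job")

# With boto3 installed and credentials configured, the alarm could be
# created along these lines:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```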

glue.driver.aggregate.numKilledTasks

The number of tasks killed.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Abnormalities in Data Skew that result in exceptions (OOMs) that kill tasks.

  • Script abnormalities that result in exceptions (OOMs) that kill tasks.

Some ways to use the data:

  • Set alarms for increased failures indicating data abnormalities.

  • Set alarms for increased failures indicating cluster abnormalities.

  • Set alarms for increased failures indicating script abnormalities.

glue.driver.aggregate.recordsRead

The number of records read from all data sources by all completed Spark tasks running in all executors.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Count

Can be used to monitor:

  • Records read.

  • Job progress.

  • JDBC data sources.

  • Job Bookmark Issues.

  • Skew in Job Runs over days.

This metric can be used in a similar way to the glue.ALL.s3.filesystem.read_bytes metric, with the difference that this metric is updated at the end of a Spark task.

glue.driver.aggregate.shuffleBytesWritten

The number of bytes written by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written for this purpose during the previous minute).

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce).

Some ways to use the data:

  • Repartition or decompress large input files before further processing.

  • Repartition data more uniformly to avoid hot keys.

  • Pre-filter data before joins or groupBy operations.

glue.driver.aggregate.shuffleLocalBytesRead

The number of bytes read by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read for this purpose during the previous minute).

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.

Unit: Bytes

Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce).

Some ways to use the data:

  • Repartition or decompress large input files before further processing.

  • Repartition data more uniformly to avoid hot keys.

  • Pre-filter data before joins or groupBy operations.

glue.driver.BlockManager.disk.diskSpaceUsed_MB

The number of megabytes of disk space used across all executors.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Megabytes

Can be used to monitor:

  • Disk space used for blocks that represent cached RDD partitions.

  • Disk space used for blocks that represent intermediate shuffle outputs.

  • Disk space used for blocks that represent broadcasts.

Some ways to use the data:

  • Identify job failures due to increased disk usage.

  • Identify large partitions resulting in spilling or shuffling.

  • Increase provisioned DPU capacity to correct these issues.

glue.driver.ExecutorAllocationManager.executors.numberAllExecutors

The number of actively running job executors.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Count

Can be used to monitor:

  • Job activity.

  • Straggling executors (with only a few executors running).

  • Current executor-level parallelism.

Some ways to use the data:

  • Repartition or decompress large input files beforehand if the cluster is underutilized.

  • Identify stage or job execution delays due to straggler scenarios.

  • Compare with numberMaxNeededExecutors to understand backlog for provisioning more DPUs.

glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors

The maximum number of (actively running and pending) job executors needed to satisfy the current load.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Maximum. This is a Spark metric, reported as an absolute value.

Unit: Count

Can be used to monitor:

  • Job activity.

  • Current executor-level parallelism and backlog of pending tasks not yet scheduled because of unavailable executors due to DPU capacity or killed/failed executors.

Some ways to use the data:

  • Identify the pending backlog in the scheduling queue.

  • Identify stage or job execution delays due to straggler scenarios.

  • Compare with numberAllExecutors to understand backlog for provisioning more DPUs.

  • Increase provisioned DPU capacity to correct the pending executor backlog.
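
The comparison between numberAllExecutors and numberMaxNeededExecutors can be sketched as a per-sample difference. The function name and the sample series below are hypothetical, and the sketch assumes the two series are already aligned on the same timestamps.

```python
def executor_backlog(all_executors, max_needed_executors):
    """Return, for each sample, how many more executors were needed than ran.

    Both arguments are equal-length sequences of gauge values taken at the
    same timestamps. A result of 0 means demand was met at that point.
    """
    if len(all_executors) != len(max_needed_executors):
        raise ValueError("the two series must have the same length")
    return [
        max(0, needed - running)
        for running, needed in zip(all_executors, max_needed_executors)
    ]


# Hypothetical samples from one job run.
running = [10, 10, 10, 4]
needed = [10, 25, 18, 4]
backlog = executor_backlog(running, needed)
```

A backlog that stays above zero for many consecutive samples suggests the job could use more DPUs, while a brief spike may not justify a change.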

glue.driver.jvm.heap.usage

glue.executorId.jvm.heap.usage

glue.ALL.jvm.heap.usage

The fraction of memory used by the JVM heap (scale: 0-1) for the driver, an executor identified by executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Percentage

Can be used to monitor:

  • Driver out-of-memory conditions (OOM) using glue.driver.jvm.heap.usage.

  • Executor out-of-memory conditions (OOM) using glue.ALL.jvm.heap.usage.

Some ways to use the data:

  • Identify memory-consuming executor IDs and stages.

  • Identify straggling executor IDs and stages.

  • Identify a driver out-of-memory condition (OOM).

  • Identify an executor out-of-memory condition (OOM) and obtain the corresponding executor ID so as to be able to get a stack trace from the executor log.

  • Identify files or partitions that may have data skew resulting in stragglers or out-of-memory conditions (OOMs).
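
As a simple illustration of spotting memory-heavy executors, the sketch below scans heap usage samples per metric name and returns those whose peak reaches a cutoff. The helper name, the input layout, and the 0.9 cutoff are assumptions; a suitable threshold depends on the job.

```python
def executors_over_threshold(heap_usage_samples, threshold=0.9):
    """Return metric names whose peak heap usage meets the threshold.

    heap_usage_samples maps a metric name such as "glue.3.jvm.heap.usage"
    to a list of reported fractions on the 0-1 scale.
    """
    peaks = {
        name: max(values)
        for name, values in heap_usage_samples.items()
        if values  # skip metrics with no data points
    }
    return sorted(name for name, peak in peaks.items() if peak >= threshold)


# Hypothetical samples for a driver and two executors.
samples = {
    "glue.driver.jvm.heap.usage": [0.35, 0.40, 0.38],
    "glue.1.jvm.heap.usage": [0.55, 0.61, 0.58],
    "glue.2.jvm.heap.usage": [0.70, 0.88, 0.95],
}
at_risk = executors_over_threshold(samples)
```

The executor ID embedded in each returned name can then be used to locate the matching executor log.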

glue.driver.jvm.heap.used

glue.executorId.jvm.heap.used

glue.ALL.jvm.heap.used

The number of memory bytes used by the JVM heap for the driver, the executor identified by executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This is a Spark metric, reported as an absolute value.

Unit: Bytes

Can be used to monitor:

  • Driver out-of-memory conditions (OOM).

  • Executor out-of-memory conditions (OOM).

Some ways to use the data:

  • Identify memory-consuming executor IDs and stages.

  • Identify straggling executor IDs and stages.

  • Identify a driver out-of-memory condition (OOM).

  • Identify an executor out-of-memory condition (OOM) and obtain the corresponding executor ID so as to be able to get a stack trace from the executor log.

  • Identify files or partitions that may have data skew resulting in stragglers or out-of-memory conditions (OOMs).

glue.driver.s3.filesystem.read_bytes

glue.executorId.s3.filesystem.read_bytes

glue.ALL.s3.filesystem.read_bytes

The number of bytes read from Amazon S3 by the driver, an executor identified by executorId, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read during the previous minute).

Valid dimensions: JobName, JobRunId, and Type (gauge).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes read by two different job runs.

Unit: Bytes

Can be used to monitor:

  • ETL data movement.

  • Job progress.

  • Job bookmark issues (data processed, reprocessed, and skipped).

  • Comparison of reads to ingestion rate from external data sources.

  • Variance across job runs.

Resulting data can be used for:

  • DPU capacity planning.

  • Setting alarms for large spikes or dips in data read for job runs and job stages.
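
Because the metric is a delta value, the area under the curve for a job run is just the sum of its reported values. The following sketch compares two runs numerically in that way; the function name and the sample deltas are hypothetical.

```python
def compare_bytes_read(run_a_deltas, run_b_deltas):
    """Return (total_a, total_b, relative_change) for two job runs.

    Each argument is the list of read_bytes delta values reported during
    one run. relative_change is (total_b - total_a) / total_a, or None
    when run A read nothing.
    """
    total_a = sum(run_a_deltas)
    total_b = sum(run_b_deltas)
    change = (total_b - total_a) / total_a if total_a else None
    return total_a, total_b, change


# Hypothetical delta values for two runs of the same job.
previous_run = [4_000_000, 6_000_000, 5_000_000, 5_000_000]
latest_run = [4_000_000, 3_000_000, 3_000_000]

total_prev, total_latest, relative_change = compare_bytes_read(
    previous_run, latest_run
)
```

In this example the latest run read half as many bytes as the previous one, which might be expected with job bookmarks enabled or might indicate skipped data, depending on the job.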

glue.driver.s3.filesystem.write_bytes

glue.executorId.s3.filesystem.write_bytes

glue.ALL.s3.filesystem.write_bytes

The number of bytes written to Amazon S3 by the driver, an executor identified by executorId, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written during the previous minute).

Valid dimensions: JobName, JobRunId, and Type (gauge).

Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes written by two different job runs.

Unit: Bytes

Can be used to monitor:

  • ETL data movement.

  • Job progress.

  • Job bookmark issues (data processed, reprocessed, and skipped).

  • Comparison of reads to ingestion rate from external data sources.

  • Variance across job runs.

Some ways to use the data:

  • DPU capacity planning.

  • Setting alarms for large spikes or dips in data read for job runs and job stages.

glue.driver.streaming.numRecords

The number of records that are received in a micro-batch. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: Sum, Maximum, Minimum, Average, Percentile

Unit: Count

Can be used to monitor:

  • Records read.

  • Job progress.

glue.driver.streaming.batchProcessingTimeInMs

The time it takes to process the batches in milliseconds. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).

Valid Statistics: Sum, Maximum, Minimum, Average, Percentile

Unit: Count

Can be used to monitor:

  • Job progress.

  • Script performance.

glue.driver.system.cpuSystemLoad

glue.executorId.system.cpuSystemLoad

glue.ALL.system.cpuSystemLoad

The fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by executorId, or ALL executors.

Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).

Valid Statistics: Average. This metric is reported as an absolute value.

Unit: Percentage

Can be used to monitor:

  • Driver CPU load.

  • Executor CPU load.

  • Detecting CPU-bound or IO-bound executors or stages in a Job.

Some ways to use the data:

  • DPU capacity planning, along with I/O metrics (bytes read, shuffle bytes, task parallelism) and the number of maximum needed executors metric.

  • Identify the CPU/IO-bound ratio. This allows for repartitioning and increasing provisioned capacity for long-running jobs with splittable datasets having lower CPU utilization.

Dimensions for AWS Glue Metrics

AWS Glue metrics use the Glue namespace and provide metrics for the following dimensions:

Dimension Description

JobName

This dimension filters for metrics of all job runs of a specific AWS Glue job.

JobRunId

This dimension filters for metrics of a specific AWS Glue job run by a JobRun ID, or ALL.

Type

This dimension filters for metrics by either count (an aggregate number) or gauge (a value at a point in time).
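
To show how the three dimensions fit together, the sketch below builds the arguments for a CloudWatch GetMetricStatistics request. The keys match that API's parameters, but the helper name, its defaults, and the job name are assumptions for illustration. The helper only constructs a dictionary and makes no AWS call.

```python
from datetime import datetime, timedelta, timezone


def metric_query(metric_name, job_name, job_run_id="ALL",
                 metric_type="count", minutes=60, now=None):
    """Build GetMetricStatistics arguments for one AWS Glue metric."""
    if metric_type not in ("count", "gauge"):
        raise ValueError(f"unknown metric type: {metric_type!r}")
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "Glue",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": job_run_id},
            {"Name": "Type", "Value": metric_type},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,
        # Delta (count) metrics are summed; gauge metrics are averaged.
        "Statistics": ["Sum" if metric_type == "count" else "Average"],
    }


fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
count_query = metric_query(
    "glue.driver.aggregate.bytesRead", "my-etl-job", now=fixed_now
)
gauge_query = metric_query(
    "glue.ALL.jvm.heap.usage", "my-etl-job", metric_type="gauge", now=fixed_now
)

# With boto3 installed and credentials configured, a query could be sent as:
#   import boto3
#   boto3.client("cloudwatch").get_metric_statistics(**count_query)
```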

For more information, see the Amazon CloudWatch User Guide.