glue.driver.aggregate.bytesRead

The number of bytes read from all data sources by all completed Spark tasks running in all executors.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.
Unit: Bytes
Can be used in the same way as the glue.ALL.s3.filesystem.read_bytes metric, with the difference that this metric is updated at the end of a Spark task and also captures non-S3 data sources.

glue.driver.aggregate.elapsedTime

The ETL elapsed time in milliseconds (does not include the job bootstrap times).
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.
Unit: Milliseconds
Can be used to determine how long a job run takes on average (see the example after this entry).

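For example, a minimal boto3 sketch of retrieving this metric; the job name is a placeholder and publishing to the Glue CloudWatch namespace is an assumption:

```python
# A sketch only: the job name is a placeholder and the "Glue" namespace is an
# assumption. Sums the elapsedTime deltas reported over the last day so the
# total can be averaged across that day's job runs.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()

response = cloudwatch.get_metric_statistics(
    Namespace="Glue",                                   # assumed metric namespace
    MetricName="glue.driver.aggregate.elapsedTime",
    Dimensions=[
        {"Name": "JobName", "Value": "my-etl-job"},     # placeholder job name
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"},
    ],
    StartTime=now - datetime.timedelta(days=1),
    EndTime=now,
    Period=3600,
    Statistics=["Sum"],                                 # SUM because the metric is a delta value
)

total_ms = sum(point["Sum"] for point in response["Datapoints"])
print(f"ETL time reported over the last day: {total_ms / 1000:.0f} s")
```

Dividing the total by the number of job runs in the same window gives an average run time.
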
glue.driver.aggregate.numCompletedStages

The number of completed stages in the job.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.
Unit: Count

glue.driver.aggregate.numCompletedTasks

The number of completed tasks in the job.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.
Unit: Count

glue.driver.aggregate.numFailedTasks

The number of failed tasks.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.
Unit: Count
Can be used to monitor: data, cluster, or script abnormalities that cause job tasks to fail.
Some ways to use the data: set alarms for increased failures that might suggest abnormalities in the data, cluster, or script (see the example after this entry).

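A minimal boto3 sketch of such an alarm; the job name, alarm name, and SNS topic ARN are placeholders, and the Glue namespace is an assumption:

```python
# A sketch only: placeholder job name and SNS topic ARN. Alarms when any task
# failures are reported for the job within a five-minute period.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="my-etl-job-failed-tasks",                # placeholder alarm name
    Namespace="Glue",                                   # assumed metric namespace
    MetricName="glue.driver.aggregate.numFailedTasks",
    Dimensions=[
        {"Name": "JobName", "Value": "my-etl-job"},     # placeholder job name
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",                    # no data points means no failures
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:glue-alerts"],  # placeholder topic
)
```
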
glue.driver.aggregate.numKilledTasks

The number of tasks killed.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.
Unit: Count
Some ways to use the data: set alarms for an increase in killed tasks, which might indicate data, cluster, or script abnormalities.

glue.driver.aggregate.recordsRead

The number of records read from all data sources by all completed Spark tasks running in all executors.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.
Unit: Count
Can be used in a similar way to the glue.ALL.s3.filesystem.read_bytes metric, with the difference that this metric is updated at the end of a Spark task.

glue.driver.aggregate.shuffleBytesWritten

The number of bytes written by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written for this purpose during the previous minute).
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.
Unit: Bytes
Can be used to monitor: data shuffle in jobs (large joins, groupBy, repartition, coalesce).
Some ways to use the data:
• Repartition or decompress large input files before further processing.
• Repartition data more uniformly to avoid hot keys.
• Pre-filter data before joins or groupBy operations (see the example after this entry).

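As an illustration, a minimal PySpark sketch of pre-filtering and repartitioning ahead of a shuffle-heavy join; the S3 paths, column names, and partition count are placeholders:

```python
# A sketch only: the S3 paths, column names, and partition count are
# placeholders. Pre-filters and repartitions on the join key so that less
# data is shuffled and the shuffle is spread more evenly across executors.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("s3://my-bucket/orders/")          # placeholder path
customers = spark.read.parquet("s3://my-bucket/customers/")    # placeholder path

# Pre-filter before the join so that only the needed rows are shuffled.
recent_orders = orders.where(orders.order_date >= "2023-01-01")

# Repartition on the join key to spread the shuffle more uniformly.
joined = (
    recent_orders.repartition(200, "customer_id")
    .join(customers, "customer_id")
)

joined.write.mode("overwrite").parquet("s3://my-bucket/orders_enriched/")
```
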
glue.driver.aggregate.shuffleLocalBytesRead

The number of bytes read by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read for this purpose during the previous minute).
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.
Unit: Bytes
Can be used to monitor: data shuffle in jobs (large joins, groupBy, repartition, coalesce).
Some ways to use the data:
• Repartition or decompress large input files before further processing.
• Repartition data more uniformly to avoid hot keys.
• Pre-filter data before joins or groupBy operations.

glue.driver.BlockManager.disk.diskSpaceUsed_MB

The number of megabytes of disk space used across all executors.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).
Valid Statistics: Average. This is a Spark metric, reported as an absolute value.
Unit: Megabytes
Can be used to monitor: disk space used for blocks that represent cached RDD partitions, intermediate shuffle outputs, or broadcasts.
Some ways to use the data:
• Identify job failures due to increased disk usage.
• Identify large partitions that result in spilling or shuffling.
• Increase provisioned DPU capacity to correct these issues.

glue.driver.ExecutorAllocationManager.executors.numberAllExecutors

The number of actively running job executors.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).
Valid Statistics: Average. This is a Spark metric, reported as an absolute value.
Unit: Count
Some ways to use the data:
• Repartition or decompress large input files beforehand if the cluster is underutilized.
• Identify stage or job execution delays due to straggler scenarios.
• Compare with numberMaxNeededExecutors to understand the backlog for provisioning more DPUs.

glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors

The maximum number of (actively running and pending) job executors needed to satisfy the current load.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).
Valid Statistics: Maximum. This is a Spark metric, reported as an absolute value.
Unit: Count
Some ways to use the data:
• Identify a backlog in the scheduling queue.
• Identify stage or job execution delays due to straggler scenarios.
• Compare with numberAllExecutors to understand the backlog for provisioning more DPUs (see the example after this entry).
• Increase provisioned DPU capacity to correct the pending executor backlog.

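A minimal boto3 sketch that quantifies that backlog with a CloudWatch metric math expression; the job name and one-minute period are placeholders, and the Glue namespace is an assumption:

```python
# A sketch only: placeholder job name. Computes needed-minus-running executors
# per minute over the last three hours.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()


def glue_metric(metric_name):
    # Helper that builds a metric definition; the "Glue" namespace is assumed.
    return {
        "Namespace": "Glue",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "JobName", "Value": "my-etl-job"},   # placeholder job name
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "gauge"},
        ],
    }


response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "needed",
            "MetricStat": {
                "Metric": glue_metric(
                    "glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors"),
                "Period": 60,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        },
        {
            "Id": "running",
            "MetricStat": {
                "Metric": glue_metric(
                    "glue.driver.ExecutorAllocationManager.executors.numberAllExecutors"),
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {"Id": "backlog", "Expression": "needed - running", "Label": "Executor backlog"},
    ],
    StartTime=now - datetime.timedelta(hours=3),
    EndTime=now,
)

backlog = response["MetricDataResults"][0]
print(list(zip(backlog["Timestamps"], backlog["Values"])))
```

A sustained positive backlog suggests the job run could use more provisioned DPU capacity.
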
glue.driver.jvm.heap.usage
glue.executorId.jvm.heap.usage
glue.ALL.jvm.heap.usage

The fraction of memory used by the JVM heap (scale: 0-1) for the driver, an executor identified by executorId, or ALL executors.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).
Valid Statistics: Average. This is a Spark metric, reported as an absolute value.
Unit: Percentage
Some ways to use the data:
• Identify memory-consuming executor IDs and stages.
• Identify straggling executor IDs and stages.
• Identify a driver out-of-memory condition (OOM).
• Identify an executor out-of-memory condition (OOM) and obtain the corresponding executor ID so that a stack trace can be retrieved from the executor log (see the example after this entry).
• Identify files or partitions that may have data skew, resulting in stragglers or out-of-memory conditions (OOMs).

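A minimal boto3 sketch, under the same assumed namespace and with a placeholder job name, that lists the per-executor heap usage metrics for a job and reports the executors whose heap was fullest over the last hour:

```python
# A sketch only: placeholder job name, assumed "Glue" namespace. Finds the
# jvm.heap.usage metrics published for the job (driver, each executorId, ALL)
# and prints the five with the highest recent average heap usage.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()

heap_metrics = []
for page in cloudwatch.get_paginator("list_metrics").paginate(
        Namespace="Glue",
        Dimensions=[{"Name": "JobName", "Value": "my-etl-job"}]):  # placeholder job name
    heap_metrics += [m for m in page["Metrics"]
                     if m["MetricName"].endswith(".jvm.heap.usage")]

usage_by_metric = {}
for metric in heap_metrics:
    stats = cloudwatch.get_metric_statistics(
        Namespace="Glue",
        MetricName=metric["MetricName"],
        Dimensions=metric["Dimensions"],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    if stats["Datapoints"]:
        usage_by_metric[metric["MetricName"]] = max(
            point["Average"] for point in stats["Datapoints"])

for name, usage in sorted(usage_by_metric.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{name}: {usage:.2%} of heap")
```
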
glue.driver.jvm.heap.used
glue.executorId.jvm.heap.used
glue.ALL.jvm.heap.used

The number of memory bytes used by the JVM heap for the driver, the executor identified by executorId, or ALL executors.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).
Valid Statistics: Average. This is a Spark metric, reported as an absolute value.
Unit: Bytes
Some ways to use the data:
• Identify memory-consuming executor IDs and stages.
• Identify straggling executor IDs and stages.
• Identify a driver out-of-memory condition (OOM).
• Identify an executor out-of-memory condition (OOM) and obtain the corresponding executor ID so that a stack trace can be retrieved from the executor log.
• Identify files or partitions that may have data skew, resulting in stragglers or out-of-memory conditions (OOMs).

glue.driver.s3.filesystem.read_bytes
glue.executorId.s3.filesystem.read_bytes
glue.ALL.s3.filesystem.read_bytes

The number of bytes read from Amazon S3 by the driver, an executor identified by executorId, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read during the previous minute).
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes read by two different job runs (see the example after this entry).
Unit: Bytes
Can be used to monitor:
• ETL data movement.
• Job progress.
• Job bookmark issues (data processed, reprocessed, and skipped).
• Comparison of reads to the ingestion rate from external data sources.
• Variance across job runs.

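A minimal boto3 sketch of that comparison; the job name and the two JobRunId values are placeholders, and the Glue namespace is an assumption:

```python
# A sketch only: placeholder job name and job run IDs. Totals the S3 bytes
# read by two job runs over the last day so they can be compared side by side.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()


def run_query(query_id, job_run_id):
    # One query per job run; the day-long period must cover both runs.
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "Glue",                             # assumed metric namespace
                "MetricName": "glue.ALL.s3.filesystem.read_bytes",
                "Dimensions": [
                    {"Name": "JobName", "Value": "my-etl-job"},  # placeholder job name
                    {"Name": "JobRunId", "Value": job_run_id},
                    {"Name": "Type", "Value": "gauge"},
                ],
            },
            "Period": 86400,
            "Stat": "Sum",
        },
    }


response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        run_query("run_a", "jr_run_a_placeholder"),
        run_query("run_b", "jr_run_b_placeholder"),
    ],
    StartTime=now - datetime.timedelta(days=1),
    EndTime=now,
)

for result in response["MetricDataResults"]:
    print(result["Id"], sum(result["Values"]), "bytes read")
```
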
glue.driver.s3.filesystem.write_bytes
glue.executorId.s3.filesystem.write_bytes
glue.ALL.s3.filesystem.write_bytes

The number of bytes written to Amazon S3 by the driver, an executor identified by executorId, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written during the previous minute).
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).
Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes written by two different job runs.
Unit: Bytes
Can be used to monitor:
• ETL data movement.
• Job progress.
• Job bookmark issues (data processed, reprocessed, and skipped).
• Comparison of reads to the ingestion rate from external data sources.
• Variance across job runs.

glue.driver.streaming.numRecords

The number of records received in a micro-batch. This metric is available only for AWS Glue streaming jobs with AWS Glue version 2.0 and above.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: Sum, Maximum, Minimum, Average, Percentile.
Unit: Count
Can be used to monitor: records read and job progress.

glue.driver.streaming.batchProcessingTimeInMs

The time it takes to process the batches, in milliseconds. This metric is available only for AWS Glue streaming jobs with AWS Glue version 2.0 and above.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (count).
Valid Statistics: Sum, Maximum, Minimum, Average, Percentile.
Unit: Count
Can be used to monitor: job progress and script performance.

glue.driver.system.cpuSystemLoad
glue.executorId.system.cpuSystemLoad
glue.ALL.system.cpuSystemLoad

The fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by executorId, or ALL executors.
Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), and Type (gauge).
Valid Statistics: Average. This metric is reported as an absolute value.
Unit: Percentage
Some ways to use the data:
• DPU capacity planning, along with I/O metrics (bytes read, shuffle bytes, task parallelism) and the maximum needed executors metric.
• Identify the CPU-bound versus I/O-bound ratio. This allows repartitioning and increasing provisioned capacity for long-running jobs with splittable datasets that have low CPU utilization.