
List of Debugger built-in profiler rules

Use the built-in profiler rules provided by Amazon SageMaker Debugger to analyze metrics collected while training your models. The built-in rules monitor common conditions that are critical for running a performant training job. You can call the built-in profiler rules using the Amazon SageMaker Python SDK or the low-level SageMaker API operations. There's no additional cost for using the built-in rules. For more information about billing, see the Amazon SageMaker Pricing page.

Note

The maximum number of built-in profiler rules that you can attach to a training job is 20. SageMaker Debugger fully manages the built-in rules and analyzes your training job synchronously.

Important

To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.


import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)

Profiler rules

The following rules are the Debugger built-in rules that are callable using the ProfilerRule.sagemaker classmethod.

Debugger built-in rule for generating the profiling report

Scope of Validity	Built-in Rules
Profiling Report for any SageMaker training job	`ProfilerReport`

Debugger built-in rules for profiling hardware system resource utilization (system metrics)

Scope of Validity	Built-in Rules
Generic system monitoring rules for any SageMaker training job	`BatchSize` `CPUBottleneck` `GPUMemoryIncrease` `IOBottleneck` `LoadBalancing` `LowGPUUtilization` `OverallSystemUsage`

Debugger built-in rules for profiling framework metrics

Scope of Validity	Built-in Rules
Profiling rules for deep learning frameworks (TensorFlow and PyTorch)	`MaxInitializationTime` `OverallFrameworkMetrics` `StepOutlier`

Warning

In favor of Amazon SageMaker Profiler, SageMaker AI Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.

SageMaker Python SDK <= v2.130.0
PyTorch >= v1.6.0, < v2.0
TensorFlow >= v2.3.1, < v2.11
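
As a quick way to check an environment against the bounds listed above, the following Python sketch compares version strings as integer tuples. The function names are hypothetical and not part of any SDK, and the parser assumes plain dotted numeric versions (no `rc` or `dev` suffixes).

```python
def parse_version(text):
    """Turn '2.10' or '2.10.1' into a three-part tuple of ints for comparison."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad '2.0' to (2, 0, 0)
    return tuple(parts)


# Bounds copied from the list above: (inclusive lower bound, exclusive upper bound).
FRAMEWORK_BOUNDS = {
    "pytorch": ("1.6.0", "2.0"),
    "tensorflow": ("2.3.1", "2.11"),
}
MAX_SDK_VERSION = "2.130.0"  # inclusive upper bound for the SageMaker Python SDK


def framework_profiling_supported(framework, framework_version, sdk_version):
    """Return True if both versions fall inside the ranges listed above."""
    lower, upper = FRAMEWORK_BOUNDS[framework]
    version = parse_version(framework_version)
    in_range = parse_version(lower) <= version < parse_version(upper)
    sdk_ok = parse_version(sdk_version) <= parse_version(MAX_SDK_VERSION)
    return in_range and sdk_ok


print(framework_profiling_supported("pytorch", "1.13.1", "2.130.0"))
print(framework_profiling_supported("tensorflow", "2.11", "2.120.0"))
```

The first call falls inside both ranges; the second uses a TensorFlow version at the excluded upper bound.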

ProfilerReport

The ProfilerReport rule invokes all of the built-in rules for monitoring and profiling. It creates a profiling report and updates it when the individual rules are triggered. You can download a comprehensive profiling report while a training job is running or after the training job is complete. You can adjust the rule parameter values to customize the sensitivity of the built-in monitoring and profiling rules. The following example code shows the basic format to adjust the built-in rule parameters through the ProfilerReport rule.


rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            <BuiltInRuleName>_<parameter_name> = value
        )
    )  
]

If you trigger this ProfilerReport rule without any customized parameter as shown in the following example code, then the ProfilerReport rule triggers all of the built-in rules for monitoring and profiling with their default parameter values.


rules=[ProfilerRule.sagemaker(rule_configs.ProfilerReport())]

The following example code shows how to specify and adjust the CPUBottleneck rule's cpu_threshold parameter and the IOBottleneck rule's threshold parameter.


rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            CPUBottleneck_cpu_threshold = 90,
            IOBottleneck_threshold = 90
        )
    )  
]
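
The `<BuiltInRuleName>_<parameter_name>` naming convention can also be generated programmatically when there are many overrides. The following is a minimal sketch; `profiler_report_kwargs` is a hypothetical helper, not part of the SageMaker Python SDK, and it only builds the keyword names without validating them.

```python
def profiler_report_kwargs(overrides):
    """Flatten {rule_name: {parameter_name: value}} into ProfilerReport keyword names."""
    kwargs = {}
    for rule_name, params in overrides.items():
        for parameter_name, value in params.items():
            kwargs[f"{rule_name}_{parameter_name}"] = value
    return kwargs


overrides = {
    "CPUBottleneck": {"cpu_threshold": 90},
    "IOBottleneck": {"threshold": 90},
}
kwargs = profiler_report_kwargs(overrides)
print(kwargs)
```

Assuming `rule_configs` is imported from the SageMaker Python SDK, the resulting dictionary could then be unpacked as `rule_configs.ProfilerReport(**kwargs)`, which is equivalent to the keyword form shown in the example above.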

To explore what's in the profiler report, see SageMaker Debugger Profiling Report. Because this rule activates all of the profiling rules, you can also check the rule analysis status using the SageMaker Debugger UI in SageMaker Studio Experiments.

Parameter Descriptions for the ProfilerReport Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`<BuiltInRuleName>_<parameter_name>`	Customizable parameter to adjust thresholds of other built-in monitoring and profiling rules. Optional Default value: `None`

BatchSize

The BatchSize rule helps detect if the GPU is underutilized due to a small batch size. To detect this issue, the rule monitors the average CPU utilization, GPU utilization, and GPU memory utilization. If utilization on CPU, GPU, and GPU memory is low on average, it may indicate that the training job can either run on a smaller instance type or run with a bigger batch size. However, increasing the batch size can lead to processing or data loading bottlenecks because more data preprocessing time is required in each iteration. This analysis does not work for frameworks that heavily overallocate memory.

Parameter Descriptions for the BatchSize Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`cpu_threshold_p95`	Defines the threshold for 95th quantile of CPU utilization in percentage. Optional Valid values: Integer Default value: `70` (in percentage)
`gpu_threshold_p95`	Defines the threshold for 95th quantile of GPU utilization in percentage. Optional Valid values: Integer Default value: `70` (in percentage)
`gpu_memory_threshold_p95`	Defines the threshold for 95th quantile of GPU memory utilization in percentage. Optional Valid values: Integer Default value: `70` (in percentage)
`patience`	Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too early. Optional Valid values: Integer Default value: `100`
`window`	Window size for computing quantiles. Optional Valid values: Integer Default value: `500`
`scan_interval_us`	Time interval at which timeline files are scanned. Optional Valid values: Integer Default value: `60000000` (in microseconds)
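
To make the interaction of these parameters concrete, the following Python sketch evaluates three utilization series against the p95 thresholds. It is an illustration only and not the Debugger implementation: the function names, the nearest-rank percentile, and the way `patience` and `window` are applied are all assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def batch_size_rule(cpu, gpu, gpu_memory, cpu_threshold_p95=70,
                    gpu_threshold_p95=70, gpu_memory_threshold_p95=70,
                    patience=100, window=500):
    """Return True when the p95 of all three metrics is below its threshold.

    The three lists are assumed to hold utilization percentages sampled at
    the same points in time, so they have equal length.
    """
    if len(cpu) <= patience:
        return False  # still inside the warm-up period
    # Evaluate only the most recent `window` samples that follow the warm-up.
    recent = slice(max(patience, len(cpu) - window), None)
    return (
        percentile(cpu[recent], 95) < cpu_threshold_p95
        and percentile(gpu[recent], 95) < gpu_threshold_p95
        and percentile(gpu_memory[recent], 95) < gpu_memory_threshold_p95
    )


# 300 samples with uniformly low utilization: the sketch flags a small batch size.
print(batch_size_rule([30] * 300, [40] * 300, [25] * 300))
```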

CPUBottleneck

The CPUBottleneck rule helps detect if the GPU is underutilized due to CPU bottlenecks. The rule returns True if the number of CPU bottlenecks exceeds a predefined threshold.

Parameter Descriptions for the CPUBottleneck Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`threshold`	Defines the threshold for proportion of bottlenecked time to the total training time. If the proportion exceeds the percentage specified to the threshold parameter, the rule switches the rule status to True. Optional Valid values: Integer Default value: `50` (in percentage)
`gpu_threshold`	A threshold that defines low GPU utilization. Optional Valid values: Integer Default value: `10` (in percentage)
`cpu_threshold`	A threshold that defines high CPU utilization. Optional Valid values: Integer Default value: `90` (in percentage)
`patience`	Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too early. Optional Valid values: Integer Default value: `100`
`scan_interval_us`	Time interval at which timeline files are scanned. Optional Valid values: Integer Default value: `60000000` (in microseconds)
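
The relationship between `cpu_threshold`, `gpu_threshold`, and `threshold` can be sketched as follows. This is a simplified illustration rather than the Debugger implementation: it assumes evenly spaced samples, so the share of bottlenecked data points stands in for the share of bottlenecked training time, and the function name is hypothetical.

```python
def cpu_bottleneck_rule(samples, threshold=50, gpu_threshold=10,
                        cpu_threshold=90, patience=100):
    """samples: list of (cpu_percent, gpu_percent) pairs in time order."""
    evaluated = samples[patience:]  # skip the warm-up data points
    if not evaluated:
        return False
    # A data point counts as bottlenecked when the CPU is busy while the GPU idles.
    bottlenecked = sum(
        1 for cpu, gpu in evaluated
        if cpu > cpu_threshold and gpu < gpu_threshold
    )
    return 100 * bottlenecked / len(evaluated) > threshold


warm_up = [(50, 80)] * 100
busy_cpu_idle_gpu = [(95, 5)] * 60
healthy = [(50, 80)] * 40
# 60 of the 100 evaluated points are bottlenecked, which exceeds the 50 percent default.
print(cpu_bottleneck_rule(warm_up + busy_cpu_idle_gpu + healthy))
```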

GPUMemoryIncrease

The GPUMemoryIncrease rule helps detect a large increase in memory usage on GPUs.

Parameter Descriptions for the GPUMemoryIncrease Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`increase`	Defines the threshold for absolute memory increase. Optional Valid values: Integer Default value: `10` (in percentage)
`patience`	Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too early. Optional Valid values: Integer Default value: `100`
`window`	Window size for computing quantiles. Optional Valid values: Integer Default value: `500`
`scan_interval_us`	Time interval at which timeline files are scanned. Optional Valid values: Integer Default value: `60000000` (in microseconds)

IOBottleneck

The IOBottleneck rule helps detect if the GPU is underutilized due to data IO bottlenecks. The rule returns True if the number of IO bottlenecks exceeds a predefined threshold.

Parameter Descriptions for the IOBottleneck Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`threshold`	Defines the threshold at which the rule returns True. Optional Valid values: Integer Default value: `50` (in percentage)
`gpu_threshold`	A threshold that defines when GPU is considered underutilized. Optional Valid values: Integer Default value: `70` (in percentage)
`io_threshold`	A threshold that defines high IO wait time. Optional Valid values: Integer Default value: `50` (in percentage)
`patience`	Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too early. Optional Valid values: Integer Default value: `1000`
`scan_interval_us`	Time interval at which timeline files are scanned. Optional Valid values: Integer Default value: `60000000` (in microseconds)

LoadBalancing

The LoadBalancing rule helps detect issues in workload balancing among multiple GPUs.

Parameter Descriptions for the LoadBalancing Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`threshold`	Defines the workload threshold as a proportion. Optional Valid values: Float Default value: `0.5` (unitless proportion)
`patience`	Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too early. Optional Valid values: Integer Default value: `10`
`scan_interval_us`	Time interval at which timeline files are scanned. Optional Valid values: Integer Default value: `60000000` (in microseconds)

LowGPUUtilization

The LowGPUUtilization rule helps detect if GPU utilization is low or suffers from fluctuations. This is checked for each GPU on each worker. The rule returns True if the 95th quantile is below threshold_p95, which indicates underutilization. The rule also returns True if the 95th quantile is above threshold_p95 and the 5th quantile is below threshold_p5, which indicates fluctuations.

Parameter Descriptions for the LowGPUUtilization Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`threshold_p95`	A threshold for 95th quantile below which GPU is considered to be underutilized. Optional Valid values: Integer Default value: `70` (in percentage)
`threshold_p5`	A threshold for 5th quantile below which GPU utilization is considered to fluctuate, when the 95th quantile is above `threshold_p95`. Optional Valid values: Integer Default value: `10` (in percentage)
`patience`	Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too early. Optional Valid values: Integer Default value: `1000`
`window`	Window size for computing quantiles. Optional Valid values: Integer Default value: `500`
`scan_interval_us`	Time interval at which timeline files are scanned. Optional Valid values: Integer Default value: `60000000` (in microseconds)
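
The two conditions described for this rule can be sketched for a single GPU as follows. This is an illustration of the decision logic only, not the Debugger implementation: the function name and return labels are hypothetical, `patience` and `window` are left out for brevity, and the quantiles come from Python's `statistics.quantiles`, which may interpolate differently from Debugger.

```python
from statistics import quantiles


def low_gpu_utilization(gpu_percent, threshold_p95=70, threshold_p5=10):
    """Classify one GPU's utilization samples (percentages)."""
    if len(gpu_percent) < 2:
        return "insufficient data"
    cuts = quantiles(gpu_percent, n=20)  # 19 cut points in 5 percent steps
    p5, p95 = cuts[0], cuts[-1]
    if p95 < threshold_p95:
        return "underutilized"  # even the busiest samples stay low
    if p95 > threshold_p95 and p5 < threshold_p5:
        return "fluctuating"    # high peaks combined with near-idle troughs
    return "ok"


print(low_gpu_utilization([30] * 100))
print(low_gpu_utilization([5] * 20 + [90] * 80))
print(low_gpu_utilization([85] * 100))
```

In this sketch, either of the first two labels corresponds to the rule returning True.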

OverallSystemUsage

The OverallSystemUsage rule measures overall system usage per worker node. The rule currently only aggregates values per node and computes their percentiles.

Parameter Descriptions for the OverallSystemUsage Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`scan_interval_us`	Time interval to scan timeline files. Optional Valid values: Integer Default value: `60000000` (in microseconds)
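
The per-node aggregation that this rule performs can be pictured with a short Python sketch. The input layout, the function name, and the choice of reported percentiles are assumptions made for illustration; the actual report may summarize different statistics.

```python
from statistics import quantiles


def summarize_usage(samples):
    """samples: iterable of (node_id, metric_name, value) tuples."""
    grouped = {}
    for node_id, metric, value in samples:
        grouped.setdefault((node_id, metric), []).append(value)

    summary = {}
    for key, values in grouped.items():
        # 99 cut points; cuts[k - 1] approximates the k-th percentile.
        cuts = quantiles(values, n=100, method="inclusive")
        summary[key] = {
            "p50": cuts[49],
            "p95": cuts[94],
            "p99": cuts[98],
            "max": max(values),
        }
    return summary


samples = [("algo-1", "cpu", v) for v in [10, 20, 30, 40, 50]]
samples += [("algo-2", "cpu", 60)] * 3
summary = summarize_usage(samples)
print(summary[("algo-1", "cpu")])
```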

MaxInitializationTime

The MaxInitializationTime rule helps detect if the training initialization is taking too much time. The rule waits until the first step is available.

Parameter Descriptions for the MaxInitializationTime Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`threshold`	Defines the threshold in minutes to wait for the first step to become available. Optional Valid values: Integer Default value: `20` (in minutes)
`scan_interval_us`	Time interval with which timeline files are scanned. Optional Valid values: Integer Default value: `60000000` (in microseconds)
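
The `threshold` comparison amounts to a simple time difference. The following Python sketch shows the idea, assuming the job start time and the start time of the first step are both known; the function and argument names are hypothetical.

```python
from datetime import datetime, timedelta


def max_initialization_time_rule(job_start, first_step_start, threshold=20):
    """Return True if the first step began more than `threshold` minutes after job start."""
    return first_step_start - job_start > timedelta(minutes=threshold)


job_start = datetime(2024, 1, 1, 12, 0, 0)
slow_first_step = datetime(2024, 1, 1, 12, 25, 0)  # 25 minutes after job start
fast_first_step = datetime(2024, 1, 1, 12, 5, 0)   # 5 minutes after job start

print(max_initialization_time_rule(job_start, slow_first_step))
print(max_initialization_time_rule(job_start, fast_first_step))
```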

OverallFrameworkMetrics

The OverallFrameworkMetrics rule summarizes the time spent on framework metrics, such as forward and backward pass, and data loading.

Parameter Descriptions for the OverallFrameworkMetrics Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`scan_interval_us`	Time interval to scan timeline files. Optional Valid values: Integer Default value: `60000000` (in microseconds)

StepOutlier

The StepOutlier rule helps detect outliers in step durations. This rule returns True if there are outliers whose step durations exceed stddev standard deviations of all step durations in a time range.

Parameter Descriptions for the StepOutlier Rule

Parameter Name	Description
`base_trial`	The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String
`stddev`	Defines a factor by which to multiply the standard deviation. For example, the rule is invoked by default when a step duration is larger or smaller than 5 times the standard deviation. Optional Valid values: Integer Default value: `5`
`mode`	Mode under which steps have been saved and on which the rule should run. By default, the rule runs on steps from the EVAL and TRAIN phases. Optional
`n_outliers`	How many outliers to ignore before the rule returns True. Optional Valid values: Integer Default value: `10`
`scan_interval_us`	Time interval with which timeline files are scanned. Optional Valid values: Integer Default value: `60000000` (in microseconds)
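
The way `stddev` and `n_outliers` combine can be sketched in Python as follows. This is a minimal illustration, not the Debugger implementation: it assumes the cutoff is the mean plus `stddev` population standard deviations, considers only long steps, and ignores `mode`. The function name is hypothetical.

```python
from statistics import mean, pstdev


def step_outlier_rule(step_durations, stddev=5, n_outliers=10):
    """Return True when more than `n_outliers` steps exceed mean + stddev * sigma."""
    if len(step_durations) < 2:
        return False
    limit = mean(step_durations) + stddev * pstdev(step_durations)
    outliers = [d for d in step_durations if d > limit]
    return len(outliers) > n_outliers


# 1000 regular steps of 1.0 s plus 11 slow steps of 10.0 s.
durations = [1.0] * 1000 + [10.0] * 11
print(step_outlier_rule(durations))
```

Because outliers inflate the standard deviation themselves, a high `stddev` factor can only be exceeded when outliers are a small share of all steps, which is why the example uses a long run of regular steps.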
