List of Debugger Built-in Rules
Use the Debugger built-in rules provided by Amazon SageMaker Debugger and analyze metrics and tensors collected
while training your models. The Debugger built-in rules monitor various common conditions that
are critical for the success of a training job. You can call the built-in rules using
Amazon SageMaker Python SDK
Note
The maximum numbers of built-in rules that you can attach to a training job are 20 for
ProfilerRule
and 20 for Rule
. SageMaker Debugger fully manages
the built-in rules and analyzes your training job synchronously.
Important
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug client library. In your iPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.
import sys import IPython !{sys.executable} -m pip install -U sagemaker smdebug IPython.Application.instance().kernel.do_shutdown(True)
Debugger ProfilerRule
The following rules are the Debugger built-in rules that are callable using the
ProfilerRule.sagemaker
classmethod.
Debugger built-in rule for generating the
profiling report | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Scope of Validity | Built-in Rules | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Profiling Report for any SageMaker training job |
Debugger built-in rules for profiling hardware
system resource utilization (system metrics) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Scope of Validity | Built-in Rules | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Generic system monitoring rules for any SageMaker training job |
Debugger built-in rules for profiling framework
metrics | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Scope of Validity | Built-in Rules | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Profiling rules for deep learning frameworks (TensorFlow and PyTorch) |
Warning
SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.
-
SageMaker Python SDK <= v2.130.0
-
PyTorch >= v1.6.0, < v2.0
-
TensorFlow >= v2.3.1, < v2.11
With the deprecation, SageMaker Debugger also discontinues support for the three
ProfilerRules
for framework profiling. See also Amazon SageMaker Debugger Release Notes: March
16, 2023.
Debugger Rule
The following rules are the Debugger built-in rules that are callable using the
Rule.sagemaker
classmethod.
Debugger built-in rules for generating training reports | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Scope of Validity | Built-in Rules | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Training Report for SageMaker XGboost training job |
Debugger built-in rules for debugging model
training data (output tensors) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Scope of Validity | Built-in Rules | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Deep learning frameworks (TensorFlow, MXNet, and PyTorch) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Deep learning frameworks (TensorFlow, MXNet, and PyTorch) and the XGBoost algorithm | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Deep learning applications | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
XGBoost algorithm |
To use the built-in rules with default parameter values – use the following configuration format:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs rules = [ ProfilerRule.sagemaker(rule_configs.
BuiltInRuleName_1
()), ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_2
()), ... ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_n
()), Rule.sagemaker(rule_configs.built_in_rule_name_1
()), Rule.sagemaker(rule_configs.built_in_rule_name_2
()), ... Rule.sagemaker(rule_configs.built_in_rule_name_n
()) ]
To use the built-in rules with customizing the parameter values – use the following configuration format:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs rules = [ ProfilerRule.sagemaker( base_config=rule_configs.
BuiltInRuleName
(), rule_parameters={ "key
": "value
" } ) Rule.sagemaker( base_config=rule_configs.built_in_rule_name
(), rule_parameters={ "key
": "value
" } collections_to_save=[ CollectionConfig( name="tensor_collection_name
", parameters={ "key
": "value
" } ) ] ) ]
To find available keys for the rule_parameters
parameter, see the parameter description tables.
Sample rule configuration codes are provided for each built-in rule below the parameter description tables.
-
For a full instruction and examples of using the Debugger built-in rules, see Debugger Built-in Rules Example Code.
-
For a full instruction on using the built-in rules with the low-level SageMaker API operations, see Configure Debugger Using Amazon SageMaker API.
ProfilerReport
The ProfilerReport rule invokes all of the built-in rules for monitoring and profiling. It creates a profiling report and updates when the individual rules are triggered. You can download a comprehensive profiling report while a training job is running or after the training job is complete. You can adjust the rule parameter values to customize sensitivity of the built-in monitoring and profiling rules. The following example code shows the basic format to adjust the built-in rule parameters through the ProfilerReport rule.
rules=[ ProfilerRule.sagemaker( rule_configs.
ProfilerReport
(<BuiltInRuleName>
_<parameter_name>
=value
) ) ]
If you trigger this ProfilerReport rule without any customized parameter as shown in the following example code, then the ProfilerReport rule triggers all of the built-in rules for monitoring and profiling with their default parameter values.
rules=[ProfilerRule.sagemaker(rule_configs.ProfilerReport())]
The following example code shows how to specify and adjust the CPUBottleneck rule's
cpu_threshold
parameter and the IOBottleneck rule's
threshold
parameter.
rules=[ ProfilerRule.sagemaker( rule_configs.ProfilerReport(
CPUBottleneck_cpu_threshold
=90
,IOBottleneck_threshold
=90
) ) ]
To explore what's in the profiler report, see SageMaker Debugger Profiling Report. Also, because this rule activates all of the profiling rules, you can also check the rule analysis status using the SageMaker Debugger UI in SageMaker Studio Experiments.
Parameter Descriptions for the OverallSystemUsage Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<BuiltInRuleName>_<parameter_name> |
Customizable parameter to adjust thresholds of other built-in monitoring and profiling rules. Optional Default value: |
BatchSize
The BatchSize rule helps detect if GPU is underutilized due to a small batch size. To detect this issue, this rule monitors the average CPU utilization, GPU utilization, and GPU memory utilization. If utilization on CPU, GPU, and GPU memory is low on average, it may indicate that the training job can either run on a smaller instance type or can run with a bigger batch size. This analysis does not work for frameworks that heavily overallocate memory. However, increasing the batch size can lead to processing or data loading bottlenecks because more data preprocessing time is required in each iteration.
Parameter Descriptions for the BatchSize Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cpu_threshold_p95 |
Defines the threshold for 95th quantile of CPU utilization in percentage. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gpu_threshold_p95 |
Defines the threshold for 95th quantile of GPU utilization in percentage. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gpu_memory_threshold_p95 |
Defines the threshold for 95th quantile of GPU memory utilization in percentage. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience |
Defines the number of data points to skip until the rule starts evaluation. The first several steps of training jobs usually show high volume of data processes, so keep the rule patient and prevent it from being invoked too soon with a given number of profiling data that you specify with this parameter. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
window |
Window size for computing quantiles. Optional Valid values: Integer Default values:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval that timeline files are scanned. Optional Valid values: Integer Default values: |
CPUBottleneck
The CPUBottleneck rule helps detect if GPU is underutilized due to CPU bottlenecks. Rule returns True if number of CPU bottlenecks exceeds a predefined threshold.
Parameter Descriptions for the CPUBottleneck Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold |
Defines the threshold for proportion of bottlenecked time to the total training time. If the proportion exceeds the percentage specified to the threshold parameter, the rule switches the rule status to True. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gpu_threshold |
A threshold that defines low GPU utilization. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
cpu_threshold |
A threshold that defines high CPU utilization. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience |
Defines the number of data points to skip until the rule starts evaluation. The first several steps of training jobs usually show high volume of data processes, so keep the rule patient and prevent it from being invoked too soon with a given number of profiling data that you specify with this parameter. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval with which timeline files are scanned. Optional Valid values: Integer Default values: |
GPUMemoryIncrease
The GPUMemoryIncrease rule helps detect a large increase in memory usage on GPUs.
Parameter Descriptions for the GPUMemoryIncrease Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
increase |
Defines the threshold for absolute memory increase. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience |
Defines the number of data points to skip until the rule starts evaluation. The first several steps of training jobs usually show high volume of data processes, so keep the rule patient and prevent it from being invoked too soon with a given number of profiling data that you specify with this parameter. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
window |
Window size for computing quantiles. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval that timeline files are scanned. Optional Valid values: Integer Default values: |
IOBottleneck
This rule helps to detect if GPU is underutilized due to data IO bottlenecks. Rule returns True if number of IO bottlenecks exceeds a predefined threshold.
Parameter Descriptions for the IOBottleneck Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold |
Defines the threshold when Rule to return
True. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gpu_threshold |
A threshold that defines when GPU is considered underutilized. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
io_threshold |
A threshold that defines high IO wait time. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience |
Defines the number of data points to skip until the rule starts
evaluation. The first several steps of training jobs usually show high
volume of data processes, so keep the rule patient and prevent it from
being invoked too soon with a given number of profiling data that you
specify with this parameter. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval that timeline files are scanned. Optional Valid values: Integer Default values: |
LoadBalancing
The LoadBalancing rule helps detect issues in workload balancing among multiple GPUs.
Parameter Descriptions for the LoadBalancing Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold |
Defines the workload percentage. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience |
Defines the number of data points to skip until the rule starts evaluation. The first several steps of training jobs usually show high volume of data processes, so keep the rule patient and prevent it from being invoked too soon with a given number of profiling data that you specify with this parameter. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval that timeline files are scanned. Optional Valid values: Integer Default values: |
LowGPUUtilization
The LowGPUUtilization rule helps detect if GPU utilization is low or suffers from fluctuations. This is checked for each GPU on each worker. Rule returns True if 95th quantile is below threshold_p95 which indicates underutilization. Rule returns true if 95th quantile is above threshold_p95 and 5th quantile is below threshold_p5 which indicates fluctuations.
Parameter Descriptions for the LowGPUUtilization Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_p95 |
A threshold for 95th quantile below which GPU is considered to be underutilized. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_p5 |
A threshold for 5th quantile. Default is 10
percent. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience |
Defines the number of data points to skip until the rule starts evaluation. The first several steps of training jobs usually show high volume of data processes, so keep the rule patient and prevent it from being invoked too soon with a given number of profiling data that you specify with this parameter. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
window |
Window size for computing quantiles. Optional Valid values: Integer Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval that timeline files are scanned. Optional Valid values: Integer Default values: |
OverallSystemUsage
The OverallSystemUsage rule measures overall system usage per worker node. The rule currently only aggregates values per node and computes their percentiles.
Parameter Descriptions for the OverallSystemUsage Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval to scan timeline files. Optional Valid values: Integer Default values: |
MaxInitializationTime
The MaxInitializationTime rule helps detect if the training initialization is taking too much time. The rule waits until the first step is available.
Parameter Descriptions for the MaxInitializationTime Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold |
Defines the threshold in minutes to wait for the first step to become available. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval with which timeline files are scanned. Optional Valid values: Integer Default values: |
OverallFrameworkMetrics
The OverallFrameworkMetrics rule summarizes the time spent on framework metrics, such as forward and backward pass, and data loading.
Parameter Descriptions for the OverallFrameworkMetrics Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval to scan timeline files. Optional Valid values: Integer Default values: |
StepOutlier
The StepOutlier rule helps detect outliers in step durations. This rule
returns True
if there are outliers with step durations larger than
stddev
sigmas of the entire step durations in a time
range.
Parameter Descriptions for the StepOutlier Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
stddev |
Defines a factor by which to multiply the standard deviation. For example, the rule is invoked by default when a step duration is larger or smaller than 5 times the standard deviation. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
mode |
Mode under which steps have been saved and on which Rule
should run on. Per default rule will run on steps from EVAL and
TRAIN phase Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
n_outliers |
How many outliers to ignore before rule returns True Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
scan_interval_us |
Time interval with which timeline files are scanned. Optional Valid values: Integer Default values: |
CreateXgboostReport
The CreateXgboostReport rule collects output tensors from an XGBoost training job and autogenerates a comprehensive training report. You can download a comprehensive profiling report while a training job is running or after the training job is complete, and check progress of training or the final result of the training job. The CreateXgboostReport rule collects the following output tensors by default:
-
hyperparameters
– Saves at the first step -
metrics
– Saves loss and accuracy every 5 steps -
feature_importance
– Saves every 5 steps -
predictions
– Saves every 5 steps -
labels
– Saves every 5 steps
Parameter Descriptions for the CreateXgboostReport Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
rules=[ Rule.sagemaker( rule_configs.
create_xgboost_report
() ) ]
DeadRelu
This rule detects when the percentage of rectified linear unit (ReLU) activation
functions in a trial are considered dead because their activation activity has dropped
below a threshold. If the percent of inactive ReLUs in a layer is greater than the
threshold_layer
value of inactive ReLUs, the rule returns
True
.
Parameter Descriptions for the DeadRelu Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: List of strings or a comma-separated string Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_inactivity |
Defines a level of activity below which a ReLU is considered to be
dead. A ReLU might be active in the beginning of a trial and then
slowly die during the training process. If the ReLU is active less
than the Optional Valid values: Float Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_layer |
Returns Returns Optional Valid values: Float Default values: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.dead_relu(), rule_parameters={ "tensor_regex": "
.*relu_output|.*ReLU_output
", "threshold_inactivity": "1.0
", "threshold_layer": "50.0
" }, collections_to_save=[ CollectionConfig( name="custom_relu_collection
", parameters={ "include_regex: ".*relu_output|.*ReLU_output
", "save_interval": "500
" } ) ] ) ]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
ExplodingTensor
This rule detects whether the tensors emitted during training have non-finite values,
either infinite or NaN (not a number). If a non-finite value is detected, the rule
returns True
.
Parameter Descriptions for the ExplodingTensor Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
collection_names |
The list of collection names whose tensors the rule inspects. Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
only_nan |
Optional Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
exploding_tensor
(), rule_parameters={ "tensor_regex": ".*gradient
", "only_nan": "False
" }, collections_to_save=[ CollectionConfig( name="gradients
", parameters={ "save_interval": "500
" } ) ] ) ]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
PoorWeightInitialization
This rule detects if your model parameters have been poorly initialized.
Good initialization breaks the symmetry of the weights and gradients in a neural network and maintains commensurate activation variances across layers. Otherwise, the neural network doesn't learn effectively. Initializers like Xavier aim to keep variance constant across activations, which is especially relevant for training very deep neural nets. Too small an initialization can lead to vanishing gradients. Too large an initialization can lead to exploding gradients. This rule checks the variance of activation inputs across layers, the distribution of gradients, and the loss convergence for the initial steps to determine if a neural network has been poorly initialized.
Parameter Descriptions for the PoorWeightInitialization Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
activation_inputs_regex |
A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold |
If the ratio between minimum and maximum variance of weights per
layer exceeds the Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
distribution_range |
If the minimum difference between 5th and 95th percentiles of the
gradient distribution is less than the
Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience |
The number of steps to wait until the loss is considered to be no longer decreasing. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
steps |
The number of steps this rule analyzes. You typically need to check only the first few iterations. Optional Valid values: Float Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
poor_weight_initialization
(), rule_parameters={ "activation_inputs_regex": ".*relu_input|.*ReLU_input
", "threshold": "10.0
", "distribution_range": "0.001
", "patience": "5
", "steps": "10
" }, collections_to_save=[ CollectionConfig( name="custom_relu_collection
", parameters={ "include_regex
": ".*relu_input|.*ReLU_input
", "save_interval
": "500
" } ) ] ) ]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
SaturatedActivation
This rule detects if the tanh and sigmoid activation layers are becoming saturated. An
activation layer is saturated when the input of the layer is close to the maximum or
minimum of the activation function. The minimum and maximum of the tanh and sigmoid
activation functions are defined by their respective min_threshold
and
max_thresholds
values. If the activity of a node drops below the
threshold_inactivity
percentage, it is considered saturated. If more
than a threshold_layer
percent of the nodes are saturated, the rule returns
True
.
Parameter Descriptions for the SaturatedActivation Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
collection_names |
The list of collection names whose tensors the rule inspects. Optional Valid values: List of strings or a comma-separated string Default value: None |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_tanh_min |
The minimum and maximum thresholds that define the extremes of the
input for a tanh activation function, defined as:
Optional Valid values: Float Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_tanh_max |
The minimum and maximum thresholds that define the extremes of the
input for a tanh activation function, defined as:
Optional Valid values: Float Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_sigmoid_min |
The minimum and maximum thresholds that define the extremes of the
input for a sigmoid activation function, defined as:
Optional Valid values: Float Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_sigmoid_max |
The minimum and maximum thresholds that define the extremes of the
input for a sigmoid activation function, defined as:
Optional Valid values: Float Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_inactivity |
The percentage of inactivity below which the activation layer is considered to be saturated. The activation might be active in the beginning of a trial and then slowly become less active during the training process. Optional Valid values: Float Default values: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_layer |
Returns Returns Optional Valid values: Float Default values: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
saturated_activation
(), rule_parameters={ "tensor_regex": ".*tanh_input|.*sigmoid_input
", "threshold_tanh_min": "-9.4999
", "threshold_tanh_max": "9.4999
", "threshold_sigmoid_min": "-23
", "threshold_sigmoid_max": "16.99999
", "threshold_inactivity": "1.0
", "threshold_layer": "50.0
" }, collections_to_save=[ CollectionConfig( name="custom_activations_collection
", parameters={ "include_regex
": ".*tanh_input|.*sigmoid_input
" "save_interval
": "500
" } ) ] ) ]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
VanishingGradient
This rule detects if the gradients in a trial become extremely small or drop to a zero
magnitude. If the mean of the absolute values of the gradients drops below a specified
threshold
, the rule returns True
.
Parameters Descriptions for the VanishingGradient Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold |
The value at which the gradient is determined to be
vanishing. Optional Valid values: Float Default value:
|
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
vanishing_gradient
(), rule_parameters={ "threshold": "0.0000001
" }, collections_to_save=[ CollectionConfig( name="gradients
", parameters={ "save_interval
": "500
" } ) ] ) ]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
WeightUpdateRatio
This rule keeps track of the ratio of updates to weights during training and detects
if that ratio gets too large or too small. If the ratio of updates to weights is larger
than the large_threshold value
or if this ratio is smaller than
small_threshold
, the rule returns True
.
Conditions for training are best when the updates are commensurate to gradients.
Excessively large updates can push the weights away from optimal values, and very small
updates result in very slow convergence. This rule requires weights to be available for
two training steps, and train.save_interval
needs to be set equal to
num_steps
.
Parameter Descriptions for the WeightUpdateRatio Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name, | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
num_steps |
The number of steps across which the rule checks to determine if the tensor has changed. The number of steps across which you want to compare the weight
ratios. If you pass no value, the rule runs by default against the
current step and the immediately previous saved step. If you
override the default by passing a value for this parameter, the
comparison is done between weights at step Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
large_threshold |
The maximum value that the ratio of updates to weight can take
before the rule returns Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
small_threshold |
The minimum value that the ratio of updates to weight can take,
below which the rule returns Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
epsilon |
A small constant used to ensure that Debugger does not divide by zero when computing the ratio updates to weigh. Optional Valid values: Float Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
weight_update_ratio
(), rule_parameters={ "num_steps": "100
", "large_threshold": "10.0
", "small_threshold": "0.00000001
", "epsilon": "0.000000001
" }, collections_to_save=[ CollectionConfig( name="weights
", parameters={ "train.save_interval
": "100
" } ) ] ) ]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
AllZero
This rule detects if all or a specified percentage of the tensor values are zero.
This rule can be applied either to one of the supported deep learning frameworks
(TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either
the collection_names
or tensor_regex
parameter. If both the
parameters are specified, the rule inspects the union of tensors from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameters Descriptions for the AllZero Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
collection_names |
The list of collection names whose tensors the rule inspects. Optional Valid values: List of strings or a comma-separated string Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: List of strings or a comma-separated string Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold |
Specifies the percentage of values in the tensor that needs to be zero for this rule to be invoked. Optional Valid values: Float Default value: 100 (in percentage) |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
all_zero
(), rule_parameters={ "tensor_regex": ".*
", "threshold": "100
" }, collections_to_save=[ CollectionConfig( name="all
", parameters={ "save_interval
": "500
" } ) ] ) ]
ClassImbalance
This rule measures sampling imbalances between classes and throws errors if the imbalance exceeds a threshold or if too many mispredictions for underrepresented classes occur as a result of the imbalance.
Classification models require well-balanced classes in the training dataset or a proper weighting/sampling of classes during training. The rule performs the following checks:
-
It counts the occurrences per class. If the ratio of number of samples between smallest and largest class is larger than the
threshold_imbalance
, an error is thrown. -
It checks the prediction accuracy per class. If resampling or weighting has not been correctly applied, then the model can reach high accuracy for the class with many training samples, but low accuracy for the classes with few training samples. If a fraction of mispredictions for a certain class is above
threshold_misprediction
, an error is thrown.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the ClassImbalance Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_imbalance |
The acceptable imbalance between the number of samples in the smallest class and in the largest class. Exceeding this threshold value throws an error. Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_misprediction |
A limit on the fraction of mispredictions allowed for each class. Exceeding this threshold throws an error. The underrepresented classes are most at risk of crossing this threshold. Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
samples |
The number of labels that have to be processed before an imbalance
is evaluated. The rule might not be triggered until it has seen
sufficient samples across several steps. The more classes that your
dataset contains, the larger this Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
argmax |
If Conditional Valid values: Boolean Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
labels_regex |
The name of the tensor that contains the labels. Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
predictions_regex |
The name of the tensor that contains the predictions. Optional Valid values: String Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
class_imbalance
(), rule_parameters={ "threshold_imbalance": "10
", "threshold_misprediction": "0.7
", "samples": "500
", "argmax": "False
", "labels_regex": ".*labels
", "predictions_regex": ".*predictions
" }, collections_to_save=[ CollectionConfig( name="custom_output_collection
", parameters={ "include_regex": ".*labels|.*predictions
", "save_interval": "500
" } ) ] ) ]
LossNotDecreasing
This rule detects when the loss is not decreasing in value at an adequate rate. These losses must be scalars.
This rule can be applied either to one of the supported deep learning frameworks
(TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either
the collection_names
or tensor_regex
parameter. If both the
parameters are specified, the rule inspects the union of tensors from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the LossNotDecreasing Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
collection_names |
The list of collection names whose tensors the rule inspects. Optional Valid values: List of strings or a comma-separated string Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patterns that is used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: List of strings or a comma-separated string Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
use_losses_collection |
If set to Optional Valid values: Boolean Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
num_steps |
The minimum number of steps after which the rule checks if the
loss has decreased. Rule evaluation happens every
Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
diff_percent |
The minimum percentage difference by which the loss should
decrease between Optional Valid values: Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
increase_threshold_percent |
The maximum threshold percent that loss is allowed to increase in case loss has been increasing Optional Valid values: Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
mode |
The name of the Debugger mode to query tensor values for rule
checking. If this is not passed, the rule checks in order by default
for the Optional Valid values: String ( Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
loss_not_decreasing
(), rule_parameters={ "tensor_regex": ".*
", "use_losses_collection": "True
", "num_steps": "10
", "diff_percent": "0.1
", "increase_threshold_percent": "5
", "mode": "GLOBAL
" }, collections_to_save=[ CollectionConfig( name="losses
", parameters={ "save_interval
": "500
" } ) ] ) ]
Overfit
This rule detects if your model is being overfit to the training data by comparing the validation and training losses.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
A standard way to prevent overfitting is to regularize your model.
Parameter Descriptions for the Overfit Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: List of strings or a comma-separated string Default value: None |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
start_step |
The step from which to start comparing the validation and training loss. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience |
The number of steps for which the Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ratio_threshold |
The maximum ratio of the difference between the mean validation
loss and mean training loss to the mean training loss. If this
threshold is exceeded for a Optional Valid values: Float Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
overfit
(), rule_parameters={ "tensor_regex": ".*
", "start_step": "0
", "patience": "1
", "ratio_threshold": "0.1
" }, collections_to_save=[ CollectionConfig( name="losses
", parameters={ "train.save_interval": "100
", "eval.save_interval": "10
" } ) ] ) ]
Overtraining
This rule detects if a model is being overtrained. After a number of training iterations on a well-behaved model (both training and validation loss decrease), the model approaches to a minimum of the loss function and does not improve anymore. If the model continues training it can happen that validation loss starts increasing, because the model starts overfitting. This rule sets up thresholds and conditions to determine if the model is not improving, and prevents overfitting problems due to overtraining.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
Overtraining can be avoided by early stopping. For information on early stopping,
see Stop Training Jobs Early. For an example that
shows how to use spot training with Debugger, see Enable Spot Training with Amazon SageMaker Debugger
Parameter Descriptions for the Overtraining Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience_train |
The number of steps to wait before the training loss is considered to not to be improving anymore. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
patience_validation |
The number of steps to wait before the validation loss is considered
to not to be improving anymore. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
delta |
The minimum threshold by how much the error should improve before it is considered as a new optimum. Optional Valid values: Float Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
overtraining
(), rule_parameters={ "patience_train": "5
", "patience_validation": "10
", "delta": "0.01
" }, collections_to_save=[ CollectionConfig( name="losses
", parameters={ "save_interval
": "500
" } ) ] ) ]
SimilarAcrossRuns
This rule compares tensors gathered from a base trial with tensors from another trial.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the SimilarAcrossRuns Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
other_trials |
A completed training job name whose tensors you want to compare to
those tensors gathered from the current
Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
collection_names |
The list of collection names whose tensors the rule inspects. Optional Valid values: List of strings or a comma-separated string Default value: None |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: List of strings or a comma-separated string Default value: None |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
similar_across_runs
(), rule_parameters={ "other_trials": "<specify-another-job-name>
", "collection_names": "losses
", "tensor_regex": ".*
" }, collections_to_save=[ CollectionConfig( name="losses
", parameters={ "save_interval
": "500
" } ) ] ) ]
StalledTrainingRule
StalledTrainingRule detects if there is no progress made on training job, and stops
the training job if the rule fires. This rule requires tensors to be periodically saved
in a time interval defined by its threshold
parameter. This rule keeps on
monitoring for new tensors, and if no new tensor has been emitted for threshold interval
rule gets fired.
Parameter Descriptions for the StalledTrainingRule Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold |
A threshold that defines by how much time in seconds the rule waits for a tensor output until it fires a stalled training issue. Default value is 1800 seconds. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
stop_training_on_fire |
If set to Optional Valid values: Boolean Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
training_job_name_prefix |
The prefix of base training job name. If
Optional Valid values: String |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
stalled_training_rule
(), rule_parameters={ "threshold": "1800
", "stop_training_on_fire": "True
", "training_job_name_prefix": "<specify-training-base-job-name>
" }, collections_to_save=[ CollectionConfig( name="losses
", parameters={ "save_interval
": "500
" } ) ] ) ]
TensorVariance
This rule detects if you have tensors with very high or low variances. Very high or low variances in a tensor could lead to neuron saturation, which reduces the learning ability of the neural network. Very high variance in tensors can also eventually lead to exploding tensors. Use this rule to detect such issues early.
This rule can be applied either to one of the supported deep learning frameworks
(TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either
the collection_names
or tensor_regex
parameter. If both the
parameters are specified, the rule inspects the union of tensors from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the TensorVariance Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
collection_names |
The list of collection names whose tensors the rule inspects. Optional Valid values: List of strings or a comma-separated string Default value: None |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: List of strings or a comma-separated string Default value: None |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
max_threshold |
The threshold for the upper bound of tensor variance. Optional Valid values: Float Default value: None |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
min_threshold |
The threshold for the lower bound of tensor variance. Optional Valid values: Float Default value: None |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
tensor_variance
(), rule_parameters={ "collection_names": "weights
", "max_threshold": "10
", "min_threshold": "0.00001
", }, collections_to_save=[ CollectionConfig( name="weights
", parameters={ "save_interval
": "500
" } ) ] ) ]
UnchangedTensor
This rule detects whether a tensor is no longer changing across steps.
This rule runs the numpy.allclose
This rule can be applied either to one of the supported deep learning frameworks
(TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either
the collection_names
or tensor_regex
parameter. If both the
parameters are specified, the rule inspects the union of tensors from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the UnchangedTensor Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
collection_names |
The list of collection names whose tensors the rule inspects. Optional Valid values: List of strings or a comma-separated string Default value: None |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patternsused to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: List of strings or a comma-separated string Default value: None |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
num_steps |
The number of steps across which the rule checks to determine if the tensor has changed. This checks the last Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
rtol |
The relative tolerance parameter to be passed to the Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
atol |
The absolute tolerance parameter to be passed to the Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
equal_nan |
Whether to compare NaNs as equal. If Optional Valid values: Boolean Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
unchanged_tensor
(), rule_parameters={ "collection_names": "losses
", "tensor_regex": "", "num_steps": "3
", "rtol": "1e-05
", "atol": "1e-08
", "equal_nan": "False
" }, collections_to_save=[ CollectionConfig( name="losses
", parameters={ "save_interval
": "500
" } ) ] ) ]
CheckInputImages
This rule checks if input images have been correctly normalized. Specifically, it detects if the mean of the sample data differs by more than a threshold value from zero. Many computer vision models require that input data has a zero mean and unit variance.
This rule is applicable to deep learning applications.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the CheckInputImages Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_mean |
A threshold that defines by how much mean of the input data can differ from 0. Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold_samples |
The number of images that have to be sampled before an error can be thrown. If the value is too low, the estimation of the dataset mean will be inaccurate. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
regex |
The name of the input data tensor. Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
channel |
The position of the color channel in the input tensor shape array. Optional Valid values: Integer Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
check_input_images
(), rule_parameters={ "threshold_mean
": "0.2
", "threshold_samples
": "500
", "regex
": ".*hybridsequential0_input_0
", "channel
": "1
" }, collections_to_save=[ CollectionConfig( name="custom_inputs_collection
", parameters={ "include_regex
": ".*hybridsequential0_input_0
", "save_interval
": "500
" } ) ] ) ]
NLPSequenceRatio
This rule calculates the ratio of specific tokens given the rest of the input sequence that is useful for optimizing performance. For example, you can calculate the percentage of padding end-of-sentence (EOS) tokens in your input sequence. If the number of EOS tokens is too high, an alternate bucketing strategy should be performed. You also can calculate the percentage of unknown tokens in your input sequence. If the number of unknown words is too high, an alternate vocabulary could be used.
This rule is applicable to deep learning applications.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the NLPSequenceRatio Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional Valid values: List of strings or a comma-separated string Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
token_values |
A string of a list of the numerical values of the tokens. For example, "3, 0". Optional Valid values: Comma-separated string of numerical values Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
token_thresholds_percent |
A string of a list of thresholds (in percentages) that correspond
to each of the Optional Valid values: Comma-separated string of floats Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
nlp_sequence_ratio
(), rule_parameters={ "tensor_regex": ".*embedding0_input_0
", "token_values": "0
", "token_thresholds_percent": "50
" }, collections_to_save=[ CollectionConfig( name="custom_inputs_collection
", parameters={ "include_regex
": "" } ) ] ) ]
.*embedding0_input_0
Confusion
This rule evaluates the goodness of a confusion matrix for a classification problem.
It creates a matrix of size category_no*category_no
and populates it with
data coming from (labels
, predictions
) pairs. For each
(labels
, predictions
) pair, the count in
confusion[labels][predictions]
is incremented by 1. When the matrix is
fully populated, the ratio of data on-diagonal values and off-diagonal values are
evaluated as follows:
-
For elements on the diagonal:
confusion[i][i]/sum_j(confusion[j][j])>=min_diag
-
For elements off the diagonal:
confusion[j][i])/sum_j(confusion[j][i])<=max_off_diag
This rule can be applied to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the Confusion Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
category_no |
The number of categories. Optional Valid values: Integer ≥2 Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
labels |
The Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
predictions |
The Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
labels_collection |
The rule inspects the tensors in this collection for
Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
predictions_collection |
The rule inspects the tensors in this collection for
Optional Valid values: String Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
min_diag |
The minimum threshold for the ratio of data on the diagonal. Optional Valid values: Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
max_off_diag |
The maximum threshold for the ratio of data off the diagonal. Optional Valid values: Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
confusion
(), rule_parameters={ "category_no": "10
", "labels": "labels
", "predictions": "predictions
", "labels_collection": "labels
", "predictions_collection": "predictions
", "min_diag": "0.9
", "max_off_diag": "0.1
" }, collections_to_save=[ CollectionConfig( name="labels
", parameters={ "save_interval": "500
" } ), CollectionConfig( name="predictions
", parameters={ "include_regex": "500
" } ) ] ) ]
Note
This rule infers default values for the optional parameters if their values aren't specified.
FeatureImportanceOverweight
This rule accumulates the weights of the n largest feature importance values per step and ensures that they do not exceed the threshold. For example, you can set the threshold for the top 3 features to not hold more than 80 percent of the total weights of the model.
This rule is valid only for the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the FeatureImportanceOverweight Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
threshold |
Defines the threshold for the proportion of the cumulative sum of
the Optional Valid values: Float Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
nfeatures |
The number of largest features. Optional Valid values: Integer Default value: |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tensor_regex |
Regular expression (regex) of tensor names the rule to analyze. Optional Valid values: String Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
feature_importance_overweight
(), rule_parameters={ "threshold": "0.8
", "nfeatures": "3
", "tensor_regex": ".*feature_importance/weight
" }, collections_to_save=[ CollectionConfig( name="feature_importance
", parameters={ "save_interval
": "500
" } ) ] ) ]
TreeDepth
This rule measures the depth of trees in an XGBoost model. XGBoost rejects splits if
they do not improve loss. This regularizes the training. As a result, the tree might not
grow as deep as defined by the depth
parameter.
This rule is valid only for the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the TreeDepth Rule | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
depth |
The depth of the tree. The depth of the tree is obtained by computing the base 2 logarithm of the largest node ID. Optional Valid values: Float Default value: |
built_in_rules = [ Rule.sagemaker( base_config=rule_configs.
tree_depth
(), rule_parameters={ "depth
": "4
" }, collections_to_save=[ CollectionConfig( name="tree
", parameters={ "save_interval
": "500
" } ) ] ) ]