List of Debugger Built-in Rules
Use the Debugger built-in rules provided by Amazon SageMaker Debugger to analyze metrics and tensors collected while training your models. The Debugger built-in rules monitor various common conditions that are critical for the success of a training job. You can call the built-in rules using the Amazon SageMaker Python SDK.
Note
The maximum number of built-in rules that you can attach to a training job is 20. SageMaker Debugger fully manages the built-in rules and analyzes your training job synchronously.
Important
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
Debugger Rule
The following rules are the Debugger built-in rules that are callable using the Rule.sagemaker classmethod.
Debugger built-in rules for generating training reports
| Scope of Validity | Built-in Rules |
|---|---|
| Training report for SageMaker XGBoost training job | CreateXgboostReport |
Debugger built-in rules for debugging model training data (output tensors)
| Scope of Validity | Built-in Rules |
|---|---|
| Deep learning frameworks (TensorFlow, MXNet, and PyTorch) | DeadRelu, ExplodingTensor, PoorWeightInitialization, SaturatedActivation, VanishingGradient, WeightUpdateRatio |
| Deep learning frameworks (TensorFlow, MXNet, and PyTorch) and the XGBoost algorithm | AllZero, ClassImbalance, LossNotDecreasing, Overfit, Overtraining, SimilarAcrossRuns, StalledTrainingRule, TensorVariance, UnchangedTensor |
| Deep learning applications | CheckInputImages, NLPSequenceRatio |
| XGBoost algorithm | Confusion, FeatureImportanceOverweight, TreeDepth |
To use the built-in rules with default parameter values, use the following configuration format:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n())
]
To use the built-in rules with custom parameter values, use the following configuration format:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
            "key": "value"
        },
        collections_to_save=[
            CollectionConfig(
                name="tensor_collection_name",
                parameters={
                    "key": "value"
                }
            )
        ]
    )
]
To find available keys for the rule_parameters parameter, see the parameter description tables. Sample rule configuration code is provided for each built-in rule below its parameter description table.
- For full instructions and examples of using the Debugger built-in rules, see Debugger Built-in Rules Example Code.
- For full instructions on using the built-in rules with the low-level SageMaker API operations, see Configure Debugger Using Amazon SageMaker API.
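The rules list by itself doesn't start any analysis; it takes effect when you pass it to a SageMaker estimator. The following is a minimal sketch of attaching built-in rules to a training job through the SageMaker Python SDK. The estimator class, image URI, instance type, and S3 paths shown here are placeholders, not values prescribed by this topic; replace them with your own.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.debugger import Rule, rule_configs

# Placeholder values; replace with your own role, training image, and data locations.
role = sagemaker.get_execution_role()

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient())
]

estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    rules=rules  # Debugger starts one rule evaluation job per attached built-in rule
)

# estimator.fit("s3://<your-bucket>/<your-training-data>")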
CreateXgboostReport
The CreateXgboostReport rule collects output tensors from an XGBoost training job and autogenerates a comprehensive training report. You can download the report while a training job is running or after the training job is complete, and check the training progress or the final result of the training job. The CreateXgboostReport rule collects the following output tensors by default:
- hyperparameters – Saves at the first step
- metrics – Saves loss and accuracy every 5 steps
- feature_importance – Saves every 5 steps
- predictions – Saves every 5 steps
- labels – Saves every 5 steps
Parameter Descriptions for the CreateXgboostReport Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
rules = [
    Rule.sagemaker(
        rule_configs.create_xgboost_report()
    )
]
DeadRelu
This rule detects when the percentage of rectified linear unit (ReLU) activation functions in a trial that are considered dead, because their activation activity has dropped below a threshold, becomes too high. If the percent of inactive ReLUs in a layer is greater than the threshold_layer value, the rule returns True.
Parameter Descriptions for the DeadRelu Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: List of strings or a comma-separated string. Default value: |
| threshold_inactivity | Defines a level of activity below which a ReLU is considered to be dead. A ReLU might be active in the beginning of a trial and then slowly die during the training process. If the ReLU is active less than the threshold_inactivity, it is considered to be dead. Optional. Valid values: Float. Default value: |
| threshold_layer | Returns True if the percentage of inactive ReLUs in a layer is greater than the threshold_layer value; returns False otherwise. Optional. Valid values: Float. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.dead_relu(),
        rule_parameters={
            "tensor_regex": ".*relu_output|.*ReLU_output",
            "threshold_inactivity": "1.0",
            "threshold_layer": "50.0"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_relu_collection",
                parameters={
                    "include_regex": ".*relu_output|.*ReLU_output",
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
ExplodingTensor
This rule detects whether the tensors emitted during training have non-finite values, either infinite or NaN (not a number). If a non-finite value is detected, the rule returns True.
Parameter Descriptions for the ExplodingTensor Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| collection_names | The list of collection names whose tensors the rule inspects. Optional. Valid values: String. Default value: |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: String. Default value: |
| only_nan | Optional. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.exploding_tensor(),
        rule_parameters={
            "tensor_regex": ".*gradient",
            "only_nan": "False"
        },
        collections_to_save=[
            CollectionConfig(
                name="gradients",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
PoorWeightInitialization
This rule detects if your model parameters have been poorly initialized.
Good initialization breaks the symmetry of the weights and gradients in a neural network and maintains commensurate activation variances across layers. Otherwise, the neural network doesn't learn effectively. Initializers like Xavier aim to keep variance constant across activations, which is especially relevant for training very deep neural nets. Too small an initialization can lead to vanishing gradients. Too large an initialization can lead to exploding gradients. This rule checks the variance of activation inputs across layers, the distribution of gradients, and the loss convergence for the initial steps to determine if a neural network has been poorly initialized.
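As an illustration of the kind of check the threshold parameter below describes, the following NumPy sketch compares the minimum and maximum variance of activation inputs across layers. It is illustrative only, not Debugger's implementation; the tensor names and values are made up.

import numpy as np

# Hypothetical activation inputs captured for three layers at an early training step.
activation_inputs = {
    "dense0_relu_input": np.random.randn(256, 128) * 0.01,   # suspiciously small scale
    "dense1_relu_input": np.random.randn(256, 128),
    "dense2_relu_input": np.random.randn(256, 128) * 3.0,
}

variances = {name: float(np.var(t)) for name, t in activation_inputs.items()}
ratio = max(variances.values()) / min(variances.values())

threshold = 10.0  # plays the same role as the rule's `threshold` parameter
print(f"variance ratio across layers: {ratio:.1f}, poorly initialized: {ratio > threshold}")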
Parameter Descriptions for the PoorWeightInitialization Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| activation_inputs_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: String. Default value: |
| threshold | If the ratio between the minimum and maximum variance of weights per layer exceeds the threshold, the rule is triggered. Optional. Valid values: Float. Default value: |
| distribution_range | If the minimum difference between the 5th and 95th percentiles of the gradient distribution is less than the distribution_range, the rule is triggered. Optional. Valid values: Float. Default value: |
| patience | The number of steps to wait until the loss is considered to be no longer decreasing. Optional. Valid values: Integer. Default value: |
| steps | The number of steps this rule analyzes. You typically need to check only the first few iterations. Optional. Valid values: Float. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.poor_weight_initialization(),
        rule_parameters={
            "activation_inputs_regex": ".*relu_input|.*ReLU_input",
            "threshold": "10.0",
            "distribution_range": "0.001",
            "patience": "5",
            "steps": "10"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_relu_collection",
                parameters={
                    "include_regex": ".*relu_input|.*ReLU_input",
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
SaturatedActivation
This rule detects if the tanh and sigmoid activation layers are becoming saturated. An activation layer is saturated when the input of the layer is close to the maximum or minimum of the activation function. The minimum and maximum of the tanh and sigmoid activation functions are defined by their respective min_threshold and max_threshold values. If the activity of a node drops below the threshold_inactivity percentage, it is considered saturated. If more than a threshold_layer percent of the nodes are saturated, the rule returns True.
Parameter Descriptions for the SaturatedActivation Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| collection_names | The list of collection names whose tensors the rule inspects. Optional. Valid values: List of strings or a comma-separated string. Default value: None |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: String. Default value: |
| threshold_tanh_min | The minimum threshold that defines the extreme of the input for a tanh activation function. Optional. Valid values: Float. Default value: |
| threshold_tanh_max | The maximum threshold that defines the extreme of the input for a tanh activation function. Optional. Valid values: Float. Default value: |
| threshold_sigmoid_min | The minimum threshold that defines the extreme of the input for a sigmoid activation function. Optional. Valid values: Float. Default value: |
| threshold_sigmoid_max | The maximum threshold that defines the extreme of the input for a sigmoid activation function. Optional. Valid values: Float. Default value: |
| threshold_inactivity | The percentage of inactivity below which the activation layer is considered to be saturated. The activation might be active in the beginning of a trial and then slowly become less active during the training process. Optional. Valid values: Float. Default value: |
| threshold_layer | Returns True if more than a threshold_layer percent of the nodes in a layer are saturated; returns False otherwise. Optional. Valid values: Float. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.saturated_activation(),
        rule_parameters={
            "tensor_regex": ".*tanh_input|.*sigmoid_input",
            "threshold_tanh_min": "-9.4999",
            "threshold_tanh_max": "9.4999",
            "threshold_sigmoid_min": "-23",
            "threshold_sigmoid_max": "16.99999",
            "threshold_inactivity": "1.0",
            "threshold_layer": "50.0"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_activations_collection",
                parameters={
                    "include_regex": ".*tanh_input|.*sigmoid_input",
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
VanishingGradient
This rule detects if the gradients in a trial become extremely small or drop to a zero magnitude. If the mean of the absolute values of the gradients drops below a specified threshold, the rule returns True.
Parameter Descriptions for the VanishingGradient Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| threshold | The value at which the gradient is determined to be vanishing. Optional. Valid values: Float. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.vanishing_gradient(),
        rule_parameters={
            "threshold": "0.0000001"
        },
        collections_to_save=[
            CollectionConfig(
                name="gradients",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
WeightUpdateRatio
This rule keeps track of the ratio of updates to weights during training and detects if that ratio gets too large or too small. If the ratio of updates to weights is larger than the large_threshold value or smaller than the small_threshold value, the rule returns True.
Conditions for training are best when the updates are commensurate to gradients. Excessively large updates can push the weights away from optimal values, and very small updates result in very slow convergence. This rule requires weights to be available for two training steps, and train.save_interval needs to be set equal to num_steps.
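To make the ratio concrete, here is a small NumPy sketch of one common way to define the update-to-weight ratio for a single weight tensor saved at two consecutive saved steps. This is an illustration, not Debugger's implementation; the names and numbers are made up.

import numpy as np

# Hypothetical weight tensor captured at two saved steps (num_steps apart).
weights_prev = np.random.randn(128, 64)
weights_curr = weights_prev + 0.001 * np.random.randn(128, 64)  # a small update

epsilon = 1e-9  # avoids division by zero, same role as the rule's `epsilon` parameter
update_ratio = np.linalg.norm(weights_curr - weights_prev) / (np.linalg.norm(weights_prev) + epsilon)

large_threshold, small_threshold = 10.0, 1e-8
print(f"ratio={update_ratio:.2e}, too_large={update_ratio > large_threshold}, "
      f"too_small={update_ratio < small_threshold}")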
Parameter Descriptions for the WeightUpdateRatio Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| num_steps | The number of steps across which you want to compare the weight ratios. If you pass no value, the rule runs by default against the current step and the immediately previous saved step. If you override the default by passing a value for this parameter, the comparison is done between the weights at a step and the weights num_steps before it. Optional. Valid values: Integer. Default value: |
| large_threshold | The maximum value that the ratio of updates to weights can take before the rule returns True. Optional. Valid values: Float. Default value: |
| small_threshold | The minimum value that the ratio of updates to weights can take, below which the rule returns True. Optional. Valid values: Float. Default value: |
| epsilon | A small constant used to ensure that Debugger does not divide by zero when computing the ratio of updates to weights. Optional. Valid values: Float. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.weight_update_ratio(),
        rule_parameters={
            "num_steps": "100",
            "large_threshold": "10.0",
            "small_threshold": "0.00000001",
            "epsilon": "0.000000001"
        },
        collections_to_save=[
            CollectionConfig(
                name="weights",
                parameters={
                    "train.save_interval": "100"
                }
            )
        ]
    )
]
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
This rule is not available for the XGBoost algorithm.
AllZero
This rule detects if all or a specified percentage of the tensor values are zero.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or tensor_regex parameter. If both parameters are specified, the rule inspects the union of tensors from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the AllZero Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| collection_names | The list of collection names whose tensors the rule inspects. Optional. Valid values: List of strings or a comma-separated string. Default value: |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: List of strings or a comma-separated string. Default value: |
| threshold | Specifies the percentage of values in the tensor that need to be zero for this rule to be invoked. Optional. Valid values: Float. Default value: 100 (in percentage) |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.all_zero(),
        rule_parameters={
            "tensor_regex": ".*",
            "threshold": "100"
        },
        collections_to_save=[
            CollectionConfig(
                name="all",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
ClassImbalance
This rule measures sampling imbalances between classes and throws errors if the imbalance exceeds a threshold or if too many mispredictions for underrepresented classes occur as a result of the imbalance.
Classification models require well-balanced classes in the training dataset or a proper weighting/sampling of classes during training. The rule performs the following checks:
- It counts the occurrences per class. If the ratio of the number of samples between the smallest and largest class is larger than the threshold_imbalance, an error is thrown (see the sketch after this list).
- It checks the prediction accuracy per class. If resampling or weighting has not been correctly applied, then the model can reach high accuracy for the class with many training samples, but low accuracy for the classes with few training samples. If the fraction of mispredictions for a certain class is above the threshold_misprediction, an error is thrown.
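The first check amounts to a simple ratio between the largest and smallest class counts. The following NumPy sketch is an illustration under made-up data, not Debugger's code.

import numpy as np

# Hypothetical label stream seen so far (3 classes, heavily skewed toward class 0).
labels = np.random.choice([0, 1, 2], size=5000, p=[0.90, 0.07, 0.03])

counts = np.bincount(labels)
imbalance_ratio = counts.max() / counts.min()

threshold_imbalance = 10.0  # plays the same role as the rule's `threshold_imbalance`
print(f"class counts: {counts}, ratio: {imbalance_ratio:.1f}, "
      f"imbalanced: {imbalance_ratio > threshold_imbalance}")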
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the ClassImbalance Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| threshold_imbalance | The acceptable imbalance between the number of samples in the smallest class and in the largest class. Exceeding this threshold value throws an error. Optional. Valid values: Float. Default value: |
| threshold_misprediction | A limit on the fraction of mispredictions allowed for each class. Exceeding this threshold throws an error. The underrepresented classes are most at risk of crossing this threshold. Optional. Valid values: Float. Default value: |
| samples | The number of labels that have to be processed before an imbalance is evaluated. The rule might not be triggered until it has seen sufficient samples across several steps. The more classes that your dataset contains, the larger this value should be. Optional. Valid values: Integer. Default value: |
| argmax | Conditional. Valid values: Boolean. Default value: |
| labels_regex | The name of the tensor that contains the labels. Optional. Valid values: String. Default value: |
| predictions_regex | The name of the tensor that contains the predictions. Optional. Valid values: String. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.class_imbalance(),
        rule_parameters={
            "threshold_imbalance": "10",
            "threshold_misprediction": "0.7",
            "samples": "500",
            "argmax": "False",
            "labels_regex": ".*labels",
            "predictions_regex": ".*predictions"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_output_collection",
                parameters={
                    "include_regex": ".*labels|.*predictions",
                    "save_interval": "500"
                }
            )
        ]
    )
]
LossNotDecreasing
This rule detects when the loss is not decreasing in value at an adequate rate. These losses must be scalars.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or tensor_regex parameter. If both parameters are specified, the rule inspects the union of tensors from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the LossNotDecreasing Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| collection_names | The list of collection names whose tensors the rule inspects. Optional. Valid values: List of strings or a comma-separated string. Default value: |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: List of strings or a comma-separated string. Default value: |
| use_losses_collection | Optional. Valid values: Boolean. Default value: |
| num_steps | The minimum number of steps after which the rule checks if the loss has decreased. Rule evaluation happens every num_steps. Optional. Valid values: Integer. Default value: |
| diff_percent | The minimum percentage difference by which the loss should decrease between rule evaluations. Optional. Valid values: Default value: |
| increase_threshold_percent | The maximum threshold percent that the loss is allowed to increase in case the loss has been increasing. Optional. Valid values: Default value: |
| mode | The name of the Debugger mode to query tensor values for rule checking. If this is not passed, the rule checks the available modes in a default order. Optional. Valid values: String. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.loss_not_decreasing(),
        rule_parameters={
            "tensor_regex": ".*",
            "use_losses_collection": "True",
            "num_steps": "10",
            "diff_percent": "0.1",
            "increase_threshold_percent": "5",
            "mode": "GLOBAL"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
Overfit
This rule detects if your model is being overfit to the training data by comparing the validation and training losses.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
A standard way to prevent overfitting is to regularize your model.
Parameter Descriptions for the Overfit Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: List of strings or a comma-separated string. Default value: None |
| start_step | The step from which to start comparing the validation and training loss. Optional. Valid values: Integer. Default value: |
| patience | The number of steps for which the ratio_threshold can be exceeded before the model is considered to be overfitting. Optional. Valid values: Integer. Default value: |
| ratio_threshold | The maximum ratio of the difference between the mean validation loss and mean training loss to the mean training loss. If this threshold is exceeded for a patience number of steps, the rule returns True. Optional. Valid values: Float. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overfit(),
        rule_parameters={
            "tensor_regex": ".*",
            "start_step": "0",
            "patience": "1",
            "ratio_threshold": "0.1"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "train.save_interval": "100",
                    "eval.save_interval": "10"
                }
            )
        ]
    )
]
Overtraining
This rule detects if a model is being overtrained. After a number of training iterations on a well-behaved model (both training and validation loss decrease), the model approaches a minimum of the loss function and does not improve anymore. If the model continues training, the validation loss can start increasing because the model starts overfitting. This rule sets up thresholds and conditions to determine if the model is not improving, and prevents overfitting problems due to overtraining.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Note
Overtraining can be avoided by early stopping. For information on early stopping, see Stop Training Jobs Early. For an example that shows how to use spot training with Debugger, see Enable Spot Training with Amazon SageMaker Debugger.
Parameter Descriptions for the Overtraining Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| patience_train | The number of steps to wait before the training loss is considered to be no longer improving. Optional. Valid values: Integer. Default value: |
| patience_validation | The number of steps to wait before the validation loss is considered to be no longer improving. Optional. Valid values: Integer. Default value: |
| delta | The minimum threshold by which the error should improve before it is considered a new optimum. Optional. Valid values: Float. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overtraining(),
        rule_parameters={
            "patience_train": "5",
            "patience_validation": "10",
            "delta": "0.01"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
SimilarAcrossRuns
This rule compares tensors gathered from a base trial with tensors from another trial.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the SimilarAcrossRuns Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| other_trials | A completed training job name whose tensors you want to compare to the tensors gathered from the current base_trial. Required. Valid values: String |
| collection_names | The list of collection names whose tensors the rule inspects. Optional. Valid values: List of strings or a comma-separated string. Default value: None |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: List of strings or a comma-separated string. Default value: None |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.similar_across_runs(),
        rule_parameters={
            "other_trials": "<specify-another-job-name>",
            "collection_names": "losses",
            "tensor_regex": ".*"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
StalledTrainingRule
StalledTrainingRule detects if there is no progress made on a training job, and stops the training job if the rule fires. This rule requires tensors to be saved periodically within a time interval defined by its threshold parameter. The rule keeps monitoring for new tensors, and if no new tensor has been emitted for the threshold interval, the rule fires.
Parameter Descriptions for the StalledTrainingRule Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| threshold | The amount of time, in seconds, that the rule waits for a tensor output before it fires a stalled-training issue. The default value is 1800 seconds. Optional. Valid values: Integer. Default value: |
| stop_training_on_fire | If set to True, stops the training job when the rule fires. Optional. Valid values: Boolean. Default value: |
| training_job_name_prefix | The prefix of the base training job name. Optional. Valid values: String |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
            "threshold": "1800",
            "stop_training_on_fire": "True",
            "training_job_name_prefix": "<specify-training-base-job-name>"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
TensorVariance
This rule detects if you have tensors with very high or low variances. Very high or low variances in a tensor could lead to neuron saturation, which reduces the learning ability of the neural network. Very high variance in tensors can also eventually lead to exploding tensors. Use this rule to detect such issues early.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or tensor_regex parameter. If both parameters are specified, the rule inspects the union of tensors from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the TensorVariance Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| collection_names | The list of collection names whose tensors the rule inspects. Optional. Valid values: List of strings or a comma-separated string. Default value: None |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: List of strings or a comma-separated string. Default value: None |
| max_threshold | The threshold for the upper bound of tensor variance. Optional. Valid values: Float. Default value: None |
| min_threshold | The threshold for the lower bound of tensor variance. Optional. Valid values: Float. Default value: None |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.tensor_variance(),
        rule_parameters={
            "collection_names": "weights",
            "max_threshold": "10",
            "min_threshold": "0.00001"
        },
        collections_to_save=[
            CollectionConfig(
                name="weights",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
UnchangedTensor
This rule detects whether a tensor is no longer changing across steps.
This rule runs the numpy.allclose operation to check if the tensor isn't changing.
This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the collection_names or tensor_regex parameter. If both parameters are specified, the rule inspects the union of tensors from both sets.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
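For reference, numpy.allclose treats two arrays as equal when |a - b| <= atol + rtol * |b| holds element-wise. The following snippet only illustrates how the rtol, atol, and equal_nan parameters described below behave; the values are arbitrary.

import numpy as np

step_n        = np.array([0.5000000, 1.0, np.nan])
step_n_plus_1 = np.array([0.5000001, 1.0, np.nan])

# The tolerances play the same role as the rule's rtol/atol/equal_nan parameters.
print(np.allclose(step_n, step_n_plus_1, rtol=1e-05, atol=1e-08, equal_nan=False))  # False: NaN != NaN
print(np.allclose(step_n, step_n_plus_1, rtol=1e-05, atol=1e-08, equal_nan=True))   # True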
Parameter Descriptions for the UnchangedTensor Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| collection_names | The list of collection names whose tensors the rule inspects. Optional. Valid values: List of strings or a comma-separated string. Default value: None |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: List of strings or a comma-separated string. Default value: None |
| num_steps | The number of steps across which the rule checks to determine if the tensor has changed. This checks the last num_steps that are available. Optional. Valid values: Integer. Default value: |
| rtol | The relative tolerance parameter to be passed to the numpy.allclose operation. Optional. Valid values: Float. Default value: |
| atol | The absolute tolerance parameter to be passed to the numpy.allclose operation. Optional. Valid values: Float. Default value: |
| equal_nan | Whether to compare NaNs as equal. If True, NaN values are considered equal to each other. Optional. Valid values: Boolean. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.unchanged_tensor(),
        rule_parameters={
            "collection_names": "losses",
            "tensor_regex": "",
            "num_steps": "3",
            "rtol": "1e-05",
            "atol": "1e-08",
            "equal_nan": "False"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
CheckInputImages
This rule checks if input images have been correctly normalized. Specifically, it detects if the mean of the sample data differs by more than a threshold value from zero. Many computer vision models require that input data has a zero mean and unit variance.
This rule is applicable to deep learning applications.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
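As background for what the rule verifies, input images are typically normalized to roughly zero mean and unit variance before training. The following NumPy sketch is illustrative only, with made-up statistics; it shows such a normalization and the mean check that the threshold_mean parameter below refers to.

import numpy as np

# Hypothetical batch of images: (batch, channel, height, width), raw pixel values in [0, 255].
batch = np.random.randint(0, 256, size=(32, 3, 224, 224)).astype(np.float32)

# Normalize to approximately zero mean and unit variance.
normalized = (batch - batch.mean()) / (batch.std() + 1e-8)

threshold_mean = 0.2  # plays the same role as the rule's `threshold_mean`
print(f"normalized mean: {normalized.mean():.4f}, "
      f"properly normalized: {abs(normalized.mean()) < threshold_mean}")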
Parameter Descriptions for the CheckInputImages Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| threshold_mean | A threshold that defines by how much the mean of the input data can differ from 0. Optional. Valid values: Float. Default value: |
| threshold_samples | The number of images that have to be sampled before an error can be thrown. If the value is too low, the estimation of the dataset mean is inaccurate. Optional. Valid values: Integer. Default value: |
| regex | The name of the input data tensor. Optional. Valid values: String. Default value: |
| channel | The position of the color channel in the input tensor shape array. Optional. Valid values: Integer. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.check_input_images(),
        rule_parameters={
            "threshold_mean": "0.2",
            "threshold_samples": "500",
            "regex": ".*hybridsequential0_input_0",
            "channel": "1"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_inputs_collection",
                parameters={
                    "include_regex": ".*hybridsequential0_input_0",
                    "save_interval": "500"
                }
            )
        ]
    )
]
NLPSequenceRatio
This rule calculates the ratio of specific tokens given the rest of the input sequence that is useful for optimizing performance. For example, you can calculate the percentage of padding end-of-sentence (EOS) tokens in your input sequence. If the number of EOS tokens is too high, an alternate bucketing strategy should be performed. You also can calculate the percentage of unknown tokens in your input sequence. If the number of unknown words is too high, an alternate vocabulary could be used.
This rule is applicable to deep learning applications.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
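The check boils down to the share of particular token IDs in the input sequence. The sketch below is illustrative only, with a made-up padded batch; it computes the percentage of a padding token and compares it against a threshold that plays the role of token_thresholds_percent.

import numpy as np

# Hypothetical padded batch of token IDs, where 0 is the padding/EOS token.
batch = np.array([
    [12, 45, 7, 0, 0, 0, 0, 0],
    [ 3,  9, 0, 0, 0, 0, 0, 0],
])

token_value = 0
pad_percent = 100.0 * np.mean(batch == token_value)

token_threshold_percent = 50.0  # same role as the rule's `token_thresholds_percent`
print(f"padding tokens: {pad_percent:.1f}%, "
      f"consider a different bucketing strategy: {pad_percent > token_threshold_percent}")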
Parameter Descriptions for the NLPSequenceRatio Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| tensor_regex | A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. Optional. Valid values: List of strings or a comma-separated string. Default value: |
| token_values | A string of a list of the numerical values of the tokens. For example, "3, 0". Optional. Valid values: Comma-separated string of numerical values. Default value: |
| token_thresholds_percent | A string of a list of thresholds (in percentages) that correspond to each of the token_values. Optional. Valid values: Comma-separated string of floats. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.nlp_sequence_ratio(),
        rule_parameters={
            "tensor_regex": ".*embedding0_input_0",
            "token_values": "0",
            "token_thresholds_percent": "50"
        },
        collections_to_save=[
            CollectionConfig(
                name="custom_inputs_collection",
                parameters={
                    "include_regex": ".*embedding0_input_0"
                }
            )
        ]
    )
]
Confusion
This rule evaluates the goodness of a confusion matrix for a classification problem. It creates a matrix of size category_no*category_no and populates it with data coming from (labels, predictions) pairs. For each (labels, predictions) pair, the count in confusion[labels][predictions] is incremented by 1. When the matrix is fully populated, the ratios of on-diagonal and off-diagonal values are evaluated as follows (a sketch follows the list):
- For elements on the diagonal: confusion[i][i]/sum_j(confusion[j][j]) >= min_diag
- For elements off the diagonal: confusion[j][i]/sum_j(confusion[j][i]) <= max_off_diag
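The following NumPy sketch is illustrative only, with made-up labels and predictions; it builds the matrix as described above and applies the off-diagonal ratio check as written (the on-diagonal check is analogous).

import numpy as np

category_no = 3
labels      = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])
predictions = np.array([0, 0, 1, 2, 2, 2, 2, 0, 1, 1])

# Populate confusion[label][prediction]: increment the matching cell for each pair.
confusion = np.zeros((category_no, category_no), dtype=int)
for l, p in zip(labels, predictions):
    confusion[l][p] += 1
print(confusion)

# Off-diagonal check: each off-diagonal element divided by its column sum
# must stay at or below max_off_diag.
max_off_diag = 0.1
col_sums = confusion.sum(axis=0)
off_diag_ratios = confusion / np.where(col_sums == 0, 1, col_sums)
np.fill_diagonal(off_diag_ratios, 0)
print("off-diagonal ok:", bool((off_diag_ratios <= max_off_diag).all()))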
This rule can be applied to the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the Confusion Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| category_no | The number of categories. Optional. Valid values: Integer ≥2. Default value: |
| labels | The tensor name for the labels. Optional. Valid values: String. Default value: |
| predictions | The tensor name for the predictions. Optional. Valid values: String. Default value: |
| labels_collection | The rule inspects the tensors in this collection for labels. Optional. Valid values: String. Default value: |
| predictions_collection | The rule inspects the tensors in this collection for predictions. Optional. Valid values: String. Default value: |
| min_diag | The minimum threshold for the ratio of data on the diagonal. Optional. Valid values: Default value: |
| max_off_diag | The maximum threshold for the ratio of data off the diagonal. Optional. Valid values: Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.confusion(),
        rule_parameters={
            "category_no": "10",
            "labels": "labels",
            "predictions": "predictions",
            "labels_collection": "labels",
            "predictions_collection": "predictions",
            "min_diag": "0.9",
            "max_off_diag": "0.1"
        },
        collections_to_save=[
            CollectionConfig(
                name="labels",
                parameters={
                    "save_interval": "500"
                }
            ),
            CollectionConfig(
                name="predictions",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
Note
This rule infers default values for the optional parameters if their values aren't specified.
FeatureImportanceOverweight
This rule accumulates the weights of the n largest feature importance values per step and ensures that they do not exceed the threshold. For example, you can set the threshold for the top 3 features to not hold more than 80 percent of the total weights of the model.
This rule is valid only for the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
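To make the check concrete, the sketch below is illustrative only, with made-up feature-importance weights; it computes the share of the total weight held by the n largest features, mirroring the threshold and nfeatures parameters described below.

import numpy as np

# Hypothetical feature-importance weights for 8 features at one step.
feature_importance = np.array([0.35, 0.30, 0.20, 0.05, 0.04, 0.03, 0.02, 0.01])

nfeatures = 3
threshold = 0.8  # the top-n features should not hold more than 80% of the total weight

top_n_share = np.sort(feature_importance)[-nfeatures:].sum() / feature_importance.sum()
print(f"top-{nfeatures} share: {top_n_share:.2f}, overweight: {top_n_share > threshold}")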
Parameter Descriptions for the FeatureImportanceOverweight Rule
| Parameter Name | Description |
|---|---|
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required. Valid values: String |
| threshold | Defines the threshold for the proportion of the cumulative sum of the nfeatures largest features. Optional. Valid values: Float. Default value: |
| nfeatures | The number of largest features. Optional. Valid values: Integer. Default value: |
| tensor_regex | Regular expression (regex) of tensor names for the rule to analyze. Optional. Valid values: String. Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.feature_importance_overweight(),
        rule_parameters={
            "threshold": "0.8",
            "nfeatures": "3",
            "tensor_regex": ".*feature_importance/weight"
        },
        collections_to_save=[
            CollectionConfig(
                name="feature_importance",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
TreeDepth
This rule measures the depth of trees in an XGBoost model. XGBoost rejects splits if they do not improve loss. This regularizes the training. As a result, the tree might not grow as deep as defined by the depth parameter.
This rule is valid only for the XGBoost algorithm.
For an example of how to configure and deploy a built-in rule, see Configure Debugger Built-in Rules.
Parameter Descriptions for the TreeDepth Rule
Parameter Name | Description |
---|---|
base_trial |
The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. Required Valid values: String |
depth |
The depth of the tree. The depth of the tree is obtained by computing the base 2 logarithm of the largest node ID. Optional Valid values: Float Default value: |
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.tree_depth(),
        rule_parameters={
            "depth": "4"
        },
        collections_to_save=[
            CollectionConfig(
                name="tree",
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]