Configure Debugger Built-in Rules
Amazon SageMaker Debugger rules analyze tensors emitted during the training of a model. Debugger offers the Rule API operation, which monitors training job progress and detects errors that can prevent your model from training successfully. For example, the rules can detect whether gradients are getting too large or too small, whether a model is overfitting or overtraining, and whether a training job fails to decrease its loss function and improve. To see a full list of available built-in rules, see List of Debugger Built-in Rules.
The built-in rules are prepared in Amazon SageMaker processing containers and fully managed by SageMaker Debugger. By default, Debugger initiates the ProfilerReport rule for all SageMaker training jobs, even if you do not specify any Debugger-specific rule parameter to the SageMaker estimator. The ProfilerReport rule invokes the built-in rules for monitoring system bottlenecks and profiling framework metrics.
Debugger saves the profiling report in a default S3 bucket. The format of the default S3 bucket URI is s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/rule-output/.
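For example, you can list and download the rule output artifacts from that default location with a few lines of boto3. The following is a minimal sketch; the bucket name and job prefix are placeholders that you replace with your own Region, account ID, and training job name.

import boto3

s3 = boto3.client("s3")

# Placeholder values; replace with your own Region, account ID, and training job name.
bucket = "sagemaker-us-west-2-111122223333"
prefix = "your-training-job-name/rule-output/"

# List the rule output artifacts and download each one to the current directory.
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    filename = obj["Key"].split("/")[-1]
    if filename:  # skip "directory" placeholder keys
        s3.download_file(bucket, obj["Key"], filename)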
For more information about how to download the profiling report, see SageMaker Debugger Profiling Report. SageMaker Debugger fully manages the built-in rules and analyzes your training job in parallel. For more information about billing, see the "Amazon SageMaker Studio is available at no additional charge" section of the Amazon SageMaker Pricing page.
In the following topics, learn how to use the Debugger built-in rules.
Topics
- Use Debugger Built-in Rules with the Default Parameter Settings
- Use Debugger Built-in Rules with Custom Parameter Values
- Example Notebooks and Code Samples to Configure Debugger Rules
Use Debugger Built-in Rules with the Default Parameter Settings
To specify Debugger built-in rules in an estimator, you need to configure a rules list object. The following example code shows the basic structure of listing the Debugger built-in rules:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_n()),
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n())
]
For more information about default parameter values and descriptions of the built-in rules, see List of Debugger Built-in Rules.
For example, to inspect the overall training performance and progress of your model, construct a SageMaker estimator with the following built-in rule configuration.
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules=[
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.stalled_training_rule())
]
When you start the training job, Debugger collects system resource utilization data every 500 milliseconds and the loss and accuracy values every 500 steps by default. Debugger analyzes the resource utilization to identify whether your model is having bottleneck problems. The loss_not_decreasing, overfit, overtraining, and stalled_training_rule rules monitor whether your model is optimizing the loss function without running into those training issues. If the rules detect training anomalies, the rule evaluation status changes to IssueFound. You can set up automated actions, such as notifying you of training issues and stopping training jobs, using Amazon CloudWatch Events and AWS Lambda. For more information, see Action on Amazon SageMaker Debugger Rules.
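You can also check the evaluation status of each rule programmatically while the training job runs or after it finishes. The following is a minimal sketch, assuming an estimator object named sagemaker_estimator that was configured with the rules above and has been fit; it uses the rule_job_summary method of the SageMaker Python SDK.

# Minimal sketch: print the name and current evaluation status of each Debugger rule.
# Assumes sagemaker_estimator was constructed with the rules above and has been fit.
for summary in sagemaker_estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], ":", summary["RuleEvaluationStatus"])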
Use Debugger Built-in Rules with Custom Parameter Values
If you want to adjust the built-in rule parameter values and customize the tensor collection regex, configure the base_config and rule_parameters parameters for the ProfilerRule.sagemaker and Rule.sagemaker class methods. In the case of the Rule.sagemaker class methods, you can also customize tensor collections through the collections_to_save parameter. Instructions for how to use the CollectionConfig class are provided at Configure Debugger Tensor Collections Using the CollectionConfig API Operation.
Use the following configuration template for built-in rules to customize parameter values. By changing the rule parameters, you can adjust how sensitive the rules are to being triggered.
- The base_config argument is where you call the built-in rule methods.
- The rule_parameters argument is to adjust the default key values of the built-in rules listed in List of Debugger Built-in Rules.
- The collections_to_save argument takes in a tensor configuration through the CollectionConfig API, which requires name and parameters arguments.
  - To find available tensor collections for name, see Debugger Built-in Tensor Collections.
  - For a full list of adjustable parameters, see Debugger CollectionConfig API.
- For more information about the Debugger rule class, methods, and parameters, see the SageMaker Debugger Rule class.
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInProfilerRuleName(),
        rule_parameters={
            "key": "value"
        }
    ),
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
            "key": "value"
        },
        collections_to_save=[
            CollectionConfig(
                name="tensor_collection_name",
                parameters={
                    "key": "value"
                }
            )
        ]
    )
]
The parameter descriptions and value customization examples are provided for each rule at List of Debugger Built-in Rules.
Example Notebooks and Code Samples to Configure Debugger Rules
The following sections provide notebooks and code samples that show how to use Debugger rules to monitor SageMaker training jobs.
Topics
- Debugger Built-in Rules Example Notebooks
- Debugger Built-in Rules Example Code
- Use Debugger Built-in Rules with Parameter Modifications
Debugger Built-in Rules Example Notebooks
The following example notebooks show how to use Debugger built-in rules when running training jobs with Amazon SageMaker:
While running the example notebooks in SageMaker Studio, you can find the training job
trial created on the Studio Experiment List tab. For
example, as shown in the following screenshot, you can find and open a
Describe Trial Component window of your current training job.
On the Debugger tab, you can check if the Debugger rules, vanishing_gradient() and loss_not_decreasing(), are monitoring the training session in parallel.
For full instructions on how to find your training job trial components in the Studio UI, see SageMaker Studio - View Experiments, Trials, and Trial Components.

There are two ways of using the Debugger built-in rules in the SageMaker environment: deploy the built-in rules as they are prepared or adjust their parameters as you want. The following topics show you how to use the built-in rules with example code.
Debugger Built-in Rules Example Code
The following code sample shows how to set the Debugger built-in rules using the Rule.sagemaker method. To specify built-in rules that you want to run, use the rule_configs API operation to call the built-in rules.
To find a full list of Debugger built-in rules and default parameter values, see
List of Debugger Built-in Rules.
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call built-in rules that you want to use.
built_in_rules=[
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing())
]

# construct a SageMaker estimator with the Debugger built-in rules
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-built-in-rules-demo',
    instance_count=1,
    instance_type="ml.m4.xlarge",
    framework_version="2.1.0",
    py_version="py3",
    # debugger-specific arguments below
    rules=built_in_rules
)
sagemaker_estimator.fit()
The Debugger built-in rules run in parallel with your training job. The maximum number of built-in rule containers for a training job is 20.
For more information about the Debugger rule class, methods, and parameters, see the SageMaker Debugger Rule class.
To find an example of how to adjust the Debugger rule parameters, see the following Use Debugger Built-in Rules with Parameter Modifications section.
Use Debugger Built-in Rules with Parameter Modifications
The following code example shows the structure of built-in rules with adjusted parameters. In this example, the stalled_training_rule collects the losses tensor collection from a training job every 50 steps during training and every 10 steps during evaluation. If the training process starts stalling and does not collect tensor outputs for 120 seconds, the stalled_training_rule stops the training job.
import time
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call the built-in rules and modify the CollectionConfig parameters
base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))

built_in_rules_modified=[
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
            'threshold': '120',
            'training_job_name_prefix': base_job_name_prefix,
            'stop_training_on_fire' : 'True'
        },
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "train.save_interval": "50",
                    "eval.save_interval": "10"
                }
            )
        ]
    )
]

# construct a SageMaker estimator with the modified Debugger built-in rule
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name=base_job_name_prefix,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    framework_version="2.1.0",
    py_version="py3",
    # debugger-specific arguments below
    rules=built_in_rules_modified
)
sagemaker_estimator.fit()
For an advanced configuration of the Debugger built-in rules using the CreateTrainingJob API, see Configure Debugger Using Amazon SageMaker API.
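For reference, the following is a hedged sketch of how the DebugRuleConfigurations parameter of the low-level CreateTrainingJob API (called here through boto3) is structured. The rule evaluator image URI is a placeholder, and the remaining required CreateTrainingJob parameters (role, algorithm specification, input data, and so on) are omitted; see Configure Debugger Using Amazon SageMaker API for the exact values for your Region.

import boto3

sm_client = boto3.client("sagemaker")

# Placeholder rule configuration for the low-level API. Replace the image URI
# with the Debugger rule evaluator image for your Region.
debug_rule_configurations = [
    {
        "RuleConfigurationName": "LossNotDecreasing",
        "RuleEvaluatorImage": "<debugger-rule-evaluator-image-uri>",
        "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"},
    }
]

# sm_client.create_training_job(
#     TrainingJobName="debugger-built-in-rules-demo",
#     DebugRuleConfigurations=debug_rule_configurations,
#     ...  # remaining required CreateTrainingJob parameters
# )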