
Amazon SageMaker Debugger

Amazon SageMaker Debugger provides full visibility into model training by monitoring, recording, analyzing, and visualizing the tensors of the training process. In this context, a tensor is a multidimensional array of machine learning and deep learning metrics, such as weights, gradients, and losses, that is continuously updated during the backpropagation and optimization passes of training deep learning models.

Debugger Overview

Debugger can dramatically reduce the time, resources, and cost needed to train models. By using Debugger in Amazon SageMaker Studio or Amazon SageMaker notebook instances, you can use the supported features and frameworks to inspect training job issues and use a visual interface to analyze your tensor data.

SageMaker Debugger Python SDK and its client library smdebug are designed to create Python objects that enable you to interact with the saved tensors. Debugger provides tools to set up hooks and rules to save and access tensors, and to make the tensors available for analysis through its trial feature, all through flexible and powerful API operations. It supports the machine learning frameworks TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6 and later.

For direct resources on the Debugger and smdebug API operations, see the corresponding API documentation pages.

Debugger-supported Frameworks with AWS Deep Learning Containers and Built-in SageMaker Algorithms

To enable Amazon SageMaker Debugger, use one of the pre-built AWS Deep Learning Containers images or the SageMaker XGBoost built-in algorithm listed in the following tables.

Available Frameworks to Use Debugger

The following table lists the AWS Deep Learning Containers frameworks and the XGBoost built-in algorithm that you can use with Debugger. Each of these pre-built containers has the Debugger feature pre-installed, so you can run your training script without any changes and debug training jobs directly. If you use one of the frameworks fully supported by Debugger, your training job is automatically configured with hook registrations for your training script. To find the latest version updates and release notes, see the following table and the links to the frameworks or algorithms that you want to use.

Framework                            Versions

TensorFlow                           1.15, 2.1*, 2.2, 2.3

MXNet                                1.6, 1.7

PyTorch                              1.4, 1.5, 1.6

XGBoost (as a built-in algorithm)    0.90-2, 1.0-1

To start using the AWS containers and Debugger, see Use Debugger in AWS Containers.

* Debugger with zero script change is partially available for these TensorFlow versions. The inputs, outputs, gradients, and layers built-in collections are currently not available for these TensorFlow versions.

Available Frameworks to Use Debugger with Script Mode

If you want to use Debugger with a framework version not listed in the previous table, or to manually control the Debugger hook registration in your training script, you need to run your training job with script mode on the SageMaker training containers. For more information about the available SageMaker containers and how to register the Debugger hooks using the smdebug library, see the following table and the links to the frameworks or algorithms that you want to use.

Framework                            Versions

TensorFlow                           1.13, 1.14, 1.15, 2.1, 2.2, 2.3

Keras with TensorFlow backend

MXNet                                1.4, 1.5, 1.6, 1.7

PyTorch                              1.2, 1.3, 1.4, 1.5, 1.6

XGBoost (as a framework)             0.90-2, 1.0-1

To start using the AWS containers and Debugger with script mode, go to Debugger in AWS Containers with Script Mode.
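In script mode, you register the hook in your training script yourself using the smdebug library. The following is a minimal sketch for a PyTorch script; the output path and model are illustrative placeholders, and the exact hook options depend on your smdebug version.

```python
# Sketch: manually registering a Debugger hook in a PyTorch training
# script run with script mode. Assumes the smdebug library is installed;
# the model, loss, and output path are placeholders for illustration.
import torch.nn as nn
import smdebug.pytorch as smd

model = nn.Linear(10, 2)          # placeholder model
loss_fn = nn.CrossEntropyLoss()   # placeholder loss function

# Create a hook that writes tensors to a local or S3 path,
# saving every 100 steps.
hook = smd.Hook(
    out_dir="/opt/ml/output/tensors",
    save_config=smd.SaveConfig(save_interval=100),
)
hook.register_module(model)   # capture weights and gradients
hook.register_loss(loss_fn)   # capture the loss tensor
```

The same pattern applies to the other frameworks through their smdebug modules (for example, smdebug.tensorflow or smdebug.mxnet), with framework-specific hook classes.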

Use Debugger with Custom Containers

Amazon SageMaker Debugger is available for any deep learning model that you bring to SageMaker. The AWS CLI, the SageMaker Estimator API, and the Debugger APIs enable you to use any Docker base image to build and customize containers to train and debug your models.

To use Debugger with customized containers, see Use Debugger with Custom Training Containers.

How Debugger Works

With Amazon SageMaker Debugger, you can go beyond just looking at scalars, like losses and accuracies, when evaluating model training. Debugger gives you full visibility into a training job by using a hook to capture tensors that define the state of the training process at each point in the job's lifecycle. Debugger also provides rules to inspect the captured tensors.

Built-in rules monitor the training flow and alert you to problems with common conditions that are critical for the success of the training job. You can also create your own custom rules to watch for any issues specific to your model. You can monitor the results of the analysis done by rules with Amazon CloudWatch events, using an Amazon SageMaker notebook, or using visualizations provided by Amazon SageMaker Studio.
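To give a sense of the kind of condition a built-in rule watches for, the following toy function mimics the idea behind a vanishing-gradient check: flag a training step when the mean absolute gradient falls below a small threshold. This is an illustration only, not Debugger's actual rule implementation.

```python
# Toy illustration (not Debugger's actual implementation) of the kind
# of condition a built-in rule such as VanishingGradient checks.
def vanishing_gradient(gradients, threshold=1e-7):
    """Return True if the mean absolute value of the gradients falls
    below the threshold, signaling a vanishing-gradient condition."""
    mean_abs = sum(abs(g) for g in gradients) / len(gradients)
    return mean_abs < threshold

print(vanishing_gradient([1e-9, -2e-9, 5e-10]))  # True: gradients vanishing
print(vanishing_gradient([0.1, -0.2, 0.05]))     # False: healthy gradients
```

Debugger evaluates checks like this against the saved tensors in parallel with the training job, so problems surface while training is still running.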

The following diagram shows the flow for the model training process with Amazon SageMaker Debugger.

[Diagram: Overview of how Amazon SageMaker Debugger works.]

To use Amazon SageMaker Debugger hooks and rules, you activate them by adding only a few lines of code in your estimator object. Debugger and its client library smdebug help you set up hooks and rules that give you transparency into training jobs. Debugger and smdebug support the major machine learning frameworks (that is, TensorFlow, MXNet, and PyTorch) and the SageMaker pre-built algorithm XGBoost, while you run training jobs in the SageMaker environment.
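The following sketch shows what those few lines of code can look like in a SageMaker Python SDK estimator. The bucket name, IAM role, and training script are placeholders; adapt them to your account and framework version.

```python
# Sketch: enabling Debugger in a SageMaker estimator with a few lines
# of code. Assumes the sagemaker Python SDK; the bucket, role, and
# training script names are placeholders.
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
from sagemaker.tensorflow import TensorFlow

hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/debugger-tensors",  # placeholder bucket
    collection_configs=[
        CollectionConfig(name="losses"),     # save the loss tensors
        CollectionConfig(name="gradients"),  # save the gradient tensors
    ],
)

estimator = TensorFlow(
    entry_point="train.py",            # placeholder training script
    role="my-sagemaker-role",          # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.3",
    py_version="py37",
    debugger_hook_config=hook_config,  # the Debugger addition
)
estimator.fit()
```

Apart from the debugger_hook_config argument, this is an ordinary estimator definition, which is what makes enabling Debugger on a fully supported framework a small change.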

  • Choose a framework and use SageMaker pre-built training containers – SageMaker provides you with the option to use AWS Deep Learning Containers, the XGBoost built-in algorithm container, or custom containers to run training jobs. The combination of Debugger features and the pre-built AWS training containers makes your model debugging process simpler and more transparent. If you prefer to customize your training job, you can build containers using Deep Learning Containers base images and enable Debugger using the SageMaker Python SDK. To learn more, go to Use Debugger in AWS Containers and Use Debugger with Custom Training Containers.

  • Use a Debugger hook to save tensors – After you choose a container and a framework that fit your training script, use a Debugger hook to configure which tensors to save and the directory to save them to, such as an Amazon S3 bucket. A Debugger hook helps you build the configuration and keep it in your account for use in subsequent analyses, where it is secured for use with the most privacy-sensitive applications. To learn more, see Configure and Save Tensor Data Using the Debugger API Operations.

  • Use Debugger rules to inspect tensors in parallel with a training job – To analyze tensors, Debugger provides built-in rules for over a dozen abnormal training process behaviors. For example, a Debugger rule detects issues when the training process suffers from vanishing gradients, exploding tensors, overfitting, or overtraining. If necessary, you can build customized rules to analyze saved tensors using the Amazon SageMaker Debugger SDK. To learn more about the Debugger rules, see Use Debugger Built-in Rules for Training Job Analysis for detailed instructions and code examples, List of Debugger Built-in Rules for a full list of the Debugger rules, and Create Debugger Custom Rules for Training Job Analysis for customizing Debugger rules. In combination with Debugger rules, you can also use Amazon CloudWatch Events to invoke an AWS Lambda function that automatically stops the training job when the Debugger rules detect problems and trigger "IssuesFound" status. To configure the automated training job termination using Debugger, see Action on Amazon SageMaker Debugger Rules Using Amazon CloudWatch and AWS Lambda.

  • Create trials to analyze tensors – The smdebug trial is an object that lets you query the saved tensors from a given training job, specified by the path to which smdebug artifacts are saved. A trial is capable of loading new tensors as they become available at a given path, enabling you to do both offline and real-time analysis. To learn more about the smdebug trial, see the smdebug trial API.

  • Use SageMaker Studio for visualization – You can use Debugger in SageMaker Studio to visualize collected trials by Debugger. SageMaker Studio makes inspecting training job issues easier through its visual interface for analyzing your tensor data. To learn more, see Amazon SageMaker Studio Visualization Demos of Model Analysis with Debugger.
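The trial step above can be sketched as follows. This assumes the smdebug library and a path where a training job has already saved tensors; the S3 path and tensor name are placeholders.

```python
# Sketch: querying saved tensors with an smdebug trial. Assumes the
# smdebug library is installed and that a training job has saved
# tensors to the given path; the path and tensor name are placeholders.
from smdebug.trials import create_trial

trial = create_trial("s3://my-bucket/debugger-tensors")  # placeholder path

print(trial.tensor_names())  # list the names of all saved tensors

# Inspect one tensor's values across the saved training steps.
tensor = trial.tensor("gradients/dense_1/kernel")  # placeholder name
for step in tensor.steps():
    print(step, tensor.value(step))
```

Because a trial loads new tensors as they appear at the path, the same code works for offline analysis after training and for real-time analysis while the job is still running.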

Amazon SageMaker Debugger Tutorial Videos

The following video series provides a tour of Amazon SageMaker Debugger capabilities using SageMaker Studio and SageMaker notebook instances.

Debug Models with Amazon SageMaker Debugger in Studio

Julien Simon, AWS Technical Evangelist | Length: 14 minutes 17 seconds

This tutorial video demonstrates how to use Amazon SageMaker Debugger to capture and inspect debugging information from a training model. The example training model used in this video is a simple convolutional neural network (CNN) based on Keras with the TensorFlow backend. Using the SageMaker TensorFlow framework and Debugger, you can build an estimator directly from the training script and debug the training job.

You can find the example notebook from the video in this Studio Demo repository provided by the author. You need to clone the debugger.ipynb notebook file and the training script to your SageMaker Studio or SageMaker notebook instance. After you clone the two files, set the keras_script_path variable inside the debugger.ipynb notebook to the path of the training script. For example, if you cloned the two files into the same directory, set it as keras_script_path = "".

Deep Dive on Amazon SageMaker Debugger and SageMaker Model Monitor

Julien Simon, AWS Technical Evangelist | Length: 44 minutes 34 seconds

This video session explores advanced features of Debugger and SageMaker Model Monitor that help boost productivity and the quality of your models. First, this video shows how to detect and fix training issues, visualize tensors, and improve models with Debugger. Next, at 22:41, the video shows how to monitor models in production and identify prediction issues such as missing features or data drift using SageMaker Model Monitor. Finally, it offers cost optimization tips to help you make the most of your machine learning budget.

You can find the example notebook in the video in this AWS Dev Days 2020 repository offered by the author.