
Amazon SageMaker Debugger

Amazon SageMaker Debugger provides full visibility into model training by monitoring, recording, analyzing, and visualizing tensors from the training process. In this context, a tensor is a high-dimensional array of machine learning and deep learning metrics, such as weights, gradients, and losses; in other words, it is a collection of metrics that is continuously updated during the backpropagation and optimization phases of training a deep learning model.

Debugger provides a hook feature to configure which tensors to collect and how to save them, capture tensors from the entire training process or from a specific phase you want to debug, and store the tensors in Amazon S3 buckets.
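For example, when launching a training job with the SageMaker Python SDK, the hook can be configured through a DebuggerHookConfig and attached to an estimator. The following is a minimal sketch; the bucket name, script name, role ARN, instance type, and save intervals are illustrative assumptions, not values from this page:

```python
# Sketch: configuring a Debugger hook with the SageMaker Python SDK.
# All resource names below (bucket, role, script) are placeholder assumptions.
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
from sagemaker.tensorflow import TensorFlow

hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/debugger-tensors",  # where tensors are saved
    collection_configs=[
        # Save the built-in "gradients" and "losses" tensor collections
        CollectionConfig(name="gradients", parameters={"save_interval": "100"}),
        CollectionConfig(name="losses", parameters={"save_interval": "10"}),
    ],
)

estimator = TensorFlow(
    entry_point="train.py",  # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.3",
    py_version="py37",
    debugger_hook_config=hook_config,  # attach the Debugger hook
)
```

Calling `estimator.fit()` would then start the training job with the hook saving the configured tensor collections to the given S3 path.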

A Debugger rule adds an automated alerting system that detects anomalies by watching the tensors collected by the Debugger hook during model training. For example, Debugger rules can detect when a training job suffers from vanishing gradients, exploding tensors, overfitting, or overtraining.
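As a sketch, built-in rules such as the ones for vanishing gradients and overfitting can be attached to a training job through the SageMaker Python SDK; which rules you choose is an illustrative assumption here:

```python
# Sketch: attaching built-in Debugger rules to a training job
# (SageMaker Python SDK). The rule selection is an illustrative assumption.
from sagemaker.debugger import Rule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),  # alert on vanishing gradients
    Rule.sagemaker(rule_configs.overfit()),             # alert on overfitting
]
# Pass the list to the estimator when constructing it, e.g.:
# estimator = TensorFlow(..., rules=rules)
```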

If such an issue occurs, the rule changes the status of the training job and triggers a CloudWatch Events event. You can create a CloudWatch Events rule that directs training job state change events to an AWS Lambda function, which stops the training job when a Debugger rule is triggered and reports the status "IssuesFound".
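A minimal sketch of such a Lambda function follows, assuming the event detail for a "SageMaker Training Job State Change" event mirrors the fields of the DescribeTrainingJob response (including DebugRuleEvaluationStatuses); the helper name should_stop is hypothetical:

```python
# Hypothetical Lambda handler that stops a training job when any Debugger
# rule evaluation reports "IssuesFound". Assumes the CloudWatch event detail
# carries TrainingJobName and DebugRuleEvaluationStatuses, as in the
# DescribeTrainingJob API response.

def should_stop(event):
    """Return True if any Debugger rule evaluation reports IssuesFound."""
    statuses = event.get("detail", {}).get("DebugRuleEvaluationStatuses", [])
    return any(s.get("RuleEvaluationStatus") == "IssuesFound" for s in statuses)

def lambda_handler(event, context):
    if should_stop(event):
        import boto3  # imported lazily so should_stop stays testable offline
        job_name = event["detail"]["TrainingJobName"]
        boto3.client("sagemaker").stop_training_job(TrainingJobName=job_name)
        return {"stopped": job_name}
    return {"stopped": None}
```

The decision logic is separated into should_stop so it can be exercised without AWS credentials; only the actual stop call requires boto3.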

By streamlining the debugging workflow, Debugger can dramatically reduce the time, resources, and cost needed to train models. Using Debugger in Amazon SageMaker Studio or on Amazon SageMaker notebook instances makes it easier to inspect training job issues with the supported features and frameworks, and it provides a visual interface for analyzing your tensor data.

The SageMaker Debugger Python SDK and its client library, smdebug, create Python objects that let you interact with the saved tensors. Debugger provides tools to set up hooks and rules so you can easily save and access tensors, and it makes the tensors available for analysis through its trial feature, all through a flexible and powerful API. It supports the TensorFlow, PyTorch, MXNet, and XGBoost machine learning frameworks on Python 3.6 and above.
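For instance, the trial feature mentioned above can be used to load and inspect saved tensors after (or during) training. This is a sketch only; the S3 path and tensor name are placeholder assumptions that depend on your job and framework:

```python
# Sketch: analyzing saved tensors with the smdebug client library.
# The S3 path and tensor name below are illustrative placeholders.
from smdebug.trials import create_trial

# Point the trial at the S3 location where the Debugger hook saved tensors
trial = create_trial("s3://my-bucket/debugger-tensors")

print(trial.tensor_names())         # list every tensor Debugger saved
loss = trial.tensor("losses/loss")  # exact name depends on your framework
for step in loss.steps():
    print(step, loss.value(step))   # loss value at each saved training step
```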

To explore the Debugger API in detail, see the following pages:

Amazon SageMaker Debugger Sample Notebooks

The following Amazon SageMaker Debugger sample notebooks show how to set up Amazon SageMaker training jobs, configure Debugger hooks to save tensors, apply Debugger rules to the tensors to monitor the status of training jobs, and visualize the results.

We recommend using the notebooks in the following order:

Amazon SageMaker Debugger Tutorial Videos

The following video series provides a tour of Amazon SageMaker Debugger capabilities using SageMaker Studio and SageMaker notebook instances.

Debug Models with Amazon SageMaker Debugger in Studio

Author: Julien Simon, AWS Technical Evangelist | Length: 14 minutes 17 seconds

In this tutorial video, you can find out how to use Amazon SageMaker Debugger to capture and inspect debugging information from an example training model. The example model used in this video is a simple convolutional neural network (CNN) based on Keras with the TensorFlow backend. You will see how SageMaker's TensorFlow framework support and Debugger provide the functionality to build an estimator directly from the training script and debug the training job.

You can find the example notebook from the video in this Studio Demo repository provided by the author. You need to clone the debugger.ipynb notebook file and an example training script file to your SageMaker Studio or SageMaker notebook instance. After you clone the two files, specify the path keras_script_path to the training script inside the debugger.ipynb notebook. If you cloned the two files into the same directory, you can simply set keras_script_path = "".

Deep Dive on Amazon SageMaker Debugger and Amazon SageMaker Model Monitor

Author: Julien Simon, AWS Technical Evangelist | Length: 44 minutes 34 seconds

In this video session, you will explore advanced features of Amazon SageMaker, Debugger and Model Monitor, that help you boost both your productivity and the quality of your models. The first part of the video shows how to detect and fix training issues, visualize tensors, and improve models with SageMaker Debugger. The second part, starting at 22:41, shows how to monitor models in production and identify prediction issues such as missing features or data drift using SageMaker Model Monitor. Finally, the video offers cost optimization tips to help you make the most of your machine learning budget.

You can find the example notebook from the video in this AWS Dev Days 2020 repository offered by the author.