Debug and improve model performance - Amazon SageMaker

Debug and improve model performance

Training machine learning models, including deep neural networks and transformer models, centers on achieving stable model convergence. State-of-the-art models have millions, billions, or even trillions of parameters, so the number of operations needed to update those parameters during each iteration can easily become astronomical. To identify model convergence issues, you need access to the model parameters, activations, and gradients computed during optimization.

Amazon SageMaker provides two debugging tools to help identify such convergence issues and gain visibility into your models.

Amazon SageMaker with TensorBoard

To offer greater compatibility with open-source community tools within the SageMaker Training platform, SageMaker hosts TensorBoard as an application in a SageMaker domain. You can bring your training jobs to SageMaker and keep using the TensorBoard summary writer to collect the model output tensors. Because TensorBoard is integrated into the SageMaker domain, you can also manage user profiles under the domain in your AWS account and control them in detail by granting access to specific actions and resources. To learn more, see Use TensorBoard to debug and analyze training jobs in Amazon SageMaker.

Amazon SageMaker Debugger

Amazon SageMaker Debugger is a capability of SageMaker that provides tools for registering hooks on callbacks to extract model output tensors and save them to Amazon Simple Storage Service (Amazon S3). It provides built-in rules for detecting model convergence issues, such as overfitting, saturated activation functions, and vanishing gradients. You can also combine the built-in rules with Amazon CloudWatch Events and AWS Lambda to take automated actions when an issue is detected, and set up Amazon Simple Notification Service (Amazon SNS) to receive email or text notifications. To learn more, see Use Amazon SageMaker Debugger to debug and improve model performance.
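
To make the hook-and-rule idea concrete, the following is a minimal conceptual sketch in plain Python. It is not the SageMaker Debugger API: the TensorCollector class, the vanishing_gradient_steps function, the tensor name gradients/layer1, the threshold value, and the synthetic gradient data are all hypothetical and chosen only for illustration. The sketch shows one way a hook might save gradients at each step and how a rule might later scan the saved values for vanishing gradients.

```python
from collections import defaultdict
from statistics import fmean


class TensorCollector:
    """Hypothetical stand-in for a debugging hook: stores named values per step."""

    def __init__(self):
        # Maps tensor name -> {step: list of float values}.
        self.saved = defaultdict(dict)

    def save(self, name, step, values):
        """Record the values of one named tensor at one training step."""
        self.saved[name][step] = list(values)


def vanishing_gradient_steps(collector, name, threshold=1e-7):
    """Return the steps whose mean absolute gradient is below the threshold.

    The threshold is an illustrative assumption, not a documented default.
    """
    flagged = []
    for step, values in sorted(collector.saved[name].items()):
        if fmean([abs(v) for v in values]) < threshold:
            flagged.append(step)
    return flagged


collector = TensorCollector()

# Synthetic gradients that shrink by a factor of 1000 per step (made-up data).
for step in range(4):
    scale = 10.0 ** (-3 * step)
    collector.save("gradients/layer1", step, [0.5 * scale, -0.25 * scale, 0.1 * scale])

flagged = vanishing_gradient_steps(collector, "gradients/layer1")
print("Steps flagged for vanishing gradients:", flagged)
```

In an actual training job, the hook would be registered with the framework and the saved tensors would be written to Amazon S3, where the rule evaluation could run separately from training. The sketch keeps everything in memory to stay self-contained.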