
Considerations for Amazon SageMaker Debugger

Consider the following when using Amazon SageMaker Debugger.

Considerations for Distributed Training

The following list shows the scope of validity and considerations for using Debugger on training jobs with deep learning frameworks and various distributed training options.

Horovod

Scope of validity of using Debugger for training jobs with Horovod
| Deep Learning Framework | Apache MXNet | TensorFlow 1.x | TensorFlow 2.x | TensorFlow 2.x with Keras | PyTorch |
| --- | --- | --- | --- | --- | --- |
| Monitoring system bottlenecks | Yes | Yes | Yes | Yes | Yes |
| Profiling framework operations | No | No | No | Yes | Yes |
| Debugging model output tensors | Yes | Yes | Yes | Yes | Yes |

SageMaker distributed data parallel

Scope of validity of using Debugger for training jobs with SageMaker distributed data parallel
| Deep Learning Framework | TensorFlow 2.x | TensorFlow 2.x with Keras | PyTorch |
| --- | --- | --- | --- |
| Monitoring system bottlenecks | Yes | Yes | Yes |
| Profiling framework operations | No* | No** | Yes |
| Debugging model output tensors | Yes | Yes | Yes |

* Debugger does not support framework profiling for TensorFlow 2.x.

** SageMaker distributed data parallel does not support TensorFlow 2.x with Keras implementation.

SageMaker distributed model parallel – Debugger does not support SageMaker distributed model parallel training.
Distributed training with SageMaker checkpoints – Debugger is not available for training jobs when both the distributed training option and SageMaker checkpoints are enabled. You might see an error that looks like the following:
```
SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled
```
To use Debugger for training jobs with distributed training options, you need to disable SageMaker checkpointing and add manual checkpointing functions to your training script. For more information about using Debugger with distributed training options and checkpoints, see "Using SageMaker distributed data parallel with Amazon SageMaker Debugger and checkpoints" and "Saving Checkpoints".
Parameter Server – Debugger does not support parameter server-based distributed training.
Profiling distributed training framework operations, such as the AllReduce operation of SageMaker distributed data parallel and Horovod operations, is not available.
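The manual checkpointing mentioned above for distributed training jobs can be sketched roughly as follows. This is a minimal, framework-agnostic illustration, not Debugger's or SageMaker's own API: the function names `save_checkpoint` and `load_latest_checkpoint`, the file naming scheme, and the use of `pickle` are all assumptions made for the example. In a real training script you would typically use your framework's own save and load calls and take the rank from your distributed training library.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, checkpoint_dir, step, rank=0):
    """Write `state` to `checkpoint_dir` from rank 0 only (hypothetical helper).

    Returns the checkpoint path, or None when called from a non-zero rank.
    """
    if rank != 0:
        # Only one worker writes, so workers do not overwrite each other.
        return None
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Zero-padded step numbers sort correctly as strings (up to 999999 steps).
    final_path = os.path.join(checkpoint_dir, f"checkpoint-{step:06d}.pkl")
    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a truncated checkpoint under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(checkpoint_dir):
    """Return the state from the highest-numbered checkpoint, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    names = sorted(
        n for n in os.listdir(checkpoint_dir)
        if n.startswith("checkpoint-") and n.endswith(".pkl")
    )
    if not names:
        return None
    with open(os.path.join(checkpoint_dir, names[-1]), "rb") as f:
        return pickle.load(f)
```

A training loop would call `save_checkpoint` every few steps and call `load_latest_checkpoint` once at startup to resume.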

Considerations for Monitoring System Bottlenecks and Profiling Framework Operations

For AWS TensorFlow, data loader metrics cannot be collected using the default local_path setting of the FrameworkProfile class. The path has to be manually configured and end in "/". For example:
```
FrameworkProfile(local_path="/opt/ml/output/profiler/")
```
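Because the path must end in "/", one option is to normalize it before passing it to `FrameworkProfile`. The helper below is a small sketch; the name `profiler_local_path` is hypothetical and is not part of the SageMaker Python SDK.

```python
def profiler_local_path(path):
    """Return `path` with exactly one trailing "/" (hypothetical helper)."""
    if not path:
        raise ValueError("path must be a non-empty string")
    return path.rstrip("/") + "/"


local_path = profiler_local_path("/opt/ml/output/profiler")
# The result would then be passed on, for example:
# FrameworkProfile(local_path=local_path)
```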
For AWS TensorFlow, the data loader profiling configuration cannot be updated while a training job is running.
For AWS TensorFlow, a NoneType error might occur when you use analysis tools and notebook examples with TensorFlow 2.3 training jobs and the detailed profiling option.
Python profiling and detailed profiling are supported only for the Keras API.
To access the deep profiling feature for TensorFlow and PyTorch, you must currently specify the latest AWS deep learning container images with CUDA 11. For example, specify the image URI in the TensorFlow or PyTorch estimator as follows:
- For TensorFlow
```
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04"
```
- For PyTorch
```
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04"
```
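The two image URIs above share one pattern, so they can be built from a region, a framework name, and an image tag. The following is a minimal sketch: the function name `dlc_image_uri` is hypothetical, and the check for `-cu11` in the tag is an assumption based only on the two tags shown above, not a documented naming rule.

```python
DLC_ACCOUNT = "763104351884"  # registry account used in the URIs above


def dlc_image_uri(region, framework, tag):
    """Build a training image URI following the pattern shown above.

    `framework` is "tensorflow" or "pytorch"; `tag` is the full image tag.
    """
    # Assumption: CUDA 11 images carry "-cu11" in their tag, as in the
    # two examples above.
    if "-cu11" not in tag:
        raise ValueError(f"tag does not look like a CUDA 11 image: {tag}")
    return f"{DLC_ACCOUNT}.dkr.ecr.{region}.amazonaws.com/{framework}-training:{tag}"


tf_uri = dlc_image_uri("us-west-2", "tensorflow", "2.3.1-gpu-py37-cu110-ubuntu18.04")
pt_uri = dlc_image_uri("us-west-2", "pytorch", "1.6.0-gpu-py36-cu110-ubuntu18.04")
```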

Considerations for Debugging Model Output Tensors

Avoid using functional API operations. Debugger cannot collect model output tensors from PyTorch and MXNet training scripts composed of functional API operations.
- Debugger cannot collect model output tensors from the torch.nn.functional API operations. When you write a PyTorch training script, use the torch.nn modules instead.
- Debugger cannot collect model output tensors from MXNet functional objects in hybrid blocks. For example, the ReLU activation (F.relu) outputs cannot be collected from the following example of mxnet.gluon.HybridBlock with F in the hybrid_forward function.
```
import mxnet as mx
from mxnet.gluon import HybridBlock, nn

class Model(HybridBlock):
    def __init__(self, **kwargs):
        super(Model, self).__init__(**kwargs)
        # use name_scope to give child Blocks appropriate names.
        with self.name_scope():
            self.dense0 = nn.Dense(20)
            self.dense1 = nn.Dense(20)

    def hybrid_forward(self, F, x):
        x = F.relu(self.dense0(x))
        return F.relu(self.dense1(x))

model = Model()
model.initialize(ctx=mx.cpu(0))
model.hybridize()
model(mx.nd.zeros((10, 10), ctx=mx.cpu(0)))
```
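For the PyTorch recommendation above, the following sketch shows a model that uses `torch.nn.ReLU` modules in place of `torch.nn.functional.relu`. To illustrate why this matters, it registers plain PyTorch forward hooks on each child module and records their outputs. This is only an analogy under the assumption that output collection works at the module level; it does not use Debugger itself, and the `captured` dictionary and `make_hook` helper are hypothetical names for this example.

```python
import torch
import torch.nn as nn


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense0 = nn.Linear(10, 20)
        self.relu0 = nn.ReLU()  # a module, so its output can be hooked
        self.dense1 = nn.Linear(20, 20)
        self.relu1 = nn.ReLU()

    def forward(self, x):
        x = self.relu0(self.dense0(x))
        return self.relu1(self.dense1(x))


model = Model()
captured = {}


def make_hook(name):
    # Forward hooks receive (module, inputs, output) after each forward call.
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook


for name, module in model.named_children():
    module.register_forward_hook(make_hook(name))

output = model(torch.zeros(10, 10))
```

After the forward pass, `captured` holds one entry per child module, including both ReLU activations. A call to `torch.nn.functional.relu` inside `forward` has no module to attach a hook to, so its output would not appear there.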
