Considerations for Amazon SageMaker Debugger
Consider the following when using Amazon SageMaker Debugger.
Considerations for Distributed Training
The following list shows the scope of validity and considerations for using Debugger on training jobs with deep learning frameworks and various distributed training options.
-
Horovod
Scope of validity of using Debugger for training jobs with Horovod

| Deep Learning Framework | Apache MXNet | TensorFlow 1.x | TensorFlow 2.x | TensorFlow 2.x with Keras | PyTorch |
| --- | --- | --- | --- | --- | --- |
| Monitoring system bottlenecks | Yes | Yes | Yes | Yes | Yes |
| Profiling framework operations | No | No | No | Yes | Yes |
| Debugging model output tensors | Yes | Yes | Yes | Yes | Yes |

-
SageMaker distributed data parallel
Scope of validity of using Debugger for training jobs with SageMaker distributed data parallel

| Deep Learning Framework | TensorFlow 2.x | TensorFlow 2.x with Keras | PyTorch |
| --- | --- | --- | --- |
| Monitoring system bottlenecks | Yes | Yes | Yes |
| Profiling framework operations | No* | No** | Yes |
| Debugging model output tensors | Yes | Yes | Yes |

* Debugger does not support framework profiling for TensorFlow 2.x.
** SageMaker distributed data parallel does not support TensorFlow 2.x with Keras implementation.
-
SageMaker distributed model parallel – Debugger does not support SageMaker distributed model parallel training.
-
Distributed training with SageMaker checkpoints – Debugger is not available for training jobs when both the distributed training option and SageMaker checkpoints are enabled. You might see an error that looks like the following:
SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled
To use Debugger for training jobs with distributed training options, you need to disable SageMaker checkpointing and add manual checkpointing functions to your training script. For more information about using Debugger with distributed training options and checkpoints, see Using SageMaker distributed data parallel with Amazon SageMaker Debugger and checkpoints and Saving Checkpoints.
-
Parameter Server – Debugger does not support parameter server-based distributed training.
-
Profiling distributed training framework operations, such as the AllReduce operation of SageMaker distributed data parallel and Horovod operations, is not available.
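The manual checkpointing mentioned above can take many forms. The following is a minimal, framework-agnostic sketch, assuming the training state can be pickled and that only one worker (rank 0) writes checkpoints. The function names, the rank argument, and the file naming scheme are hypothetical and are not part of the Debugger or SageMaker APIs; a real training script would typically use its framework's own save and load utilities instead of pickle.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, checkpoint_dir, step, rank=0):
    """Write state to checkpoint_dir/step-<step>.pkl from rank 0 only."""
    if rank != 0:
        return None  # other workers skip writing to avoid clobbering the file
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Zero-padded step numbers keep lexicographic and numeric order aligned.
    final_path = os.path.join(checkpoint_dir, f"step-{step:08d}.pkl")
    # Write to a temporary file first, then rename, so a crash never
    # leaves a half-written checkpoint under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(checkpoint_dir):
    """Return the state from the highest-numbered checkpoint, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    names = sorted(n for n in os.listdir(checkpoint_dir) if n.endswith(".pkl"))
    if not names:
        return None
    with open(os.path.join(checkpoint_dir, names[-1]), "rb") as f:
        return pickle.load(f)


# Example usage with a throwaway directory.
demo_dir = tempfile.mkdtemp()
save_checkpoint({"epoch": 1, "weights": [0.1, 0.2]}, demo_dir, step=100)
save_checkpoint({"epoch": 2, "weights": [0.3, 0.4]}, demo_dir, step=200)
resumed = load_latest_checkpoint(demo_dir)
```

Writing from a single rank and renaming the file only after the write completes are design choices for this sketch, not requirements stated by Debugger.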
Considerations for Monitoring System Bottlenecks and Profiling Framework Operations
-
For AWS TensorFlow, data loader metrics cannot be collected using the default local_path setting of the FrameworkProfile class. The path has to be manually configured and end in "/". For example: FrameworkProfile(local_path="/opt/ml/output/profiler/")
-
For AWS TensorFlow, the data loader profiling configuration cannot be updated while a training job is running.
-
For AWS TensorFlow, a NoneType error might occur when you use analysis tools and notebook examples with TensorFlow 2.3 training jobs and the detailed profiling option.
-
Python profiling and detailed profiling are supported only for the Keras API.
-
To access the deep profiling feature for TensorFlow and PyTorch, you must currently specify the latest AWS deep learning container images with CUDA 11. For example, specify the image URI in the TensorFlow and PyTorch estimators as follows:
-
For TensorFlow
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04"
-
For PyTorch
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04"
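The path and image requirements above can be captured in two small helpers. This is a sketch under stated assumptions: the helper names and the us-west-2 Region are hypothetical, the image tags and account ID are the ones listed above, and the SDK calls shown in the trailing comment assume the SageMaker Python SDK is installed, so they are not executed here.

```python
def ensure_trailing_slash(path):
    """Return path with exactly one trailing '/', as local_path requires."""
    return path.rstrip("/") + "/"


# Image tags copied from the list above, keyed by framework name.
CUDA11_IMAGE_TAGS = {
    "tensorflow": "tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04",
    "pytorch": "pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04",
}


def cuda11_image_uri(framework, region):
    """Build the deep learning container image URI for a Region."""
    tag = CUDA11_IMAGE_TAGS[framework]
    # Account ID as shown above; some Regions use a different registry account.
    return f"763104351884.dkr.ecr.{region}.amazonaws.com/{tag}"


local_path = ensure_trailing_slash("/opt/ml/output/profiler")
image_uri = cuda11_image_uri("pytorch", "us-west-2")

# With the SageMaker Python SDK installed, these values would be passed on
# roughly as follows (not executed here):
#   from sagemaker.debugger import ProfilerConfig, FrameworkProfile
#   profiler_config = ProfilerConfig(
#       framework_profile_params=FrameworkProfile(local_path=local_path)
#   )
#   estimator = PyTorch(..., image_uri=image_uri, profiler_config=profiler_config)
```

Normalizing the path in one place avoids silently losing data loader metrics because of a missing trailing slash.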
Considerations for Debugging Model Output Tensors
-
Avoid using functional API operations. Debugger cannot collect model output tensors from PyTorch and MXNet training scripts composed of functional API operations.
-
Debugger cannot collect model output tensors from the torch.nn.functional API operations. When you write a PyTorch training script, it is recommended to use the torch.nn modules instead.
-
Debugger cannot collect model output tensors from MXNet functional objects in hybrid blocks. For example, the ReLU activation (F.relu) outputs cannot be collected from the following example of mxnet.gluon.HybridBlock with F in the hybrid_forward function.

    import mxnet as mx
    from mxnet.gluon import HybridBlock, nn

    class Model(HybridBlock):
        def __init__(self, **kwargs):
            super(Model, self).__init__(**kwargs)
            # use name_scope to give child Blocks appropriate names.
            with self.name_scope():
                self.dense0 = nn.Dense(20)
                self.dense1 = nn.Dense(20)

        def hybrid_forward(self, F, x):
            x = F.relu(self.dense0(x))
            return F.relu(self.dense1(x))

    model = Model()
    model.initialize(ctx=mx.cpu(0))
    model.hybridize()
    model(mx.nd.zeros((10, 10), ctx=mx.cpu(0)))
-