The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see Supported frameworks.
The following steps show you how to modify a TensorFlow training script to utilize SageMaker AI's distributed data parallel library.
The library APIs are designed to be similar to Horovod APIs. For additional details on each API that the library offers for TensorFlow, see the SageMaker AI distributed data parallel TensorFlow API documentation.
SageMaker AI distributed data parallel is adaptable to TensorFlow training scripts composed of tf core modules except tf.keras modules. SageMaker AI distributed data parallel does not support TensorFlow with a Keras implementation.
The SageMaker AI distributed data parallelism library supports Automatic Mixed Precision (AMP) out of the box. No extra action is needed to enable AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker AI data parallelism library runs its AllReduce operation in FP16. For more information about implementing AMP APIs in your training script, see the following resources:
Frameworks - TensorFlow in the NVIDIA Deep Learning Performance documentation
Automatic Mixed Precision for Deep Learning in the NVIDIA Developer Docs
TensorFlow mixed precision APIs in the TensorFlow documentation
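To make the FP16 point concrete, the following sketch uses only the Python standard library to compare how many bytes the same gradient values occupy in FP32 and in FP16. The helper name payload_bytes and the sample values are illustrative assumptions. The snippet does not call the SMDDP library and makes no claim about its internals.

```python
import struct


def payload_bytes(values, fmt):
    """Return the number of bytes needed to pack `values` with struct format `fmt`."""
    return len(struct.pack(f"<{len(values)}{fmt}", *values))


# Hypothetical gradient values, chosen only for illustration.
grads = [0.5, -0.25, 0.125, 1.0]

fp32_size = payload_bytes(grads, "f")  # "f" is IEEE 754 single precision, 4 bytes per value
fp16_size = payload_bytes(grads, "e")  # "e" is IEEE 754 half precision, 2 bytes per value

print(f"FP32 payload: {fp32_size} bytes, FP16 payload: {fp16_size} bytes")
```

Under these assumptions, exchanging gradients in FP16 moves half as many bytes per value as FP32.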
Import the library's TensorFlow client and initialize it.

import smdistributed.dataparallel.tensorflow as sdp
sdp.init()
Pin each GPU to a single process with local_rank, which refers to the relative rank of the process within a given node. The library's local_rank() API provides you with the local rank of the device. The leader node is rank 0, and the worker nodes are rank 1, 2, 3, and so on. This is invoked in the following code block as sdp.local_rank(). The set_memory_growth call is not directly related to the SageMaker AI distributed data parallel library, but must be set for distributed training with TensorFlow.

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')
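The relationship between a process's global rank and its local rank can be sketched in plain Python. This is a minimal illustration that assumes every node has the same number of GPUs and that global ranks are assigned contiguously node by node. The function names and the 2-node, 8-GPU cluster are hypothetical, and the snippet does not call the SMDDP library.

```python
def local_rank(global_rank: int, gpus_per_node: int) -> int:
    """Position of a process among the GPUs of its own node."""
    return global_rank % gpus_per_node


def node_index(global_rank: int, gpus_per_node: int) -> int:
    """Index of the node that hosts the process."""
    return global_rank // gpus_per_node


# Hypothetical cluster: 2 nodes with 8 GPUs each, so 16 processes in total.
GPUS_PER_NODE = 8
placement = [
    (rank, node_index(rank, GPUS_PER_NODE), local_rank(rank, GPUS_PER_NODE))
    for rank in range(16)
]

for rank, node, local in placement:
    print(f"global rank {rank:2d} -> node {node}, local rank {local}")
```

Because the local rank restarts at 0 on every node, it can be used directly as an index into the list of GPUs visible on that node.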
Scale the learning rate by the number of workers. The library's size() API provides you the number of workers in the cluster. This is invoked in the following code block as sdp.size().

learning_rate = learning_rate * sdp.size()
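The scaling step amounts to a single multiplication, sketched below without the library. The helper name scale_learning_rate and the numeric values are assumptions chosen for illustration. In a real script, the worker count would come from sdp.size().

```python
def scale_learning_rate(base_lr: float, num_workers: int) -> float:
    """Linearly scale a base learning rate by the number of workers."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return base_lr * num_workers


# Hypothetical values: a base rate of 0.001 on a cluster of 8 workers.
scaled_lr = scale_learning_rate(0.001, 8)
print(f"scaled learning rate: {scaled_lr:.4f}")
```

With one worker the rate is unchanged, so the same script can run on a single device without modification.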
Use the library's DistributedGradientTape to optimize AllReduce operations during training. This wraps tf.GradientTape.

with tf.GradientTape() as tape:
    output = model(input)
    loss_value = loss(label, output)

# SageMaker AI data parallel: Wrap tf.GradientTape with the library's DistributedGradientTape
tape = sdp.DistributedGradientTape(tape)
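Conceptually, an AllReduce over gradients leaves every worker holding the same combined values. The following pure-Python simulation assumes the combination is an element-wise average, as in Horovod-style data parallelism. The function name and the sample gradients are hypothetical, and this is a sketch of the idea rather than the library's implementation.

```python
def allreduce_average(per_worker_grads):
    """Simulate AllReduce: average gradients element-wise across workers
    and hand every worker its own copy of the averaged result."""
    num_workers = len(per_worker_grads)
    averaged = [sum(values) / num_workers for values in zip(*per_worker_grads)]
    return [list(averaged) for _ in range(num_workers)]


# Hypothetical gradients for two parameters, computed on four workers
# from four different shards of a batch.
worker_grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]

synced_grads = allreduce_average(worker_grads)
for rank, grads in enumerate(synced_grads):
    print(f"rank {rank}: {grads}")
```

Every worker then applies identical gradients, which keeps the model replicas in step from one iteration to the next.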
Broadcast the initial model variables from the leader node (rank 0) to all the worker nodes (ranks 1 through n). This is needed to ensure a consistent initialization across all the worker ranks. Use the library's broadcast_variables() API after the model and optimizer variables are initialized. This is invoked in the following code block as sdp.broadcast_variables().

sdp.broadcast_variables(model.variables, root_rank=0)
sdp.broadcast_variables(opt.variables(), root_rank=0)
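The effect of a broadcast can be illustrated with a small simulation in which each worker starts from different random-looking values. The function simulate_broadcast and the sample variables are hypothetical stand-ins, and the snippet does not use the SMDDP library.

```python
import copy


def simulate_broadcast(per_worker_vars, root_rank=0):
    """Return new per-worker state in which every worker holds a copy
    of the variables owned by `root_rank`."""
    root_vars = per_worker_vars[root_rank]
    return [copy.deepcopy(root_vars) for _ in per_worker_vars]


# Hypothetical initial weights: each worker initialized its model independently.
initial_vars = [
    {"w": [0.11, -0.42], "b": [0.0]},   # rank 0 (leader)
    {"w": [0.73, 0.05], "b": [0.0]},    # rank 1
    {"w": [-0.28, 0.91], "b": [0.0]},   # rank 2
]

synced_vars = simulate_broadcast(initial_vars, root_rank=0)
for rank, variables in enumerate(synced_vars):
    print(f"rank {rank}: {variables}")
```

After the broadcast, all workers start training from the leader's values, so averaged gradients move every replica along the same trajectory.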
Finally, modify your script to save checkpoints only on the leader node. The leader node has a synchronized model. Saving from a single node also prevents worker nodes from overwriting and possibly corrupting the checkpoints.

if sdp.rank() == 0:
    ...  # save the checkpoint here
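The leader-only guard can be tried out without a cluster by passing the rank in as a parameter. In the following sketch, save_checkpoint_if_leader, the JSON file format, and the sample state are all illustrative assumptions. A real script would read the rank from sdp.rank() and use its own checkpointing mechanism.

```python
import json
import os
import tempfile


def save_checkpoint_if_leader(rank, state, path):
    """Write `state` to `path` only when called by rank 0.
    Returns True if a checkpoint was written, False otherwise."""
    if rank != 0:
        return False
    with open(path, "w") as f:
        json.dump(state, f)
    return True


# Simulate four processes that all reach the checkpointing step.
checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "checkpoint.json")
state = {"step": 100}  # hypothetical training state

results = [
    save_checkpoint_if_leader(rank, state, checkpoint_path) for rank in range(4)
]
print(results)
```

Only one process touches the file, so there is no risk of concurrent writers interleaving their output.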
The following is an example TensorFlow training script for distributed training with the library.
import tensorflow as tf

# SageMaker AI data parallel: Import the library TF API
import smdistributed.dataparallel.tensorflow as sdp

# SageMaker AI data parallel: Initialize the library
sdp.init()

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    # SageMaker AI data parallel: Pin GPUs to a single library process
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

# Prepare Dataset
dataset = ...

# Define Model
mnist_model = tf.keras.Sequential(...)
loss = tf.losses.SparseCategoricalCrossentropy()

# SageMaker AI data parallel: Scale Learning Rate
# LR for 8 node run : 0.000125
# LR for single node run : 0.001
opt = tf.optimizers.Adam(0.000125 * sdp.size())

def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = mnist_model(images, training=True)
        loss_value = loss(labels, probs)

    # SageMaker AI data parallel: Wrap tf.GradientTape with the library's DistributedGradientTape
    tape = sdp.DistributedGradientTape(tape)

    grads = tape.gradient(loss_value, mnist_model.trainable_variables)
    opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

    if first_batch:
        # SageMaker AI data parallel: Broadcast model and optimizer variables
        sdp.broadcast_variables(mnist_model.variables, root_rank=0)
        sdp.broadcast_variables(opt.variables(), root_rank=0)

    return loss_value

# SageMaker AI data parallel: Save checkpoints only from the leader node.
if sdp.rank() == 0:
    ...  # save the checkpoint here
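The example script defines training_step with a first_batch argument but does not show the loop that calls it. One way to drive it is sketched below with a stand-in step function and a stand-in dataset, both hypothetical, so that the snippet runs without TensorFlow or the library. The point is only that first_batch is true on the first iteration, which is when the variables exist and can be broadcast.

```python
def training_step(images, labels, first_batch):
    """Stand-in for the real training step: it only reports whether
    the broadcast branch would have run on this call."""
    return {"broadcast_ran": first_batch}


# Hypothetical dataset of three (images, labels) batches.
dataset = [
    ("images_0", "labels_0"),
    ("images_1", "labels_1"),
    ("images_2", "labels_2"),
]

broadcast_flags = []
for batch, (images, labels) in enumerate(dataset):
    # Only the very first batch triggers the variable broadcast.
    result = training_step(images, labels, first_batch=(batch == 0))
    broadcast_flags.append(result["broadcast_ran"])

print(broadcast_flags)
```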
After you have completed adapting your training script, move on to Launching distributed training jobs with SMDDP using the SageMaker Python SDK.