Troubleshooting for distributed training in Amazon SageMaker - Amazon SageMaker

Troubleshooting for distributed training in Amazon SageMaker

If you have problems running a training job when you use the library, use the following list to troubleshoot. If you need further support, reach out to the SageMaker team through AWS Support Center or AWS Developer Forums for Amazon SageMaker.

Using SageMaker distributed data parallel with Amazon SageMaker Debugger and checkpoints

To monitor system bottlenecks, profile framework operations, and debug model output tensors for training jobs with SageMaker distributed data parallel, use Amazon SageMaker Debugger.

However, when you use SageMaker Debugger, SageMaker distributed data parallel, and SageMaker checkpoints, you might see an error that looks like the following example.

SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled

This is due to an internal error between Debugger and checkpoints, which occurs when you enable SageMaker distributed data parallel.

  • If you enable all three features, SageMaker Python SDK automatically turns off Debugger by passing debugger_hook_config=False, which is equivalent to the following framework estimator example.

    bucket = sagemaker.Session().default_bucket()
    base_job_name = "sagemaker-checkpoint-test"
    checkpoint_in_bucket = "checkpoints"

    # The S3 URI to store the checkpoints
    checkpoint_s3_bucket = "s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

    estimator = TensorFlow(
        ...,
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
        checkpoint_s3_uri=checkpoint_s3_bucket,
        checkpoint_local_path="/opt/ml/checkpoints",
        debugger_hook_config=False
    )
  • If you want to keep using both SageMaker distributed data parallel and SageMaker Debugger, a workaround is manually adding checkpointing functions to your training script instead of specifying the checkpoint_s3_uri and checkpoint_local_path parameters from the estimator. For more information about setting up manual checkpointing in a training script, see Saving Checkpoints.
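The manual checkpointing the workaround describes can be sketched as follows. This is a minimal example, not a SageMaker API: `save_checkpoint` and `load_checkpoint` are hypothetical helper names, and the model and optimizer are generic placeholders. `/opt/ml/checkpoints` is the local directory that SageMaker syncs to Amazon S3 during training.

```python
# Minimal sketch of manual checkpointing in a PyTorch training script.
# save_checkpoint/load_checkpoint are hypothetical helpers, not SageMaker APIs.
import os
import torch

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # local path SageMaker syncs to S3

def save_checkpoint(model, optimizer, epoch, path=CHECKPOINT_DIR):
    """Write model and optimizer state so training can resume after a restart."""
    os.makedirs(path, exist_ok=True)
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        os.path.join(path, f"checkpoint-{epoch}.pt"),
    )

def load_checkpoint(model, optimizer, filename):
    """Restore states saved by save_checkpoint; return the epoch to resume from."""
    checkpoint = torch.load(filename)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]
```

Call `save_checkpoint` periodically from your training loop (for example, at the end of each epoch) instead of passing `checkpoint_s3_uri` and `checkpoint_local_path` to the estimator.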

An unexpected prefix attached to model parameter keys

For PyTorch distributed training jobs, an unexpected prefix (model for example) might be attached to state_dict keys (model parameters). The SageMaker data parallel library does not directly alter or prepend any model parameter names when PyTorch training jobs save model artifacts. Rather, PyTorch's distributed training renames the keys in the state_dict, prepending the prefix, so the parameters can be sent over the network. If you encounter a model loading failure due to mismatched parameter names while you are using the SageMaker data parallel library and checkpointing for PyTorch training, adapt the following example code to remove the prefix at the step where you load checkpoints in your training script.

state_dict = {k.partition('model.')[2]:state_dict[k] for k in state_dict.keys()}

This takes each state_dict key as a string, splits it at the first occurrence of 'model.', and keeps the third item (index 2) of the resulting 3-tuple, which is the part of the key after the prefix.
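The effect of that dictionary comprehension can be seen with a small self-contained example; the keys below are hypothetical stand-ins for real parameter names.

```python
# Demonstrates stripping a leading "model." prefix from state_dict keys.
# The keys and values here are hypothetical examples, not real parameters.
state_dict = {
    "model.fc.weight": "w",
    "model.fc.bias": "b",
}

# partition("model.") returns ("", "model.", "fc.weight") for a prefixed key,
# so index 2 is the key with the prefix removed.
cleaned = {k.partition("model.")[2]: v for k, v in state_dict.items()}
print(cleaned)  # {'fc.weight': 'w', 'fc.bias': 'b'}
```

Note that for a key that does not contain 'model.', partition returns the whole key at index 0 and an empty string at index 2, so this pattern is only safe when every key carries the prefix.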

For more information about the prefix issue, see a discussion thread at Prefix parameter names in saved model if trained by multi-GPU? in the PyTorch discussion forum.

For more information about the PyTorch methods for saving and loading models, see Saving & Loading Model Across Devices in the PyTorch documentation.

SageMaker distributed training job stalling during initialization

If your SageMaker distributed data parallel training job stalls during initialization when using EFA-enabled instances, this might be due to a misconfiguration in the security group of the VPC subnet that's used for the training job. EFA requires a proper security group configuration to enable traffic between the nodes.

To configure inbound and outbound rules for the security group
  1. Sign in to the AWS Management Console and open the Amazon VPC console at

  2. Choose Security Groups in the left navigation pane.

  3. Select the security group that's tied to the VPC subnet you use for training.

  4. In the Details section, copy the Security group ID.

  5. On the Inbound rules tab, choose Edit inbound rules.

  6. On the Edit inbound rules page, do the following:

    1. Choose Add rule.

    2. For Type, choose All traffic.

    3. For Source, choose Custom, paste the security group ID into the search box, and select the security group that pops up.

  7. Choose Save rules to finish configuring the inbound rule for the security group.

  8. On the Outbound rules tab, choose Edit outbound rules.

  9. Repeat steps 6 and 7 to add the same rule as an outbound rule.

After you complete the preceding steps for configuring the security group with the inbound and outbound rules, re-run the training job and check whether the stalling issue is resolved.
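The same self-referencing rules can also be added programmatically. The following is a sketch using the AWS SDK for Python (Boto3); the security group ID is a placeholder, and `allow_self_traffic` is a hypothetical helper name.

```python
# Sketch: add self-referencing all-traffic inbound and outbound rules to a
# security group, as EFA requires. allow_self_traffic is a hypothetical helper.
def allow_self_traffic(sg_id, ec2=None):
    if ec2 is None:
        # Assumes boto3 is installed and AWS credentials are configured.
        import boto3
        ec2 = boto3.client("ec2")
    permissions = [{
        "IpProtocol": "-1",  # -1 means all protocols and ports (All traffic)
        "UserIdGroupPairs": [{"GroupId": sg_id}],  # self-referencing rule
    }]
    ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=permissions)
    ec2.authorize_security_group_egress(GroupId=sg_id, IpPermissions=permissions)

# Example (placeholder ID): allow_self_traffic("sg-0123456789abcdef0")
```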

For more information about configuring security groups for VPC and EFA, see Security groups for your VPC and Elastic Fabric Adapter.

SageMaker distributed training job stalling at the end of training

One of the root causes of stalling issues at the end of training is a mismatch in the number of batches that are processed per epoch across different ranks. All workers (GPUs) synchronize their local gradients in the backward pass to ensure they all have the same copy of the model at the end of the batch iteration. If the batch sizes are unevenly assigned to different worker groups during the final epoch of training, the training job stalls. For example, while a group of workers (group A) finishes processing all batches and exits the training loop, another group of workers (group B) starts processing another batch and still expects communication from group A to synchronize the gradients. This causes group B to wait for group A, which already completed training and does not have any gradients to synchronize.

Therefore, when setting up your training dataset, it is important that each worker gets the same number of data samples so that each worker goes through the same number of batches while training. Make sure each rank gets the same number of batches to avoid this stalling issue.
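One way to enforce an equal batch count per rank in PyTorch is a DistributedSampler with drop_last enabled, which trims the dataset so every worker receives the same share. This is a sketch with a toy dataset and explicit rank arguments; in a real job, num_replicas and rank come from the process group.

```python
# Sketch: give every rank the same number of batches so no worker stalls
# waiting for gradient synchronization. Dataset and sizes are toy examples.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(103, 4))  # 103 is not divisible by 4 workers

# drop_last=True trims the tail so all 4 ranks get the same sample count.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, drop_last=True)

# drop_last=True on the DataLoader also drops a final partial batch.
loader = DataLoader(dataset, batch_size=8, sampler=sampler, drop_last=True)
# Each of the 4 ranks now iterates over the same number of batches.
```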

Observing scaling efficiency degradation due to Amazon FSx throughput bottlenecks

One potential cause of lowered scaling efficiency is the FSx throughput limit. If you observe a sudden drop in scaling efficiency when you switch to a larger training cluster, try using a larger FSx for Lustre file system with a higher throughput limit. For more information, see Aggregate file system performance and Managing storage and throughput capacity in the Amazon FSx for Lustre User Guide.

SageMaker distributed training job with PyTorch returns deprecation warnings

Starting with v1.4.0, the SageMaker distributed data parallelism library works as a backend of PyTorch distributed. Because of this breaking change in how the library is used with PyTorch, you might encounter a warning message that the smdistributed APIs for the PyTorch distributed package are deprecated. The warning message should be similar to the following:

smdistributed.dataparallel.torch.dist is deprecated in the SageMaker distributed data parallel library v1.4.0+. Please use torch.distributed and specify 'smddp' as a backend when initializing process group as follows: torch.distributed.init_process_group(backend='smddp') For more information, see the library's API documentation at

In v1.4.0 and later, the library only needs to be imported once at the top of your training script and set as the backend during the PyTorch distributed initialization. With the single line of backend specification, you can keep your PyTorch training script unchanged and directly use the PyTorch distributed modules. See Use the SMDDP library in your PyTorch training script to learn about the breaking changes and the new way to use the library with PyTorch.
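The pattern the warning message and this section describe amounts to the following sketch. It only runs inside a SageMaker training job where the library is installed, so it is shown here for illustration.

```python
# Sketch: v1.4.0+ usage. Import the library once to register the "smddp"
# backend, then use the standard PyTorch distributed modules unchanged.
# Runnable only in a SageMaker training container with the library installed.
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend

dist.init_process_group(backend="smddp")
```

The rest of the training script can then call the regular torch.distributed APIs without any smdistributed-specific wrappers.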