
SageMaker HyperPod cluster resiliency

SageMaker HyperPod provides the following cluster resiliency features.

Cluster health check

This section describes the set of health checks that SageMaker HyperPod uses to regularly monitor cluster instance health for issues with devices such as accelerators (GPU and Trainium cores) and networking (EFA).

Category | Utility name | Instance type compatibility | Description
Accelerator | DCGM policies | GPU | Each instance in the cluster continuously monitors all GPU-related policies, including XID errors, with NVIDIA DCGM.
Accelerator | NVIDIA SMI | GPU | The nvidia-smi utility is a well-known CLI to manage and monitor GPUs. The built-in health checker parses the output from nvidia-smi to determine the health of the instance.
Accelerator | Neuron sysfs | Trainium | For Trainium-powered instances, the health of the Neuron devices is determined by reading counters from Neuron sysfs, propagated directly by the Neuron driver.
Network | EFA | GPU and Trainium | To aid in the diagnostics of Elastic Fabric Adapter (EFA) devices, the EFA health checker runs a series of connectivity tests using all available EFA cards within the instance.
Stress | DCGM diagnostics | GPU | DCGM diagnostics level 2 is used to exercise the GPUs in the system and put them under pressure to get a thorough insight into their health.
Stress | CPU stress | GPU and Trainium | CPU health is determined using the Linux stress tool, which runs multiple threads to achieve 100% CPU utilization and perform I/O operations.
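
For reference, the following commands show roughly equivalent checks that you can run manually on a cluster instance to spot-check device health. This is a minimal sketch, assuming the NVIDIA tools (nvidia-smi, DCGM), the Neuron SDK tools, the EFA libfabric utilities, and the Linux stress tool are installed on the instance; it is not the built-in health checker itself.

# GPU status and utilization (GPU instances)
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv

# Recent XID errors reported by the NVIDIA driver (GPU instances)
sudo dmesg | grep -i xid

# DCGM diagnostics at level 2 (GPU instances)
dcgmi diag -r 2

# List Neuron devices and their status (Trainium instances)
neuron-ls

# List available EFA interfaces through libfabric (GPU and Trainium instances)
fi_info -p efa

# Put all CPU cores under load for 60 seconds with the Linux stress tool
stress --cpu $(nproc) --timeout 60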

Auto-resume

This section describes how to run a training job with the SageMaker HyperPod auto-resume functionality, which provides a zero-touch resiliency infrastructure to automatically recover a training job from the last saved checkpoint in the event of a hardware failure for clusters with more than 16 nodes.

With the auto-resume functionality, if a job fails due to a hardware failure or a transient issue during training, SageMaker HyperPod auto-resume starts the node replacement workflow and restarts the job after the faulty nodes are replaced.

Using the SageMaker HyperPod auto-resume functionality with Slurm

When you use SageMaker HyperPod auto-resume with Slurm, you should run the job inside an exclusive allocation acquired either by using salloc or sbatch. In either case, you need to modify the entrypoint script to make sure that all setup steps run in a single srun command when resuming the job. Through the entrypoint script, it is important to set up the environment on the replaced node to be consistent with the environment that the job step was running in before it was stopped. The following procedure shows how to prepare an entrypoint script that keeps the environment consistent and run it as a single srun command.

Tip

If you use sbatch, you can keep the batch script simple by creating a separate script for setting up the environment and using a single srun command.

  1. Create a script using the following code example and save it as train_auto_resume.sh. This script deploys the training environment setup, assuming that no manual configuration was previously made to the replaced node. This ensures that the environment is node-agnostic, so that when a node is replaced, the same environment is provisioned on the node before the job resumes.

    Note

    The following code example shows how to discover the Slurm node list associated with the job. Do not use the $SLURM_JOB_NODELIST environment variable provided by Slurm, because its value might be outdated after SageMaker HyperPod auto-resumes the job. The following code example shows how to define a new NODE_LIST variable to replace SLURM_JOB_NODELIST, and then set up the MASTER_NODE and MASTER_ADDR variables based on the NODE_LIST variable.

    #!/bin/bash
    # Filename: train_auto_resume.sh
    # Sample containerized script to launch a training job with a single srun
    # which can be auto-resumed.

    # Place your training environment setup here.
    # Example: Install conda, docker, activate virtual env, etc.

    # Get the list of nodes for a given job
    NODE_LIST=$(scontrol show jobid=$SLURM_JOBID |  # Show details of the Slurm job
        awk -F= '/NodeList=/{print $2}' |           # Extract the NodeList field
        grep -v Exc)                                # Exclude nodes marked as excluded

    # Determine the master node from the node list
    MASTER_NODE=$(scontrol show hostname $NODE_LIST |  # Convert the node list to hostnames
        head -n 1)                                     # Select the first hostname as the master node

    # Get the master node address
    MASTER_ADDR=$(scontrol show node=$MASTER_NODE |  # Show node information
        awk -F= '/NodeAddr=/{print $2}' |            # Extract the NodeAddr field
        awk '{print $1}')                            # Print the first part of NodeAddr

    # Torchrun command to launch the training job
    torchrun_cmd="torchrun --nnodes=$SLURM_NNODES \
        --nproc_per_node=1 \
        --node_rank=$SLURM_NODEID \
        --master_addr=$MASTER_ADDR \
        --master_port=1234 \
        <your_training_script.py>"

    # Execute the torchrun command in the 'pytorch' Conda environment,
    # streaming output live
    /opt/conda/bin/conda run --live-stream -n pytorch $torchrun_cmd
    Tip

    You can use the preceding script to add more commands for installing any additional dependencies for your job. However, we recommend that you keep dependency installation scripts in the set of lifecycle scripts that are used during cluster creation. If you use a virtual environment hosted on a shared directory, you can also use this script to activate the virtual environment.

  2. Launch the job with SageMaker HyperPod auto-resume enabled by adding the --auto-resume=1 flag, which indicates that the srun command should be automatically retried in case of a hardware failure.

    Note

    If you have set up a resource allocation using sbatch or salloc, you can run multiple srun commands within the allocation. In the event of a failure, the SageMaker HyperPod auto-resume functionality operates only on the current job step of the srun command with the --auto-resume=1 flag. In other words, activating auto-resume in an srun command doesn't apply to other srun commands launched within the same resource allocation session; see the sketch after this procedure.

    The following are srun command examples with auto-resume enabled.

    Using sbatch

    Because most of the logic for setting up the environment is already in train_auto_resume.sh, the batch script should be simple and similar to the following code example. Assume that the following batch script is saved as batch.sh.

    #!/bin/bash
    #SBATCH --nodes 2
    #SBATCH --exclusive
    srun --auto-resume=1 train_auto_resume.sh

    Run the preceding batch script using the following command.

    sbatch batch.sh

    Using salloc

    Start by acquiring an exclusive allocation, and run the srun command with the --auto-resume flag and the entrypoint script.

    salloc -N 2 --exclusive
    srun --auto-resume=1 train_auto_resume.sh
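
To illustrate the scope of the --auto-resume=1 flag described in the note above, the following is a minimal sketch of a batch script that runs two job steps in the same allocation; the prepare_data.sh script name is a hypothetical placeholder. Only the second srun command, which carries the --auto-resume=1 flag, is retried by HyperPod after faulty nodes are replaced.

#!/bin/bash
#SBATCH --nodes 2
#SBATCH --exclusive

# Hypothetical setup step: not retried on a hardware failure (no --auto-resume flag)
srun prepare_data.sh

# Training step: automatically retried after the faulty nodes are replaced
srun --auto-resume=1 train_auto_resume.sh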

How to replace a faulty node not being auto-resumed by HyperPod

The HyperPod auto-resume functionality monitors whether the state of your Slurm nodes changes to fail or down. You can check the state of Slurm nodes by running sinfo.
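
For example, the following commands list node states. This is a minimal sketch; adjust the state list as needed for your cluster.

# Show all nodes and their current states
sinfo -N -l

# Show only nodes that are currently in the down or fail states
sinfo -N --states=down,fail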

If you have a node that is stuck with an issue that the HyperPod auto-resume functionality doesn't fix, we recommend that you run the following command to change the state of the node to fail.

scontrol update node=<ip-ipv4> state=fail reason="Action:Replace"

In the preceding command example, replace <ip-ipv4> with the Slurm node name (host name) of the faulty instance you want to replace.

After running this command, the node enters the fail state, waits for the currently running jobs to finish, is replaced with a healthy instance, and recovers with the same host name. This process takes time depending on the available instances in your Availability Zone and the time it takes to run your lifecycle scripts. During the update and replacement processes, avoid changing the state of the node manually again or restarting the Slurm controller; doing so can lead to a replacement failure. If the node doesn't recover or return to the idle state after a long time, contact AWS Support.
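
While the replacement is in progress, you can periodically check the node's state without changing it, for example with the following command. This is a minimal sketch; replace <ip-ipv4> with the Slurm node name of the instance being replaced.

# Periodically check the Slurm state of the node being replaced (read-only)
watch -n 30 "scontrol show node=<ip-ipv4> | grep -i state"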

If the faulty node remains stuck in the fail state, as a last resort you can manually force the node state to down. This requires administrator privileges (sudo permissions).

Warning

Proceed carefully before you run the following command, as it forcefully kills all jobs, and you might lose all unsaved work.

scontrol update node=<ip-ipv4> state=down reason="Action:Replace"