Use the following frequently asked questions to troubleshoot problems with using SageMaker HyperPod.
Q. Why can I not find log groups of my SageMaker HyperPod cluster in Amazon CloudWatch?
By default, agent logs and instance start-up logs are sent to the HyperPod platform account's CloudWatch. For user lifecycle scripts, lifecycle configuration logs are sent to your account's CloudWatch.
If you use the sample lifecycle scripts provided by the HyperPod service team, you can expect to find the lifecycle configuration logs written to /var/log/provision/provisioning.log, and you wouldn't encounter this problem.
However, if you use custom paths for collecting logs from lifecycle provisioning and can't find the log groups in your account's CloudWatch, it might be because the log file paths specified in your lifecycle scripts don't match the paths that the CloudWatch agent running on the HyperPod cluster instances looks for. In this case, you need to set up your lifecycle scripts to send logs to the CloudWatch agent, and also set up the CloudWatch agent configuration accordingly. To resolve the problem, choose one of the following options.
- Option 1: Update your lifecycle scripts to write logs to /var/log/provision/provisioning.log. The sketch after this list shows one way to do this.
- Option 2: Update the CloudWatch agent to look for your custom paths for logging lifecycle provisioning. To do this, complete the steps that follow the sketch.
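For Option 1, the following is a minimal sketch of how a lifecycle script can send its output to the default path that the CloudWatch agent watches. The redirection style and where you add it depend on how your lifecycle scripts are structured, so treat this as an illustration rather than the official HyperPod sample code.
#!/bin/bash
# Append both stdout and stderr of this lifecycle script to the provisioning log
# that the CloudWatch agent collects by default.
LOG_FILE=/var/log/provision/provisioning.log
mkdir -p /var/log/provision
exec > >(tee -a "$LOG_FILE") 2>&1

echo "Starting lifecycle provisioning steps..."
# ... your provisioning commands here ...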
To follow Option 2, complete these steps.
- Each HyperPod cluster instance contains a CloudWatch agent configuration file in JSON format at /opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json. In the configuration file, find the field name logs.logs_collected.files.collect_list.file_path. With the default setup by HyperPod, the key-value pair should be "file_path": "/var/log/provision/provisioning.log", as documented at Logging SageMaker HyperPod at instance level. The following code snippet shows how the JSON file looks with the HyperPod default configuration.
"logs": {
    "logs_collected": {
        "files": {
            "collect_list": [
                {
                    "file_path": "/var/log/provision/provisioning.log",
                    "log_group_name": "/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]",
                    "log_stream_name": "LifecycleConfig/[InstanceGroupName]/{instance_id}",
                    "retention_in_days": -1
                }
            ]
        }
    },
    "force_flush_interval": 3
}
- Replace the value for the "file_path" field name with the custom path you use in your lifecycle scripts. For example, if you have set up your lifecycle scripts to write to /var/log/custom-provision/custom-provisioning.log, update the value to match it as follows.
"file_path": "/var/log/custom-provision/custom-provisioning.log"
- Restart the CloudWatch agent with the configuration file to finish applying the custom path. For example, the following command shows how to restart the CloudWatch agent with the CloudWatch agent configuration file from the first step. For more information, see also Troubleshooting the CloudWatch agent.
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s -c \
    file:/opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json
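After the restart, you can confirm that the agent picked up the configuration and that the log group shows up in your account. The following commands are a minimal sketch: the status action of amazon-cloudwatch-agent-ctl and the aws logs CLI are standard tooling, and the log group name prefix simply follows the HyperPod naming pattern shown in the first step.
# Confirm that the CloudWatch agent is running with the updated configuration.
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status

# From an environment with AWS credentials, list the HyperPod cluster log groups.
aws logs describe-log-groups --log-group-name-prefix /aws/sagemaker/Clusters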
Q. What particular configurations does HyperPod manage in Slurm configuration files, such as slurm.conf and gres.conf?
When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the slurm.conf and gres.conf files at /opt/slurm/etc/ to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows which specific parameters the HyperPod agent handles and overwrites.
Important
We strongly recommend that you DON’T change these parameters managed by HyperPod.
- In slurm.conf, HyperPod sets up the following basic parameters: ClusterName, SlurmctldHost, PartitionName, and NodeName. Also, to enable the Auto-resume functionality, HyperPod requires the TaskPlugin and SchedulerParameters parameters to be set as follows. The HyperPod agent sets up these two parameters with the required values by default. An illustrative excerpt follows this list.
TaskPlugin=task/none
SchedulerParameters=permit_job_expansion
- In gres.conf, HyperPod manages NodeName for GPU nodes.
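The following excerpt only illustrates what such HyperPod-managed entries can look like in slurm.conf and gres.conf. The cluster name, host names, partition name, CPU count, GPU count, and device paths are hypothetical placeholders, not values that HyperPod guarantees, so don't copy them into a managed cluster.
# slurm.conf (illustrative placeholders)
ClusterName=my-hyperpod-cluster
SlurmctldHost=ip-10-1-0-10
PartitionName=dev Nodes=ip-10-1-0-[11-14] Default=YES MaxTime=INFINITE State=UP
NodeName=ip-10-1-0-[11-14] CPUs=96 Gres=gpu:8 State=UNKNOWN

# gres.conf (illustrative placeholders)
NodeName=ip-10-1-0-[11-14] Name=gpu File=/dev/nvidia[0-7]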
Q. How do I run Docker on Slurm nodes on HyperPod?
To help you run Docker on your Slurm nodes running on HyperPod, the HyperPod service team provides setup scripts that you can include as part of the lifecycle configuration for cluster creation. To learn more, see Start with base lifecycle scripts provided by HyperPod and Run Docker containers on a Slurm compute node on HyperPod.
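Assuming the setup scripts from those pages have run and the submitting user is allowed to invoke Docker on the compute nodes, a quick smoke test might look like the following sketch.
# Run a throwaway container on one Slurm compute node to verify the Docker setup.
srun --nodes=1 docker run --rm hello-world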
Q. Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) on SageMaker HyperPod platform on the Slurm framework?
By default, the Linux OS enables the RemoveIPC=yes setting (in /etc/systemd/logind.conf, the default appears commented out as #RemoveIPC=yes). Slurm and mpirun jobs that use NCCL generate inter-process communication (IPC) resources under non-root user sessions. These user sessions might log out during the job process. When you run jobs with Slurm or mpirun, if systemd detects that the user isn't logged in, it cleans up the IPC resources. Slurm and mpirun jobs can run without the user being logged in, but that requires that you disable cleanup at the systemd level and set it up at the Slurm level instead. For more information, see Systemd in the NCCL documentation.
To disable cleanup at the systemd level, complete the following steps.
- Set RemoveIPC=no in the /etc/systemd/logind.conf file (removing the leading # so the setting isn't commented out) if you're running training jobs that use Slurm and NCCL. The sketch after these steps shows one way to apply the change.
- By default, Slurm doesn't clean up shared resources. We recommend that you set up a Slurm epilog script to clean up shared resources after training jobs. This cleanup is useful when your jobs leave behind a lot of shared resources. The following is an example script.
#!/bin/bash
: <<'SUMMARY'
Script: epilog.sh
Use this script with caution, as it can potentially delete unnecessary resources and cause issues if you don't use it correctly.
Note: You must save this script in a shared location that is accessible to all nodes in the cluster, such as the /fsx volume. Workers must be able to access the script to run it after jobs.
SUMMARY

# Define the log directory and create it if it doesn't exist
LOG_DIR="/<PLACEHOLDER>/epilogue" # NOTE: Update PLACEHOLDER to a shared path, such as /fsx/epilogue.
mkdir -p "$LOG_DIR"

# Name the log file using the Slurm job name and job ID
log_file="$LOG_DIR/epilogue-${SLURM_JOB_NAME}_${SLURM_JOB_ID}.log"

logging() {
    echo "[$(date)] $1" | tee -a "$log_file"
}

# Slurm epilogue script to clean up IPC resources
logging "Starting IPC cleanup for Job $SLURM_JOB_ID"

# Clean up shared memory segments by username
for seg in $(ipcs -m | awk -v owner="$SLURM_JOB_USER" '$3 == owner {print $2}'); do
    if ipcrm -m "$seg"; then
        logging "Removed shared memory segment $seg"
    else
        logging "Failed to remove shared memory segment $seg"
    fi
done

# Clean up semaphores by username
for sem in $(ipcs -s | awk -v user="$SLURM_JOB_USER" '$3 == user {print $2}'); do
    if ipcrm -s "$sem"; then
        logging "Removed semaphore $sem"
    else
        logging "Failed to remove semaphore $sem"
    fi
done

# Clean up NCCL IPC files
NCCL_IPC_PATH="/dev/shm/nccl-*"
for file in $NCCL_IPC_PATH; do
    if [ -e "$file" ]; then
        if rm "$file"; then
            logging "Removed NCCL IPC file $file"
        else
            logging "Failed to remove NCCL IPC file $file"
        fi
    fi
done

logging "IPC cleanup completed for Job $SLURM_JOB_ID"
exit 0
For more information about the Epilog parameter, see the Slurm documentation.
- In the slurm.conf file on the controller node, add a line that points to the epilog script you created.
Epilog="/path/to/epilog.sh" # For example: /fsx/epilogue/epilog.sh
- Run the following commands to change the ownership of the script and make it executable.
chown slurm:slurm /path/to/epilog.sh
chmod +x /path/to/epilog.sh
- To apply all of your changes, run scontrol reconfigure.
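For the first step, the following is a minimal sketch of one way to apply RemoveIPC=no on a node. The sed expression and the restart of systemd-logind are assumptions about your environment rather than an official HyperPod procedure, so adapt them to however you manage your nodes (for example, through your lifecycle scripts).
# Uncomment the setting if needed and force it to RemoveIPC=no.
sudo sed -i 's/^#\?RemoveIPC=.*/RemoveIPC=no/' /etc/systemd/logind.conf

# Restart systemd-logind so the change takes effect; rebooting the node also works.
sudo systemctl restart systemd-logind

# Confirm the resulting value.
grep RemoveIPC /etc/systemd/logind.conf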
Q. How do I use the local NVMe store of P instances for launching Docker or Enroot containers with Slurm?
Because the default root volume of your head node is usually limited to a 100 GB EBS volume, you need to set up Docker and Enroot to use the local NVMe instance store. To learn how to set up the NVMe store and use it for launching Docker containers, see Run Docker containers on a Slurm compute node on HyperPod.
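As a rough illustration of the Docker part of that setup, the following sketch points Docker's data directory at a hypothetical NVMe-backed mount point. The /opt/nvme path is a placeholder rather than a path that HyperPod creates, and the referenced setup scripts handle this more completely, so prefer them for real clusters.
# Assume the NVMe instance store is already formatted and mounted at /opt/nvme (placeholder).
sudo mkdir -p /opt/nvme/docker

# Point Docker's data-root at the NVMe-backed directory.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "data-root": "/opt/nvme/docker"
}
EOF

# Restart Docker and confirm where it now stores images and containers.
sudo systemctl restart docker
docker info | grep "Docker Root Dir"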
Q. How do I set up EFA security groups?
If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. To learn more, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
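As a sketch of what such a self-referencing rule can look like with the AWS CLI, the following commands add all-traffic inbound and outbound rules that reference the security group itself. The sg-0123456789abcdef0 value is a placeholder ID, and the EC2 User Guide page linked above remains the authoritative procedure.
SG_ID=sg-0123456789abcdef0   # Placeholder: your EFA-enabled security group ID.

# Allow all inbound traffic from the security group itself.
aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"

# Allow all outbound traffic to the security group itself.
aws ec2 authorize-security-group-egress \
    --group-id "$SG_ID" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"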
Q. How do I monitor my HyperPod cluster nodes? Are there any CloudWatch metrics exported from HyperPod?
To gain observability into the resource utilization of your HyperPod cluster, we recommend that you integrate the HyperPod cluster with Amazon Managed Grafana and Amazon Managed Service for Prometheus. With various open-source Grafana dashboards and exporter packages, you can export and visualize metrics related to the HyperPod cluster resources. To learn more about setting up SageMaker HyperPod with Amazon Managed Grafana and Amazon Managed Service for Prometheus, see SageMaker HyperPod cluster resources monitoring. Note that SageMaker HyperPod currently doesn't support exporting system metrics to Amazon CloudWatch.
Q. Can I add additional storage to the HyperPod cluster nodes? The cluster instances have limited local instance store.
If the default instance storage is insufficient for your workload, you can configure additional storage per instance. Starting with the release on June 20, 2024, you can add an additional Amazon Elastic Block Store (EBS) volume to each instance in your SageMaker HyperPod cluster. This capability can't be applied to the existing instance groups of SageMaker HyperPod clusters created before June 20, 2024; however, you can use it with those clusters by patching them and adding new instance groups. The capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.
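The following is a hypothetical sketch of how an instance group with an additional EBS volume might be declared when updating a cluster with the AWS CLI. The InstanceStorageConfigs shape is an assumption based on the SageMaker CreateCluster and UpdateCluster APIs, so verify the field names against the API reference; the cluster name, instance group details, role ARN, and S3 URI are placeholders.
# Placeholder instance-group definition that requests a 500 GB additional EBS volume.
cat > instance-groups.json <<'EOF'
[
    {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
        "InstanceStorageConfigs": [
            { "EbsVolumeConfig": { "VolumeSizeInGB": 500 } }
        ]
    }
]
EOF

# Apply the instance-group definition to an existing cluster.
aws sagemaker update-cluster \
    --cluster-name my-hyperpod-cluster \
    --instance-groups file://instance-groups.json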