Use the following frequently asked questions to troubleshoot problems with using SageMaker HyperPod.
Q. Why can I not find log groups of my SageMaker HyperPod cluster in Amazon CloudWatch?
By default, agent logs and instance start-up logs are sent to the HyperPod platform account's CloudWatch. For user lifecycle scripts, lifecycle configuration logs are sent to your account's CloudWatch.
If you use the sample lifecycle scripts provided by the HyperPod service team, you can expect to find the lifecycle configuration logs written to /var/log/provision/provisioning.log, and you wouldn't encounter this problem.
However, if you use custom paths for collecting logs from lifecycle provisioning and can't find the log groups in your account's CloudWatch, it might be because the log file paths specified in your lifecycle scripts don't match the paths that the CloudWatch agent running on the HyperPod cluster instances looks for. In this case, you need to set up your lifecycle scripts to send logs to the CloudWatch agent, and also set up the CloudWatch agent configuration accordingly. To resolve the problem, choose one of the following options.
- Option 1: Update your lifecycle scripts to write logs to /var/log/provision/provisioning.log. The sketch after this list shows one way to do this.
- Option 2: Update the CloudWatch agent to look for your custom paths for logging lifecycle provisioning. To do this, complete the steps that follow the sketch.
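For Option 1, the following is a minimal sketch of how a lifecycle script can send its output to the default path that the CloudWatch agent watches. The redirection style and where you add it depend on how your lifecycle scripts are structured, so treat this as an illustration rather than the official HyperPod sample code.
#!/bin/bash
# Append both stdout and stderr of this lifecycle script to the provisioning log
# that the CloudWatch agent collects by default.
LOG_FILE=/var/log/provision/provisioning.log
mkdir -p /var/log/provision
exec > >(tee -a "$LOG_FILE") 2>&1

echo "Starting lifecycle provisioning steps..."
# ... your provisioning commands here ...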
To follow Option 2, complete these steps.
- Each HyperPod cluster instance contains a CloudWatch agent configuration file in JSON format at /opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json. In the configuration file, find the field name logs.logs_collected.files.collect_list.file_path. With the default setup by HyperPod, the key-value pair should be "file_path": "/var/log/provision/provisioning.log", as documented at Logging SageMaker HyperPod at instance level. The following code snippet shows how the JSON file looks with the HyperPod default configuration.
"logs": {
    "logs_collected": {
        "files": {
            "collect_list": [
                {
                    "file_path": "/var/log/provision/provisioning.log",
                    "log_group_name": "/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]",
                    "log_stream_name": "LifecycleConfig/[InstanceGroupName]/{instance_id}",
                    "retention_in_days": -1
                }
            ]
        }
    },
    "force_flush_interval": 3
}
- Replace the value for the "file_path" field name with the custom path you use in your lifecycle scripts. For example, if you have set up your lifecycle scripts to write to /var/log/custom-provision/custom-provisioning.log, update the value to match it as follows.
"file_path": "/var/log/custom-provision/custom-provisioning.log"
- Restart the CloudWatch agent with the configuration file to finish applying the custom path. For example, the following command shows how to restart the CloudWatch agent with the CloudWatch agent configuration file from the first step. For more information, see also Troubleshooting the CloudWatch agent.
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s -c \
    file:/opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json
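After the restart, you can confirm that the agent picked up the configuration and that the log group shows up in your account. The following commands are a minimal sketch: the status action of amazon-cloudwatch-agent-ctl and the aws logs CLI are standard tooling, and the log group name prefix simply follows the HyperPod naming pattern shown in the first step.
# Confirm that the CloudWatch agent is running with the updated configuration.
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status

# From an environment with AWS credentials, list the HyperPod cluster log groups.
aws logs describe-log-groups --log-group-name-prefix /aws/sagemaker/Clusters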
Q. What particular configurations does HyperPod manage in Slurm configuration files, such as slurm.conf and gres.conf?
When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the slurm.conf and gres.conf files at /opt/slurm/etc/ to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows which specific parameters the HyperPod agent handles and overwrites.
Important
We strongly recommend that you DON’T change these parameters managed by HyperPod.
- In slurm.conf, HyperPod sets up the following basic parameters: ClusterName, SlurmctldHost, PartitionName, and NodeName. Also, to enable the Auto-resume functionality, HyperPod requires the TaskPlugin and SchedulerParameters parameters to be set as follows. The HyperPod agent sets up these two parameters with the required values by default. An illustrative excerpt follows this list.
TaskPlugin=task/none
SchedulerParameters=permit_job_expansion
- In gres.conf, HyperPod manages NodeName for GPU nodes.
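The following excerpt only illustrates what such HyperPod-managed entries can look like in slurm.conf and gres.conf. The cluster name, host names, partition name, CPU count, GPU count, and device paths are hypothetical placeholders, not values that HyperPod guarantees, so don't copy them into a managed cluster.
# slurm.conf (illustrative placeholders)
ClusterName=my-hyperpod-cluster
SlurmctldHost=ip-10-1-0-10
PartitionName=dev Nodes=ip-10-1-0-[11-14] Default=YES MaxTime=INFINITE State=UP
NodeName=ip-10-1-0-[11-14] CPUs=96 Gres=gpu:8 State=UNKNOWN

# gres.conf (illustrative placeholders)
NodeName=ip-10-1-0-[11-14] Name=gpu File=/dev/nvidia[0-7]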
Q. How do I run Docker on Slurm nodes on HyperPod?
To help you run Docker on your Slurm nodes running on HyperPod, the HyperPod service team provides setup scripts that you can include as part of the lifecycle configuration for cluster creation. To learn more, see Start with base lifecycle scripts provided by HyperPod and Run Docker containers on a Slurm compute node on HyperPod.
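Assuming the setup scripts from those pages have run and the submitting user is allowed to invoke Docker on the compute nodes, a quick smoke test might look like the following sketch.
# Run a throwaway container on one Slurm compute node to verify the Docker setup.
srun --nodes=1 docker run --rm hello-world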
Q. Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) on SageMaker HyperPod platform on the Slurm framework?
By default, the Linux OS enables the RemoveIPC=yes setting (in /etc/systemd/logind.conf, the default appears commented out as #RemoveIPC=yes). Slurm and mpirun jobs that use NCCL generate inter-process communication (IPC) resources under non-root user sessions. These user sessions might log out during the job process. When you run jobs with Slurm or mpirun, if systemd detects that the user isn't logged in, it cleans up the IPC resources. Slurm and mpirun jobs can run without the user being logged in, but that requires that you disable cleanup at the systemd level and set it up at the Slurm level instead. For more information, see Systemd in the NCCL documentation.
To disable cleanup at the systemd level, complete the following steps.
- Set RemoveIPC=no in the /etc/systemd/logind.conf file (removing the leading # so the setting isn't commented out) if you're running training jobs that use Slurm and NCCL. The sketch after these steps shows one way to apply the change.
- By default, Slurm doesn't clean up shared resources. We recommend that you set up a Slurm epilog script to clean up shared resources after training jobs. This cleanup is useful when your jobs leave behind a lot of shared resources. The following is an example script.
#!/bin/bash
: <<'SUMMARY'
Script: epilog.sh
Use this script with caution, as it can potentially delete unnecessary resources and cause issues if you don't use it correctly.
Note: You must save this script in a shared location that is accessible to all nodes in the cluster, such as the /fsx volume. Workers must be able to access the script to run it after jobs.
SUMMARY

# Define the log directory and create it if it doesn't exist
LOG_DIR="/<PLACEHOLDER>/epilogue" # NOTE: Update PLACEHOLDER to a shared path, such as /fsx/epilogue.
mkdir -p "$LOG_DIR"

# Name the log file using the Slurm job name and job ID
log_file="$LOG_DIR/epilogue-${SLURM_JOB_NAME}_${SLURM_JOB_ID}.log"

logging() {
    echo "[$(date)] $1" | tee -a "$log_file"
}

# Slurm epilogue script to clean up IPC resources
logging "Starting IPC cleanup for Job $SLURM_JOB_ID"

# Clean up shared memory segments by username
for seg in $(ipcs -m | awk -v owner="$SLURM_JOB_USER" '$3 == owner {print $2}'); do
    if ipcrm -m "$seg"; then
        logging "Removed shared memory segment $seg"
    else
        logging "Failed to remove shared memory segment $seg"
    fi
done

# Clean up semaphores by username
for sem in $(ipcs -s | awk -v user="$SLURM_JOB_USER" '$3 == user {print $2}'); do
    if ipcrm -s "$sem"; then
        logging "Removed semaphore $sem"
    else
        logging "Failed to remove semaphore $sem"
    fi
done

# Clean up NCCL IPC files
NCCL_IPC_PATH="/dev/shm/nccl-*"
for file in $NCCL_IPC_PATH; do
    if [ -e "$file" ]; then
        if rm "$file"; then
            logging "Removed NCCL IPC file $file"
        else
            logging "Failed to remove NCCL IPC file $file"
        fi
    fi
done

logging "IPC cleanup completed for Job $SLURM_JOB_ID"
exit 0
For more information about the Epilog parameter, see the Slurm documentation.
- In the slurm.conf file on the controller node, add a line that points to the epilog script you created.
Epilog="/path/to/epilog.sh" # For example: /fsx/epilogue/epilog.sh
- Run the following commands to change the ownership of the script and make it executable.
chown slurm:slurm /path/to/epilog.sh
chmod +x /path/to/epilog.sh
- To apply all of your changes, run scontrol reconfigure.
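For the first step, the following is a minimal sketch of one way to apply RemoveIPC=no on a node. The sed expression and the restart of systemd-logind are assumptions about your environment rather than an official HyperPod procedure, so adapt them to however you manage your nodes (for example, through your lifecycle scripts).
# Uncomment the setting if needed and force it to RemoveIPC=no.
sudo sed -i 's/^#\?RemoveIPC=.*/RemoveIPC=no/' /etc/systemd/logind.conf

# Restart systemd-logind so the change takes effect; rebooting the node also works.
sudo systemctl restart systemd-logind

# Confirm the resulting value.
grep RemoveIPC /etc/systemd/logind.conf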
Q. How do I use the local NVMe store of P instances for launching Docker or Enroot containers with Slurm?
Because the default root volume of your head node is usually limited to a 100 GB EBS volume, you need to set up Docker and Enroot to use the local NVMe instance store. To learn how to set up the NVMe store and use it for launching Docker containers, see Run Docker containers on a Slurm compute node on HyperPod.
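As a rough illustration of the Docker part of that setup, the following sketch points Docker's data directory at a hypothetical NVMe-backed mount point. The /opt/nvme path is a placeholder rather than a path that HyperPod creates, and the referenced setup scripts handle this more completely, so prefer them for real clusters.
# Assume the NVMe instance store is already formatted and mounted at /opt/nvme (placeholder).
sudo mkdir -p /opt/nvme/docker

# Point Docker's data-root at the NVMe-backed directory.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "data-root": "/opt/nvme/docker"
}
EOF

# Restart Docker and confirm where it now stores images and containers.
sudo systemctl restart docker
docker info | grep "Docker Root Dir"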
Q. How do I set up EFA security groups?
If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. To learn more, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
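As a sketch of what such a self-referencing rule can look like with the AWS CLI, the following commands add all-traffic inbound and outbound rules that reference the security group itself. The sg-0123456789abcdef0 value is a placeholder ID, and the EC2 User Guide page linked above remains the authoritative procedure.
SG_ID=sg-0123456789abcdef0   # Placeholder: your EFA-enabled security group ID.

# Allow all inbound traffic from the security group itself.
aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"

# Allow all outbound traffic to the security group itself.
aws ec2 authorize-security-group-egress \
    --group-id "$SG_ID" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"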
Q. How do I monitor my HyperPod cluster nodes? Are there any CloudWatch metrics exported from HyperPod?
To gain observability into the resource utilization of your HyperPod cluster, we recommend that you integrate the HyperPod cluster with Amazon Managed Grafana and Amazon Managed Service for Prometheus. With various open-source Grafana dashboards and exporter packages, you can export and visualize metrics related to the HyperPod cluster resources. To learn more about setting up SageMaker HyperPod with Amazon Managed Grafana and Amazon Managed Service for Prometheus, see SageMaker HyperPod cluster resources monitoring. Note that SageMaker HyperPod currently doesn't support exporting system metrics to Amazon CloudWatch.
Q. Can I add additional storage to the HyperPod cluster nodes? The cluster instances have limited local instance store.
If the default instance storage is insufficient for your workload, you can configure additional storage per instance. Starting with the release on June 20, 2024, you can add an additional Amazon Elastic Block Store (EBS) volume to each instance in your SageMaker HyperPod cluster. This capability can't be applied to the existing instance groups of SageMaker HyperPod clusters created before June 20, 2024; however, you can use it with those clusters by patching them and adding new instance groups. The capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.
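The following is a hypothetical sketch of how an instance group with an additional EBS volume might be declared when updating a cluster with the AWS CLI. The InstanceStorageConfigs shape is an assumption based on the SageMaker CreateCluster and UpdateCluster APIs, so verify the field names against the API reference; the cluster name, instance group details, role ARN, and S3 URI are placeholders.
# Placeholder instance-group definition that requests a 500 GB additional EBS volume.
cat > instance-groups.json <<'EOF'
[
    {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
        "InstanceStorageConfigs": [
            { "EbsVolumeConfig": { "VolumeSizeInGB": 500 } }
        ]
    }
]
EOF

# Apply the instance-group definition to an existing cluster.
aws sagemaker update-cluster \
    --cluster-name my-hyperpod-cluster \
    --instance-groups file://instance-groups.json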