

# SageMaker HyperPod multi-head node support

You can create multiple controller (head) nodes in a single SageMaker HyperPod Slurm cluster, with one serving as the primary controller node and the others serving as backup controller nodes. The primary controller node is responsible for controlling the compute (worker) nodes and handling Slurm operations. The backup controller nodes constantly monitor the primary controller node. If the primary controller node fails or becomes unresponsive, one of the backup controller nodes will automatically take over as the new primary controller node.

Configuring multiple controller nodes in SageMaker HyperPod Slurm clusters provides several key benefits. It eliminates the risk of a single controller node becoming a single point of failure, enables automatic failover to backup controller nodes for faster recovery, and allows you to manage your own accounting database and Slurm configuration independently.

## Key concepts


The following sections describe the key concepts related to SageMaker HyperPod support for multiple controller (head) nodes in Slurm clusters.

**Controller node**

A controller node is an Amazon EC2 instance within a cluster that runs critical Slurm services for managing and coordinating the cluster's operations. Specifically, it hosts the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) and the [Slurm database daemon (slurmdbd)](https://slurm.schedmd.com/slurmdbd.html). A controller node is also known as a head node.

**Primary controller node**

A primary controller node is the active controller node in a Slurm cluster. It is identified by Slurm as the primary controller responsible for managing the cluster. The primary controller node receives and executes commands from users to control and allocate resources on the compute nodes for running jobs.

**Backup controller node**

A backup controller node is a standby controller node in a Slurm cluster. It is identified by Slurm as a backup controller that is not currently managing the cluster. The backup controller node runs the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) in standby mode. Any controller commands executed on a backup controller node are propagated to the primary controller node for execution. Its primary purpose is to continuously monitor the primary controller node and take over its responsibilities if the primary controller node fails or becomes unresponsive.

**Compute node**

A compute node is an Amazon EC2 instance within a cluster that hosts the [Slurm worker daemon (slurmd)](https://slurm.schedmd.com/slurmd.html). The compute node's primary function is to execute jobs assigned by the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) running on the primary controller node. When a job is scheduled, the compute node receives instructions from the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) to carry out the necessary tasks and computations for that job within the node itself. A compute node is also known as a worker node.
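On a running cluster, you can see which controller is active with Slurm's own tooling. The following is a minimal sketch that parses the output of `scontrol ping`; it assumes the one-line-per-controller output format (`Slurmctld(primary) at host is UP`), which can vary across Slurm versions, and the function name is ours.

```shell
# Print UP or DOWN for the primary controller, given `scontrol ping`
# output on stdin. Assumes one line per controller, for example:
#   Slurmctld(primary) at head-1 is UP
#   Slurmctld(backup) at head-2 is UP
primary_controller_status() {
    awk '/^Slurmctld\(primary\)/ { print $NF }'
}

# Usage on a controller or login node:
#   scontrol ping | primary_controller_status
```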

## How it works


The following diagram illustrates how different AWS services work together to support the multiple controller (head) nodes architecture for SageMaker HyperPod Slurm clusters.

![SageMaker HyperPod multi-head nodes architecture diagram](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-multihead-architecture.png)


The AWS services that work together to support the SageMaker HyperPod multiple controller (head) nodes architecture include the following.


**AWS services that work together to support the SageMaker HyperPod multiple controller nodes architecture**  

| Service | Description | 
| --- | --- | 
| IAM (AWS Identity and Access Management) | Defines two IAM roles that control access permissions: one for the compute node instance group and one for the controller node instance group. | 
| Amazon RDS for MariaDB | Stores accounting data for Slurm, which holds job records and metering data. | 
| AWS Secrets Manager | Stores and manages credentials that can be accessed by Amazon FSx for Lustre. | 
| Amazon FSx for Lustre  | Stores Slurm configurations and runtime state. | 
| Amazon VPC | Provides an isolated network environment where the HyperPod cluster and its resources are deployed. | 
| Amazon SNS  | Sends notifications to administrators when there are status changes (Slurm controller is ON or OFF) related to the primary controller (head) node. | 

The HyperPod cluster itself consists of controller nodes (primary and backup) and compute nodes. The controller nodes run the Slurm controller (slurmctld) and database (slurmdbd) daemons, which manage and monitor the workload across the compute nodes.

The controller nodes access Slurm configurations and runtime state stored in the Amazon FSx for Lustre file system. The Slurm accounting data is stored in the Amazon RDS for MariaDB database. AWS Secrets Manager provides secure access to the database credentials for the controller nodes.

If there is a status change (Slurm controller is `ON` or `OFF`) in the Slurm controller nodes, Amazon SNS sends notifications to the admin for further action.

This multiple controller nodes architecture eliminates the single point of failure of a single controller (head) node, enables fast and automatic failover recovery, and gives you control over the Slurm accounting database and configurations.

# Setting up multiple controller nodes for a SageMaker HyperPod Slurm cluster

This topic explains how to configure multiple controller (head) nodes in a SageMaker HyperPod Slurm cluster using lifecycle scripts. Before you start, review the prerequisites listed in [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md) and familiarize yourself with the lifecycle scripts in [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md). The instructions in this topic use AWS CLI commands in an Amazon Linux environment. Note that the environment variables used in these commands are available only in the current shell session unless explicitly preserved.

**Topics**
+ [Provisioning resources using CloudFormation stacks](sagemaker-hyperpod-multihead-slurm-cfn.md)
+ [Creating and attaching an IAM policy](sagemaker-hyperpod-multihead-slurm-iam.md)
+ [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md)
+ [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-multihead-slurm-create.md)
+ [Considering important notes](sagemaker-hyperpod-multihead-slurm-notes.md)
+ [Reviewing environment variables reference](sagemaker-hyperpod-multihead-slurm-variables-reference.md)

# Provisioning resources using CloudFormation stacks

To set up multiple controller nodes in a HyperPod Slurm cluster, provision AWS resources through two CloudFormation stacks: [Provision basic resources](#sagemaker-hyperpod-multihead-slurm-cfn-basic) and [Provision additional resources to support multiple controller nodes](#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

## Provision basic resources


Follow these steps to provision basic resources for your Amazon SageMaker HyperPod Slurm cluster.

1. Download the [sagemaker-hyperpod.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod.yaml) template file to your machine. This YAML file is a CloudFormation template that defines the following resources for your Slurm cluster.
   + An execution IAM role for the compute node instance group
   + An Amazon S3 bucket to store the lifecycle scripts
   + Public and private subnets (private subnets have internet access through NAT gateways)
   + Internet Gateway/NAT gateways
   + Two Amazon EC2 security groups
   + An Amazon FSx volume to store configuration files

1. Run the following CLI command to create a CloudFormation stack named `sagemaker-hyperpod`. Define the Availability Zone (AZ) IDs for your cluster in `PrimarySubnetAZ` and `BackupSubnetAZ`. For example, *use1-az4* is an AZ ID for an Availability Zone in the `us-east-1` Region. For more information, see [Availability Zone IDs](https://docs.aws.amazon.com//ram/latest/userguide/working-with-az-ids.html) and [Setting up SageMaker HyperPod clusters across multiple AZs](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-multiple-availability-zones).

   ```
   aws cloudformation deploy \
   --template-file /path_to_template/sagemaker-hyperpod.yaml \
   --stack-name sagemaker-hyperpod \
   --parameter-overrides PrimarySubnetAZ=use1-az4 BackupSubnetAZ=use1-az1 \
   --capabilities CAPABILITY_IAM
   ```

   For more information, see [deploy](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/deploy/) from the AWS Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

   ```
   Waiting for changeset to be created..
   Waiting for stack create/update to complete
   Successfully created/updated stack - sagemaker-hyperpod
   ```

1. (Optional) Verify the stack in the [CloudFormation console](https://console.aws.amazon.com/cloudformation/home).
   + From the left navigation pane, choose **Stacks**.
   + On the **Stacks** page, find and choose **sagemaker-hyperpod**.
   + Review the **Resources** and **Outputs** tabs to verify the created resources and outputs.

1. Create environment variables from the outputs of the `sagemaker-hyperpod` stack. You will use the values of these variables in [Provision additional resources to support multiple controller nodes](#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

   ```
   source .env
   PRIMARY_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`PrimaryPrivateSubnet`].OutputValue' --output text)
   BACKUP_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`BackupPrivateSubnet`].OutputValue' --output text)
   EMAIL=$(bash -c 'read -p "INPUT YOUR SNSSubEmailAddress HERE: " && echo $REPLY')
   DB_USER_NAME=$(bash -c 'read -p "INPUT YOUR DB_USER_NAME HERE: " && echo $REPLY')
   SECURITY_GROUP=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`SecurityGroup`].OutputValue' --output text)
   ROOT_BUCKET_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonS3BucketName`].OutputValue' --output text)
   SLURM_FSX_DNS_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemDNSname`].OutputValue' --output text)
   SLURM_FSX_MOUNT_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemMountname`].OutputValue' --output text)
   COMPUTE_NODE_ROLE=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonSagemakerClusterExecutionRoleArn`].OutputValue' --output text)
   ```

   When you see prompts asking for your email address and database user name, enter values like the following.

   ```
   INPUT YOUR SNSSubEmailAddress HERE: Email_address_to_receive_SNS_notifications
   INPUT YOUR DB_USER_NAME HERE: Database_user_name_you_define
   ```

   To verify variable values, use the `echo` command.

   ```
   echo $REGION
   us-east-1
   ```
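Before moving on, it can help to confirm that every variable you just created is non-empty; an empty value usually means a mistyped stack name or output key. The following is a small sanity-check sketch (`require_vars` is a name we chose for illustration).

```shell
# Fail loudly if any of the named environment variables is unset or empty.
require_vars() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing or empty: $name" >&2
            missing=1
        fi
    done
    return $missing
}

# Usage:
#   require_vars REGION PRIMARY_SUBNET BACKUP_SUBNET SECURITY_GROUP \
#       ROOT_BUCKET_NAME SLURM_FSX_DNS_NAME SLURM_FSX_MOUNT_NAME COMPUTE_NODE_ROLE
```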

## Provision additional resources to support multiple controller nodes


Follow these steps to provision additional resources for your Amazon SageMaker HyperPod Slurm cluster with multiple controller nodes.

1. Download the [sagemaker-hyperpod-slurm-multi-headnode.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod-slurm-multi-headnode.yaml) template file to your machine. This second YAML file is a CloudFormation template that defines the additional resources required to support multiple controller nodes in your Slurm cluster.
   + An execution IAM role for the controller node instance group
   + An Amazon RDS for MariaDB instance
   + An Amazon SNS topic and subscription
   + AWS Secrets Manager credentials for Amazon RDS for MariaDB

1. Run the following CLI command to create a CloudFormation stack named `sagemaker-hyperpod-mh`. This second stack uses the CloudFormation template to create additional AWS resources to support the multiple controller nodes architecture.

   ```
   aws cloudformation deploy \
   --template-file /path_to_template/sagemaker-hyperpod-slurm-multi-headnode.yaml \
   --stack-name sagemaker-hyperpod-mh \
   --parameter-overrides \
   SlurmDBSecurityGroupId=$SECURITY_GROUP \
   SlurmDBSubnetGroupId1=$PRIMARY_SUBNET \
   SlurmDBSubnetGroupId2=$BACKUP_SUBNET \
   SNSSubEmailAddress=$EMAIL \
   SlurmDBUsername=$DB_USER_NAME \
   --capabilities CAPABILITY_NAMED_IAM
   ```

   For more information, see [deploy](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/deploy/) from the AWS Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

   ```
   Waiting for changeset to be created..
   Waiting for stack create/update to complete
   Successfully created/updated stack - sagemaker-hyperpod-mh
   ```

1. (Optional) Verify the stack in the [CloudFormation console](https://console.aws.amazon.com/cloudformation/home).
   + From the left navigation pane, choose **Stacks**.
   + On the **Stacks** page, find and choose **sagemaker-hyperpod-mh**.
   + Review the **Resources** and **Outputs** tabs to verify the created resources and outputs.

1. Create environment variables from the outputs of the `sagemaker-hyperpod-mh` stack. You will use the values of these variables to update the configuration file (`provisioning_parameters.json`) in [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md).

   ```
   source .env
   SLURM_DB_ENDPOINT_ADDRESS=$(aws --region us-east-1 cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBEndpointAddress`].OutputValue' --output text)
   SLURM_DB_SECRET_ARN=$(aws --region us-east-1 cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBSecretArn`].OutputValue' --output text)
   SLURM_EXECUTION_ROLE_ARN=$(aws --region us-east-1 cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmExecutionRoleArn`].OutputValue' --output text)
   SLURM_SNS_FAILOVER_TOPIC_ARN=$(aws --region us-east-1 cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmFailOverSNSTopicArn`].OutputValue' --output text)
   ```
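The assignments above, and those in the previous section, all repeat the same `describe-stacks` call with a different output key. If you prefer, you can wrap the pattern in a small helper; this is only a convenience sketch, and `stack_output` is a name we chose.

```shell
# Print a single CloudFormation stack output value.
# Arguments: stack name, output key. Uses the $REGION variable
# defined earlier in this tutorial.
stack_output() {
    aws --region "$REGION" cloudformation describe-stacks \
        --stack-name "$1" \
        --query "Stacks[0].Outputs[?OutputKey==\`$2\`].OutputValue" \
        --output text
}

# Usage:
#   SLURM_DB_ENDPOINT_ADDRESS=$(stack_output $MULTI_HEAD_SLURM_STACK SlurmDBEndpointAddress)
```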

# Creating and attaching an IAM policy

This section explains how to create an IAM policy and attach it to the execution role you created in [Provision additional resources to support multiple controller nodes](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

1. Download the [IAM policy example](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/1.AmazonSageMakerClustersExecutionRolePolicy.json) to your machine from the GitHub repository.

1. Create an IAM policy with the downloaded example, using the [create-policy](https://docs.aws.amazon.com//cli/latest/reference/iam/create-policy.html) CLI command.

   ```
   aws --region us-east-1 iam create-policy \
       --policy-name AmazonSagemakerExecutionPolicy \
       --policy-document file://1.AmazonSageMakerClustersExecutionRolePolicy.json
   ```

   The command returns output similar to the following.

   ```
   {
       "Policy": {
           "PolicyName": "AmazonSagemakerExecutionPolicy",
           "PolicyId": "ANPAXISIWY5UYZM7WJR4W",
           "Arn": "arn:aws:iam::111122223333:policy/AmazonSagemakerExecutionPolicy",
           "Path": "/",
           "DefaultVersionId": "v1",
           "AttachmentCount": 0,
           "PermissionsBoundaryUsageCount": 0,
           "IsAttachable": true,
           "CreateDate": "2025-01-22T20:01:21+00:00",
           "UpdateDate": "2025-01-22T20:01:21+00:00"
       }
   }
   ```

1. Attach the policy `AmazonSagemakerExecutionPolicy` to the Slurm execution role you created in [Provision additional resources to support multiple controller nodes](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-multihead), using the [attach-role-policy](https://docs.aws.amazon.com//cli/latest/reference/iam/attach-role-policy.html) CLI command.

   ```
   aws --region us-east-1 iam attach-role-policy \
       --role-name AmazonSagemakerExecutionRole \
       --policy-arn arn:aws:iam::111122223333:policy/AmazonSagemakerExecutionPolicy
   ```

   This command doesn't produce any output.

   (Optional) If you use environment variables, here are the example commands.
   + To get the role name and policy name 

     ```
      POLICY=$(aws --region $REGION iam list-policies --query 'Policies[?PolicyName==`AmazonSagemakerExecutionPolicy`].Arn' --output text)
      ROLENAME=$(aws --region $REGION iam list-roles --query "Roles[?Arn=='${SLURM_EXECUTION_ROLE_ARN}'].RoleName" --output text)
     ```
   + To attach the policy

     ```
      aws --region us-east-1 iam attach-role-policy \
          --role-name $ROLENAME --policy-arn $POLICY
     ```
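To double-check the attachment before continuing, you can list the policies attached to the role. The following is a hedged sketch using the `list-attached-role-policies` CLI command with the same role and policy names as above, wrapped in a function for reuse (the function name is ours).

```shell
# Print the ARN of the named policy if it is attached to the role,
# or nothing if it is not. Arguments: role name, policy name.
attached_policy_arn() {
    aws --region us-east-1 iam list-attached-role-policies \
        --role-name "$1" \
        --query "AttachedPolicies[?PolicyName==\`$2\`].PolicyArn" \
        --output text
}

# Usage:
#   attached_policy_arn AmazonSagemakerExecutionRole AmazonSagemakerExecutionPolicy
```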

For more information, see [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

# Preparing and uploading lifecycle scripts

After creating all the required resources, you'll need to set up [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) for your SageMaker HyperPod cluster. These scripts provide a [base configuration](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) that you can use to create a basic HyperPod Slurm cluster.

## Prepare the lifecycle scripts


Follow these steps to get the lifecycle scripts.

1. Download the [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) from the GitHub repository to your machine.

1. Upload the [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) to the Amazon S3 bucket you created in [Provision basic resources](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-basic), using the [cp](https://docs.aws.amazon.com//cli/latest/reference/s3/cp.html) CLI command.

   ```
   aws s3 cp --recursive LifeCycleScripts/base-config s3://${ROOT_BUCKET_NAME}/LifeCycleScripts/base-config
   ```

## Create configuration file


Follow these steps to create the configuration file and upload it to the same Amazon S3 bucket where you store the lifecycle scripts.

1. Create a configuration file named `provisioning_parameters.json` with the following configuration. Note that `slurm_sns_arn` is optional. If it is not provided, HyperPod does not set up Amazon SNS notifications.

   ```
   cat <<EOF > /tmp/provisioning_parameters.json
   {
     "version": "1.0.0",
     "workload_manager": "slurm",
     "controller_group": "$CONTOLLER_IG_NAME",
     "login_group": "my-login-group",
     "worker_groups": [
       {
         "instance_group_name": "$COMPUTE_IG_NAME",
         "partition_name": "dev"
       }
     ],
     "fsx_dns_name": "$SLURM_FSX_DNS_NAME",
     "fsx_mountname": "$SLURM_FSX_MOUNT_NAME",
     "slurm_configurations": {
       "slurm_database_secret_arn": "$SLURM_DB_SECRET_ARN",
       "slurm_database_endpoint": "$SLURM_DB_ENDPOINT_ADDRESS",
       "slurm_shared_directory": "/fsx",
       "slurm_database_user": "$DB_USER_NAME",
       "slurm_sns_arn": "$SLURM_SNS_FAILOVER_TOPIC_ARN"
     }
   }
   EOF
   ```

1. Upload the `provisioning_parameters.json` file to the same Amazon S3 bucket where you store the lifecycle scripts.

   ```
   aws s3 cp /tmp/provisioning_parameters.json s3://${ROOT_BUCKET_NAME}/LifeCycleScripts/base-config/provisioning_parameters.json
   ```
**Note**  
If you are using API-driven configuration, the `provisioning_parameters.json` file is not required. With API-driven configuration, you define Slurm node types, partitions, and FSx mounting directly in the CreateCluster API payload. For details, see [Getting started with SageMaker HyperPod using the AWS CLI](smcluster-getting-started-slurm-cli.md).
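Before uploading, it is worth confirming the generated file is well-formed JSON and contains the keys used above. The following is only a sanity-check sketch: the key list is taken from the example configuration in this topic, not from an official schema, and the function name is ours.

```shell
# Validate provisioning_parameters.json: well-formed JSON plus a few
# expected top-level keys from the example configuration above.
check_provisioning_params() {
    file="$1"
    status=0
    if ! python3 -m json.tool "$file" > /dev/null 2>&1; then
        echo "not valid JSON: $file" >&2
        return 1
    fi
    for key in version workload_manager controller_group worker_groups \
               fsx_dns_name fsx_mountname slurm_configurations; do
        if ! grep -q "\"$key\"" "$file"; then
            echo "missing key: $key" >&2
            status=1
        fi
    done
    return $status
}

# Usage:
#   check_provisioning_params /tmp/provisioning_parameters.json
```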

## Verify files in Amazon S3 bucket


After you upload all the lifecycle scripts and the `provisioning_parameters.json` file, your Amazon S3 bucket should look like the following.

![Image showing all the lifecycle scripts uploaded to the Amazon S3 bucket in the Amazon Simple Storage Service console.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-scripts-s3.png)


For more information, see [Start with base lifecycle scripts provided by HyperPod](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.html).

# Creating a SageMaker HyperPod cluster

After setting up all the required resources and uploading the scripts to the Amazon S3 bucket, you can create a cluster.

1. To create a cluster, run the [create-cluster](https://docs.aws.amazon.com//cli/latest/reference/sagemaker/create-cluster.html) AWS CLI command. The creation process can take up to 15 minutes to complete.

   ```
   aws --region $REGION sagemaker create-cluster \
       --cluster-name $HP_CLUSTER_NAME \
       --vpc-config '{
           "SecurityGroupIds":["'$SECURITY_GROUP'"],
           "Subnets":["'$PRIMARY_SUBNET'", "'$BACKUP_SUBNET'"]
       }' \
       --instance-groups '[{                  
       "InstanceGroupName": "'$CONTOLLER_IG_NAME'",
       "InstanceType": "ml.t3.medium",
       "InstanceCount": 2,
       "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$ROOT_BUCKET_NAME'/LifeCycleScripts/base-config",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$SLURM_EXECUTION_ROLE_ARN'",
       "ThreadsPerCore": 1
   },
   {
       "InstanceGroupName": "'$COMPUTE_IG_NAME'",          
       "InstanceType": "ml.c5.xlarge",
       "InstanceCount": 2,
       "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$ROOT_BUCKET_NAME'/LifeCycleScripts/base-config",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$COMPUTE_NODE_ROLE'",
       "ThreadsPerCore": 1
   }]'
   ```

   After successful execution, the command returns the cluster ARN like the following.

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-east-1:111122223333:cluster/cluster_id"
   }
   ```

1. (Optional) To check the status of your cluster, you can use the SageMaker AI console ([https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/)). From the left navigation, choose **HyperPod Clusters**, then choose **Cluster Management**. Choose a cluster name to open the cluster details page. If your cluster is created successfully, you will see that the cluster status is **InService**.  
![Image showing a HyperPod Slurm cluster with multiple controller nodes in the Amazon SageMaker AI console.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-multihead-cluster.png)
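You can also poll the status from the command line instead of the console. The following is a hedged sketch using the `describe-cluster` CLI command; the function name and polling loop are ours, and you may want a different sleep interval.

```shell
# Print the current status of a HyperPod cluster (for example,
# Creating or InService). Argument: cluster name.
cluster_status() {
    aws --region "$REGION" sagemaker describe-cluster \
        --cluster-name "$1" \
        --query ClusterStatus --output text
}

# Usage: wait until the cluster is in service.
#   until [ "$(cluster_status $HP_CLUSTER_NAME)" = "InService" ]; do
#       sleep 30
#   done
```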

# Considering important notes

This section provides several important notes that you might find helpful.

1. To migrate to a multi-controller Slurm cluster, complete these steps.

   1. Follow the instructions in [Provisioning resources using CloudFormation stacks](sagemaker-hyperpod-multihead-slurm-cfn.md) to provision all the required resources.

   1. Follow the instructions in [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md) to upload the updated lifecycle scripts. When updating the `provisioning_parameters.json` file, move your existing controller group to the `worker_groups` section, and add a new controller group name in the `controller_group` section.

   1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) CLI command to add the new controller group while keeping the original controller group and compute instance groups.

1. To scale down the number of controller nodes, use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) CLI command. For each controller instance group, the minimum number of controller nodes you can scale down to is 1. This means that you cannot scale down the number of controller nodes to 0.
**Important**  
For clusters created before Jan 24, 2025, you must first update your cluster software using the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API before running the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) CLI command.

   The following is an example CLI command to scale down the number of controller nodes.

   ```
   aws sagemaker update-cluster \
       --cluster-name my_cluster \
       --instance-groups '[{                  
       "InstanceGroupName": "controller_ig_name",
       "InstanceType": "ml.t3.medium",
       "InstanceCount": 3,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket1",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "slurm_execution_role_arn",
       "ThreadsPerCore": 1
   },
   {
       "InstanceGroupName": "compute-ig_name",       
       "InstanceType": "ml.c5.xlarge",
       "InstanceCount": 2,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket1",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "compute_node_role_arn",
       "ThreadsPerCore": 1
   }]'
   ```

1. To batch delete the controller nodes, use the [batch-delete-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html) CLI command. For each controller instance group, you must keep at least one controller node; an attempt to batch delete all the controller nodes in an instance group fails.
**Important**  
For clusters created before Jan 24, 2025, you must first update your cluster software using the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API before running the [batch-delete-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html) CLI command.

   The following is an example CLI command to batch delete the controller nodes.

   ```
   aws sagemaker batch-delete-cluster-nodes --cluster-name my_cluster --node-ids instance_ids_to_delete
   ```
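   To find the instance IDs to pass to `batch-delete-cluster-nodes`, you can first list the nodes in the controller instance group. The following is a hedged sketch using the `list-cluster-nodes` CLI command with its `--instance-group-name-contains` filter; the cluster and group names match the example above, and the function name is ours.

   ```shell
   # List instance IDs in the controller instance group so you can pick
   # which ones to delete (always keep at least one).
   # Arguments: cluster name, instance group name.
   controller_node_ids() {
       aws sagemaker list-cluster-nodes \
           --cluster-name "$1" \
           --instance-group-name-contains "$2" \
           --query 'ClusterNodeSummaries[].InstanceId' \
           --output text
   }

   # Usage:
   #   controller_node_ids my_cluster controller_ig_name
   ```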

1. To troubleshoot your cluster creation issues, check the failure message from the cluster details page in your SageMaker AI console. You can also use CloudWatch logs to troubleshoot cluster creation issues. From the CloudWatch console, choose **Log groups**. Then, search `clusters` to see the list of log groups related to your cluster creation.  
![Image showing Amazon SageMaker HyperPod cluster log groups in the CloudWatch console.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-multihead-logs.png)

# Reviewing environment variables reference

The following environment variables are defined and used in the tutorial [Setting up multiple controller nodes for a SageMaker HyperPod Slurm cluster](sagemaker-hyperpod-multihead-slurm-setup.md). They are available only in the current shell session unless explicitly preserved, and are referenced using the `$VARIABLE_NAME` syntax. Variables whose values come from CloudFormation stack outputs represent AWS-created resources; the others are user-defined.


**Environment variables reference**  

| Variable | Description | 
| --- | --- | 
| `$BACKUP_SUBNET` | ID of the backup private subnet, from the `BackupPrivateSubnet` output of the `sagemaker-hyperpod` stack. | 
| `$COMPUTE_IG_NAME` | User-defined name of the compute (worker) node instance group. | 
| `$COMPUTE_NODE_ROLE` | ARN of the execution role for the compute node instance group, from the `AmazonSagemakerClusterExecutionRoleArn` output of the `sagemaker-hyperpod` stack. | 
| `$CONTOLLER_IG_NAME` | User-defined name of the controller (head) node instance group. | 
| `$DB_USER_NAME` | User-defined user name for the Slurm accounting database. | 
| `$EMAIL` | User-defined email address that receives the Amazon SNS notifications. | 
| `$PRIMARY_SUBNET` | ID of the primary private subnet, from the `PrimaryPrivateSubnet` output of the `sagemaker-hyperpod` stack. | 
| `$POLICY` | ARN of the `AmazonSagemakerExecutionPolicy` IAM policy. | 
| `$REGION` | AWS Region where you create the cluster, for example `us-east-1`. | 
| `$ROOT_BUCKET_NAME` | Name of the Amazon S3 bucket that stores the lifecycle scripts, from the `AmazonS3BucketName` output of the `sagemaker-hyperpod` stack. | 
| `$SECURITY_GROUP` | ID of the Amazon EC2 security group, from the `SecurityGroup` output of the `sagemaker-hyperpod` stack. | 
| `$SLURM_DB_ENDPOINT_ADDRESS` | Endpoint address of the Amazon RDS for MariaDB instance, from the `SlurmDBEndpointAddress` output of the `sagemaker-hyperpod-mh` stack. | 
| `$SLURM_DB_SECRET_ARN` | ARN of the AWS Secrets Manager secret that stores the database credentials, from the `SlurmDBSecretArn` output of the `sagemaker-hyperpod-mh` stack. | 
| `$SLURM_EXECUTION_ROLE_ARN` | ARN of the execution role for the controller node instance group, from the `SlurmExecutionRoleArn` output of the `sagemaker-hyperpod-mh` stack. | 
| `$SLURM_FSX_DNS_NAME` | DNS name of the Amazon FSx for Lustre file system, from the `FSxLustreFilesystemDNSname` output of the `sagemaker-hyperpod` stack. | 
| `$SLURM_FSX_MOUNT_NAME` | Mount name of the Amazon FSx for Lustre file system, from the `FSxLustreFilesystemMountname` output of the `sagemaker-hyperpod` stack. | 
| `$SLURM_SNS_FAILOVER_TOPIC_ARN` | ARN of the Amazon SNS topic used for failover notifications, from the `SlurmFailOverSNSTopicArn` output of the `sagemaker-hyperpod-mh` stack. | 