You can scale down the number of instances running on your Amazon SageMaker HyperPod cluster. You might want to scale down a cluster for various reasons, such as reduced resource utilization or cost optimization.
The following page outlines two main approaches to scaling down:
- Scale down at the instance group level: This approach uses the UpdateCluster API, with which you can:
  - Scale down the instance counts for specific instance groups independently. SageMaker AI handles the termination of nodes in a way that reaches the new target instance counts you've set for each group. See Scale down an instance group.
  - Completely delete instance groups from your cluster. See Delete instance groups.
- Scale down at the instance level: This approach uses the BatchDeleteClusterNodes API, with which you can specify the individual nodes you want to terminate. See Scale down at the instance level.
Note
When scaling down at the instance level with BatchDeleteClusterNodes, you can terminate a maximum of 99 instances at a time. UpdateCluster supports terminating any number of instances.
Important considerations
- When scaling down a cluster, ensure that the remaining resources are sufficient to handle your workload and that any necessary data migration or rebalancing is handled properly to avoid disruptions.
- Back up your data to Amazon S3 or an FSx for Lustre file system before invoking the API on a worker node group. This helps prevent potential data loss from the instance root volume. For more information about backup, see Use the backup script provided by SageMaker HyperPod.
- To invoke this API on an existing cluster, you must first patch the cluster by running the UpdateClusterSoftware API (a minimal CLI sketch follows this list). For more information about patching a cluster, see Update the SageMaker HyperPod platform software of a cluster.
- Metering and billing for on-demand instances stops automatically after scale down. To stop metering for scaled-down reserved instances, reach out to your AWS account team for support.
- You can use the released capacity from the scaled-down reserved instances to scale up another SageMaker HyperPod cluster.
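The patching step mentioned above can be run with the AWS CLI. The following is a minimal sketch, assuming a cluster named my-hyperpod-cluster (substitute your own cluster name):

aws sagemaker update-cluster-software \
    --cluster-name "my-hyperpod-cluster"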
Scale down at the instance group level
The UpdateCluster operation allows you to make changes to the configuration of your SageMaker HyperPod cluster, such as scaling down the number of instances of an instance group or removing entire instance groups. This can be useful when you want to adjust the resources allocated to your cluster based on changes in your workload, optimize costs, or change the instance type of an instance group.
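Before submitting an update, it can help to check the current instance groups and their counts. The following is a minimal sketch using the DescribeCluster operation; the cluster name is a placeholder, and the CurrentCount and TargetCount fields in the --query expression are assumptions about the DescribeCluster response shape:

aws sagemaker describe-cluster \
    --cluster-name "my-hyperpod-cluster" \
    --query "InstanceGroups[].{Name:InstanceGroupName,Current:CurrentCount,Target:TargetCount}"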
Scale down an instance group
Use this approach when you have an instance group that is idle and it's safe to terminate any of its instances. When you submit an UpdateCluster request to scale down, HyperPod randomly chooses instances for termination and scales down to the specified number of nodes for the instance group.
Note
When you scale the number of instances in an instance group down to 0, all the instances within that group will be terminated. However, the instance group itself will still exist as part of the SageMaker HyperPod cluster. You can scale the instance group back up at a later time, using the same instance group configuration.
Alternatively, you can choose to remove an instance group permanently. For more information, see Delete instance groups.
To scale down with UpdateCluster
1. Follow the steps outlined in Update SageMaker HyperPod cluster configuration. When you reach step 1.d, where you specify the InstanceCount field, enter a number that is smaller than the current number of instances to scale down the cluster.
2. Run the update-cluster AWS CLI command to submit your request.
The following is an example of an UpdateCluster JSON object. Consider the case where your instance group currently has 2 running instances. If you set the InstanceCount field to 1, as shown in the example, HyperPod randomly selects one of the instances and terminates it.
{
    "ClusterName": "name-of-cluster-to-update",
    "InstanceGroups": [
        {
            "InstanceGroupName": "training-instances",
            "InstanceType": "instance-type",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
                "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
            "ThreadsPerCore": number-of-threads,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        }
    ],
    "NodeRecovery": "Automatic"
}
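To submit this configuration, you can save the JSON object to a file and pass it to the update-cluster AWS CLI command. The following is a minimal sketch; the file name update-cluster-scale-down.json is illustrative:

aws sagemaker update-cluster \
    --cli-input-json file://update-cluster-scale-down.json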
Delete instance groups
You can use the UpdateCluster operation to remove entire instance groups from your SageMaker HyperPod cluster when they are no longer needed. This goes beyond simple scaling down, allowing you to completely eliminate specific instance groups from your cluster's configuration.
Note
When removing an instance group:
- All instances within the targeted group are terminated.
- The entire group configuration is deleted from the cluster.
- Any workloads running on that instance group are stopped.
To delete instance groups with UpdateCluster
1. When following the steps outlined in Update SageMaker HyperPod cluster configuration:
   - Set the optional InstanceGroupsToDelete parameter in your UpdateCluster JSON and pass the comma-separated list of instance group names that you want to delete.
   - When you specify the InstanceGroups list, ensure that the specifications of the instance groups you are removing are no longer listed in the InstanceGroups list.
2. Run the update-cluster AWS CLI command to submit your request.
Important
- Your SageMaker HyperPod cluster must always maintain at least one instance group.
- Ensure all critical data is backed up before removal.
- The removal process cannot be undone.
The following is an example of an UpdateCluster JSON object. Consider the case where a cluster currently has 3 instance groups: a training group, a prototype-training group, and an inference-serving group. You want to delete the prototype-training group.
{
    "ClusterName": "name-of-cluster-to-update",
    "InstanceGroups": [
        {
            "InstanceGroupName": "training",
            "InstanceType": "instance-type",
            "InstanceCount": number-of-instances,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
                "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
            "ThreadsPerCore": number-of-threads,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        {
            "InstanceGroupName": "inference-serving",
            "InstanceType": "instance-type",
            "InstanceCount": 2,
            [...]
        }
    ],
    "InstanceGroupsToDelete": [ "prototype-training" ],
    "NodeRecovery": "Automatic"
}
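After the update completes, you can confirm that the instance group was removed by listing the remaining groups. The following is a minimal sketch using the DescribeCluster operation with the placeholder cluster name from the example above:

aws sagemaker describe-cluster \
    --cluster-name "name-of-cluster-to-update" \
    --query "InstanceGroups[].InstanceGroupName"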
Scale down at the instance level
The BatchDeleteClusterNodes operation allows you to scale down a SageMaker HyperPod cluster by specifying the individual nodes you want to terminate. BatchDeleteClusterNodes provides more granular control for targeted node removal and cluster optimization. For example, you might use BatchDeleteClusterNodes to delete targeted nodes for maintenance, rolling upgrades, or rebalancing resources geographically.
API request and response
When you submit a BatchDeleteClusterNodes request, SageMaker HyperPod deletes nodes by their instance IDs. The API accepts a request with the cluster name and a list of node IDs to be deleted.
The response includes two sections:
- Failed: A list of errors of type BatchDeleteClusterNodesError, one per instance ID.
- Successful: The list of instance IDs successfully terminated.
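For illustration, a partially successful response might look like the following sketch. The node IDs are placeholders, and the fields inside each Failed entry (NodeId, Code, Message) and the example error code are assumptions about the BatchDeleteClusterNodesError shape rather than values taken from this page:

{
    "Successful": [ "i-111112222233333" ],
    "Failed": [
        {
            "NodeId": "i-444445555566666",
            "Code": "NodeIdInUse",
            "Message": "The node is currently in use and cannot be deleted."
        }
    ]
}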
Validation and error handling
The API performs various validations, such as:
- Verifying the node ID format (prefix of i- and Amazon EC2 instance ID structure).
- Checking the node list length, with a limit of 99 or fewer node IDs in a single BatchDeleteClusterNodes request.
- Ensuring a valid SageMaker HyperPod cluster with the input cluster name is present and that no cluster-level operations (update, system update, patching, or deletion) are in progress.
- Handling cases where instances are not found, have an invalid status, or are in use.
API response codes
- The API returns a 200 status code for successful requests (for example, all input nodes pass validation) and for partially successful requests (for example, some input nodes fail validation).
- If all input nodes fail validation, the API returns a 400 Bad Request response with the appropriate error messages and error codes.
Example
The following is an example of scaling down a cluster at the instance level using the AWS CLI:
aws sagemaker batch-delete-cluster-nodes \
    --cluster-name "cluster-name" \
    --node-ids '["i-111112222233333", "i-222233333111111"]'
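To confirm that the nodes were terminated, you can list the nodes remaining in the cluster. The following is a minimal sketch using the same placeholder cluster name:

aws sagemaker list-cluster-nodes --cluster-name "cluster-name"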