Scale down a SageMaker HyperPod cluster
You can scale down the number of instances running on your Amazon SageMaker HyperPod cluster. You might want to scale down a cluster for various reasons, such as reduced resource utilization or cost optimization.
The following page outlines two main approaches to scaling down:
- Scale down at the instance group level: This approach uses the UpdateCluster API, with which you can scale down the instance counts for specific instance groups independently. SageMaker AI handles the termination of nodes in a way that reaches the new target instance counts you've set for each group.
- Scale down at the instance level: This approach uses the BatchDeleteClusterNodes API, with which you can specify the individual nodes you want to terminate.
Note
When scaling down at the instance level with BatchDeleteClusterNodes, you can only terminate a maximum of 99 instances at a time. UpdateCluster supports terminating any number of instances.
Important considerations
- When scaling down a cluster, ensure that the remaining resources are sufficient to handle your workload and that any necessary data migration or rebalancing is properly handled to avoid disruptions.
- Make sure to back up your data to Amazon S3 or an FSx for Lustre file system before invoking the API on a worker node group. This helps prevent potential data loss from the instance root volume. For more information about backup, see Use the backup script provided by SageMaker HyperPod.
- To invoke this API on an existing cluster, you must first patch the cluster by running the UpdateClusterSoftware API. For more information about patching a cluster, see Update the SageMaker HyperPod platform software of a cluster. Illustrative backup and patching commands follow this list.
- Metering and billing for on-demand instances stops automatically after scale-down. To stop metering for scaled-down reserved instances, reach out to your AWS account team for support.
- You can use the released capacity from the scaled-down reserved instances to scale up another SageMaker HyperPod cluster.
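For example, assuming your data lives on a shared FSx for Lustre mount, the following commands sketch a backup to Amazon S3 followed by the patching step. The local path, bucket name, and cluster name are placeholder values for this sketch; substitute your own.
# Back up data from a shared file system to Amazon S3 (placeholder path and bucket)
aws s3 sync /fsx/checkpoints s3://amzn-s3-demo-bucket/hyperpod-backup/
# Patch the cluster by running the UpdateClusterSoftware API
aws sagemaker update-cluster-software --cluster-name "cluster-name"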
Scale down at the instance group level
The UpdateCluster operation allows you to make changes to the configuration of your SageMaker HyperPod cluster, such as scaling down the number of instances of an instance group. This can be useful when you want to adjust the resources allocated to your cluster based on changes in your workload, optimize costs, or change the instance type of an instance group.
Use this approach when you have an instance group that is idle and it's safe to terminate any of the instances for scaling down. When you submit an UpdateCluster request to scale down, HyperPod randomly chooses instances for termination and scales down to the specified number of nodes for the instance group.
Note
When you scale the number of instances in an instance group down to 0, all the instances within that group will be terminated. However, the instance group itself will still exist as part of the SageMaker HyperPod cluster. You can scale the instance group back up at a later time, using the same instance group configuration.
To scale down with UpdateCluster
1. Follow the steps outlined in Update SageMaker HyperPod cluster configuration. When you reach step 1.d where you specify the InstanceCount field, enter a number that is smaller than the current number of instances to scale down the cluster.
2. Run the update-cluster AWS CLI command to submit your request.
The following is an example of an UpdateCluster JSON object. Consider the case where your instance group currently has 2 running instances. If you set the InstanceCount field to 1, as shown in the example, then HyperPod randomly selects one of the instances and terminates it.
{
    "ClusterName": "name-of-cluster-to-update",
    "InstanceGroups": [
        {
            "InstanceGroupName": "training-instances",
            "InstanceType": "instance-type",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
                "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
            "ThreadsPerCore": number-of-threads,
            "OnStartDeepHealthChecks": [ "InstanceStress", "InstanceConnectivity" ]
        }
    ],
    "NodeRecovery": "Automatic"
}
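If you save this configuration to a file, one way to submit the request is with the standard --cli-input-json option of the update-cluster AWS CLI command. The file name below is a placeholder used for illustration.
# Submit the scale-down request using the JSON configuration above
# (update-cluster-config.json is a placeholder file name)
aws sagemaker update-cluster --cli-input-json file://update-cluster-config.json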
Scale down at the instance level
The BatchDeleteClusterNodes operation allows you to scale down a SageMaker HyperPod cluster by specifying the individual nodes you want to terminate. BatchDeleteClusterNodes provides more granular control for targeted node removal and cluster optimization. For example, you might use BatchDeleteClusterNodes to delete targeted nodes for maintenance, rolling upgrades, or rebalancing resources geographically.
API request and response
When you submit a BatchDeleteClusterNodes request, SageMaker HyperPod deletes nodes by their instance IDs. The API accepts a request with the cluster name and a list of node IDs to be deleted.
The response includes two sections:
- Failed: A list of errors of type BatchDeleteClusterNodesError, one per instance ID that could not be deleted.
- Successful: The list of instance IDs successfully terminated.
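For illustration, a response to a partially successful request might look similar to the following sketch; the node IDs, error code, and message are placeholder values, not output from a real cluster.
{
    "Successful": [ "i-111112222233333" ],
    "Failed": [
        {
            "NodeId": "i-444445555566666",
            "Code": "NodeIdNotFound",
            "Message": "Node not found in the cluster."
        }
    ]
}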
Validation and error handling
The API performs various validations, such as:
- Verifying the node ID format (prefix of i- and Amazon EC2 instance ID structure).
- Checking the node list length, with a limit of 99 or fewer node IDs in a single BatchDeleteClusterNodes request.
- Ensuring a valid SageMaker HyperPod cluster with the input cluster-name is present and that no cluster-level operations (update, system update, patching, or deletion) are in progress.
- Handling cases where instances are not found, have invalid status, or are in use.
API response codes
- The API returns a 200 status code for successful requests (for example, all input nodes pass validation) or partially successful requests (for example, some input nodes fail validation).
- If the request fails entirely (for example, all input nodes fail validation), the API returns a 400 Bad Request response with the appropriate error messages and error codes.
Example
The following is an example of scaling down a cluster at the instance level using the AWS CLI:
aws sagemaker batch-delete-cluster-nodes --cluster-name "cluster-name" --node-ids '["i-111112222233333", "i-222223333344444"]'
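As a follow-up sketch, you can narrow the response to the failed deletions with the standard AWS CLI --query option; the JMESPath expression and node IDs below are illustrative placeholders.
# Show only the node IDs that could not be deleted, with their error codes
aws sagemaker batch-delete-cluster-nodes --cluster-name "cluster-name" \
    --node-ids '["i-111112222233333", "i-222223333344444"]' \
    --query 'Failed[].[NodeId, Code]' --output table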