Learn how to create SageMaker HyperPod clusters orchestrated by Amazon EKS using the AWS CLI.
-
Before creating a SageMaker HyperPod cluster:
-
Ensure that you have an existing Amazon EKS cluster up and running. For detailed instructions about how to set up an Amazon EKS cluster, see Create an Amazon EKS cluster in the Amazon EKS User Guide.
-
Install the Helm chart as instructed in Install packages on the Amazon EKS cluster using Helm.
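The two prerequisites above can be checked from a terminal before you continue. The following is a minimal sketch rather than an official procedure: `require_cmd` and `check_prerequisites` are hypothetical helper names introduced here, and `my-eks-cluster` in the usage comment is a placeholder. It assumes the AWS CLI and Helm are installed, and that your credentials and kubeconfig point at the account, Region, and EKS cluster you intend to use.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: print the EKS cluster status and list Helm releases.
# $1 is the name of your existing EKS cluster.
check_prerequisites() {
  require_cmd aws  || { echo "aws CLI not found" >&2; return 1; }
  require_cmd helm || { echo "helm not found" >&2; return 1; }

  # An EKS cluster that is up and running reports the status ACTIVE.
  aws eks describe-cluster --name "$1" \
    --query 'cluster.status' --output text || return 1

  # List Helm releases in every namespace so you can confirm
  # that the chart from the previous step is installed.
  helm list --all-namespaces
}

# Usage (placeholder cluster name):
#   check_prerequisites my-eks-cluster
```

The functions only read state, so running them should not change either cluster.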
-
-
Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket, such as s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/.
For a quick start, download the sample script on_create.sh from the AWSome Distributed Training GitHub repository, and upload it to the S3 bucket. This script sets up the logging file /var/log/provision/provisioning.log required for CloudWatch to gather logs from Pod containers. You can also include additional setup instructions, a series of setup scripts, or commands to be executed during the HyperPod cluster provisioning stage.
Important
If you create an IAM role for SageMaker HyperPod attaching only the managed AmazonSageMakerClusterInstanceRolePolicy, your cluster has access to Amazon S3 buckets with the specific prefix sagemaker-.
-
Prepare a CreateCluster API request file in JSON format. For ExecutionRole, provide the ARN of the IAM role you created with the managed AmazonSageMakerClusterInstanceRolePolicy from the section IAM role for SageMaker HyperPod.
Note
Ensure that your SageMaker HyperPod cluster is deployed within the same Virtual Private Cloud (VPC) as your Amazon EKS cluster. The subnets and security groups specified in the SageMaker HyperPod cluster configuration must allow network connectivity and communication with the Amazon EKS cluster's API server endpoint.
// create_cluster.json
{
    "ClusterName": "string",
    "InstanceGroups": [{
        "InstanceGroupName": "string",
        "InstanceType": "string",
        "InstanceCount": number,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/<lifecycle-script-directory>/src/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "string",
        "ThreadsPerCore": number,
        "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
    }],
    "VpcConfig": {
        "SecurityGroupIds": ["string"],
        "Subnets": ["string"]
    },
    "Tags": [{
        "Key": "string",
        "Value": "string"
    }],
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "string"
        }
    },
    "NodeRecovery": "Automatic"
}
Note the following when configuring a new SageMaker HyperPod cluster associated with an EKS cluster.
-
You can configure up to 20 instance groups under the InstanceGroups parameter.
-
For Orchestrator.Eks.ClusterArn, specify the ARN of the EKS cluster you want to use as the orchestrator.
-
For OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable Deep health checks.
-
For NodeRecovery, specify Automatic to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when the health-monitoring agent finds issues.
-
For the Tags parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You can add tags to your cluster in the same way you add them in other AWS services that support tagging. To learn more about tagging AWS resources in general, see the Tagging AWS Resources User Guide.
-
For the VpcConfig parameter, specify the information of the VPC used in the EKS cluster. The subnets must be private.
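The values that the request file refers to can be gathered ahead of time. The sketch below is illustrative only: `is_sagemaker_prefixed`, `upload_lifecycle_script`, `eks_cluster_arn`, and `eks_vpc_config` are hypothetical helper names, and the bucket and cluster names in the usage comments are placeholders. The prefix check is based on the earlier note about roles that carry only `AmazonSageMakerClusterInstanceRolePolicy`; treating a missing `sagemaker-` prefix as a warning is an assumption, since a role with broader S3 permissions would not need it.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the S3 URI names a bucket
# whose name starts with "sagemaker-".
is_sagemaker_prefixed() {
  case "$1" in
    s3://sagemaker-*) return 0 ;;
    *)                return 1 ;;
  esac
}

# Hypothetical helper: upload lifecycle script $1 to S3 prefix $2.
# Warns (but still uploads) when the bucket lacks the sagemaker- prefix.
upload_lifecycle_script() {
  is_sagemaker_prefixed "$2" ||
    echo "warning: bucket in $2 does not start with sagemaker-" >&2
  aws s3 cp "$1" "$2"
}

# Hypothetical helper: print the ARN of EKS cluster $1,
# for use as Orchestrator.Eks.ClusterArn.
eks_cluster_arn() {
  aws eks describe-cluster --name "$1" \
    --query 'cluster.arn' --output text
}

# Hypothetical helper: print the VPC settings of EKS cluster $1,
# as a starting point for choosing VpcConfig values.
eks_vpc_config() {
  aws eks describe-cluster --name "$1" \
    --query 'cluster.resourcesVpcConfig'
}

# Usage (placeholder names):
#   upload_lifecycle_script on_create.sh s3://sagemaker-example-bucket/src/
#   eks_cluster_arn my-eks-cluster
#   eks_vpc_config my-eks-cluster
```

The VPC output lists every subnet attached to the EKS cluster, so you still need to pick the private ones for the request file yourself.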
-
-
Run the create-cluster command as follows.
Important
When running the create-cluster command with the --cli-input-json parameter, you must include the file:// prefix before the complete path to the JSON file. This prefix is required to ensure that the AWS CLI recognizes the input as a file path. Omitting the file:// prefix results in a parsing parameter error.
aws sagemaker create-cluster \
    --cli-input-json file://complete/path/to/create_cluster.json
This should return the ARN of the new cluster.
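One way to avoid the `file://` parsing error is to build the argument in a small wrapper, and then confirm the result by querying the cluster status. This is a sketch under stated assumptions: `to_file_uri` and `create_and_check` are hypothetical helpers, not part of the AWS CLI, and the cluster name you pass must match the `ClusterName` in your request file.

```shell
#!/bin/sh
# Hypothetical helper: turn a path into a file:// argument
# for --cli-input-json. Relative paths are resolved against
# the current working directory.
to_file_uri() {
  case "$1" in
    /*) printf 'file://%s\n' "$1" ;;
    *)  printf 'file://%s/%s\n' "$PWD" "$1" ;;
  esac
}

# Hypothetical helper: create the cluster from request file $1,
# then print the current status of the cluster named $2.
create_and_check() {
  aws sagemaker create-cluster \
    --cli-input-json "$(to_file_uri "$1")" || return 1

  aws sagemaker describe-cluster --cluster-name "$2" \
    --query 'ClusterStatus' --output text
}

# Usage (placeholder names):
#   create_and_check create_cluster.json my-hyperpod-cluster
```

Provisioning takes time, so the status printed right after creation is unlikely to be the final one; rerun the `describe-cluster` call until the status settles.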