Create a SageMaker HyperPod cluster

Learn how to create SageMaker HyperPod clusters orchestrated by Amazon EKS using the AWS CLI.

  1. Before creating a SageMaker HyperPod cluster:

    1. Ensure that you have an existing Amazon EKS cluster up and running. For detailed instructions about how to set up an Amazon EKS cluster, see Create an Amazon EKS cluster in the Amazon EKS User Guide.

    2. Install the Helm chart as instructed in Install packages on the Amazon EKS cluster using Helm.

  2. Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket, such as s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/.

    For a quick start, download the sample script on_create.sh from the AWSome Distributed Training GitHub repository, and upload it to the S3 bucket. This script sets up the logging file /var/log/provision/provisioning.log required for CloudWatch to gather logs from Pod containers. You can also include additional setup instructions, a series of setup scripts, or commands to be executed during the HyperPod cluster provisioning stage.

    Important

    If you create an IAM role for SageMaker HyperPod and attach only the managed AmazonSageMakerClusterInstanceRolePolicy, your cluster has access only to Amazon S3 buckets whose names begin with the prefix sagemaker-.
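
    The prefix requirement in the note above can be checked before uploading a lifecycle script. The following is a minimal sketch, assuming the note means that the bucket name itself must begin with sagemaker-. The helper names bucket_name and has_sagemaker_prefix are hypothetical and not part of any AWS SDK.

```python
from urllib.parse import urlparse

# Assumption: with only the managed policy attached, the bucket name must
# start with this prefix, as the note above describes.
REQUIRED_BUCKET_PREFIX = "sagemaker-"


def bucket_name(s3_uri: str) -> str:
    """Return the bucket portion of an s3:// URI (hypothetical helper)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc


def has_sagemaker_prefix(s3_uri: str) -> bool:
    """True if the bucket name begins with the required prefix."""
    return bucket_name(s3_uri).startswith(REQUIRED_BUCKET_PREFIX)


# The example bucket from step 2 does not carry the prefix.
print(has_sagemaker_prefix("s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/"))  # False
# A hypothetical bucket name that does carry the prefix.
print(has_sagemaker_prefix("s3://sagemaker-example-bucket/Lifecycle-scripts/base-config/"))  # True
```

    If the check fails, either use a bucket whose name carries the prefix or attach additional S3 permissions to the IAM role.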

  3. Prepare a CreateCluster API request file in JSON format. For ExecutionRole, provide the ARN of the IAM role you created with the managed AmazonSageMakerClusterInstanceRolePolicy from the section IAM role for SageMaker HyperPod.

    Note

    Ensure that your SageMaker HyperPod cluster is deployed within the same Virtual Private Cloud (VPC) as your Amazon EKS cluster. The subnets and security groups specified in the SageMaker HyperPod cluster configuration must allow network connectivity and communication with the Amazon EKS cluster's API server endpoint.

    // create_cluster.json
    {
        "ClusterName": "string",
        "InstanceGroups": [{
            "InstanceGroupName": "string",
            "InstanceType": "string",
            "InstanceCount": number,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/<lifecycle-script-directory>/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "string",
            "ThreadsPerCore": number,
            "OnStartDeepHealthChecks": [
                "InstanceStress", "InstanceConnectivity"
            ]
        }],
        "VpcConfig": {
            "SecurityGroupIds": ["string"],
            "Subnets": ["string"]
        },
        "Tags": [{
            "Key": "string",
            "Value": "string"
        }],
        "Orchestrator": {
            "Eks": {
                "ClusterArn": "string"
            }
        },
        "NodeRecovery": "Automatic"
    }

    Note the following when configuring a new SageMaker HyperPod cluster that is associated with an EKS cluster.

    • You can configure up to 20 instance groups under the InstanceGroups parameter.

    • For Orchestrator.Eks.ClusterArn, specify the ARN of the EKS cluster you want to use as the orchestrator.

    • For OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable Deep health checks.

    • For NodeRecovery, specify Automatic to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.

    • For the Tags parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You add tags to your cluster in the same way as in other AWS services that support tagging. To learn more about tagging AWS resources in general, see the Tagging AWS Resources User Guide.

    • For the VpcConfig parameter, specify the information of the VPC used in the EKS cluster. The subnets must be private.
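
    The notes above can be folded into a small script that generates the request file instead of editing it by hand. The following is a rough sketch using only the Python standard library. The function names, the instance type, and every ARN, subnet ID, and security group ID are hypothetical placeholders; the field names follow the create_cluster.json template in this step.

```python
import json

MAX_INSTANCE_GROUPS = 20  # limit stated in the notes above
DEEP_HEALTH_CHECKS = ("InstanceStress", "InstanceConnectivity")


def build_instance_group(name, instance_type, count, execution_role_arn,
                         lifecycle_s3_uri, threads_per_core=1):
    """Return one entry for the InstanceGroups list (hypothetical helper)."""
    return {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": count,
        "LifeCycleConfig": {
            "SourceS3Uri": lifecycle_s3_uri,
            "OnCreate": "on_create.sh",
        },
        "ExecutionRole": execution_role_arn,
        "ThreadsPerCore": threads_per_core,
        # Both checks are listed to enable deep health checks.
        "OnStartDeepHealthChecks": list(DEEP_HEALTH_CHECKS),
    }


def build_create_cluster_request(cluster_name, instance_groups,
                                 security_group_ids, subnet_ids,
                                 eks_cluster_arn, tags=None):
    """Assemble the request body, enforcing the instance group limit."""
    if len(instance_groups) > MAX_INSTANCE_GROUPS:
        raise ValueError(
            f"at most {MAX_INSTANCE_GROUPS} instance groups are allowed, "
            f"got {len(instance_groups)}"
        )
    request = {
        "ClusterName": cluster_name,
        "InstanceGroups": list(instance_groups),
        # Use the VPC of the EKS cluster; the subnets must be private.
        "VpcConfig": {
            "SecurityGroupIds": list(security_group_ids),
            "Subnets": list(subnet_ids),
        },
        "Orchestrator": {"Eks": {"ClusterArn": eks_cluster_arn}},
        "NodeRecovery": "Automatic",
    }
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


# All values below are illustrative placeholders, not real resources.
group = build_instance_group(
    name="worker-group-1",
    instance_type="ml.g5.8xlarge",
    count=2,
    execution_role_arn="arn:aws:iam::111122223333:role/ExampleHyperPodRole",
    lifecycle_s3_uri="s3://sagemaker-example-bucket/lifecycle/src/",
)
request = build_create_cluster_request(
    cluster_name="example-hyperpod-cluster",
    instance_groups=[group],
    security_group_ids=["sg-0123456789abcdef0"],
    subnet_ids=["subnet-0123456789abcdef0"],
    eks_cluster_arn="arn:aws:eks:us-west-2:111122223333:cluster/example-eks-cluster",
    tags={"project": "example"},
)
print(json.dumps(request, indent=4))
```

    Redirecting the printed output to create_cluster.json yields a file that can be passed to the create-cluster command. The sketch does not verify that the subnets are private or that the VPC matches the EKS cluster; those checks require calls to AWS.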

  4. Run the create-cluster command as follows.

    Important

    When running the create-cluster command with the --cli-input-json parameter, you must include the file:// prefix before the complete path to the JSON file. This prefix is required to ensure that the AWS CLI recognizes the input as a file path. Omitting the file:// prefix results in a parsing parameter error.

    aws sagemaker create-cluster \
        --cli-input-json file://complete/path/to/create_cluster.json

    This should return the ARN of the new cluster.
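
    Because a missing file:// prefix is an easy mistake to make, a wrapper script can add it automatically. The following is a minimal sketch that only builds the command line and does not run it. The helper names cli_input_json_arg and create_cluster_command are hypothetical.

```python
import os
import shlex
from typing import List


def cli_input_json_arg(path: str) -> str:
    """Return the path with the file:// prefix that --cli-input-json expects."""
    if path.startswith("file://"):
        return path  # already prefixed, leave unchanged
    # Expand to a complete path, as the step above recommends.
    return "file://" + os.path.abspath(path)


def create_cluster_command(request_path: str) -> List[str]:
    """Build the argument list for the create-cluster call."""
    return [
        "aws", "sagemaker", "create-cluster",
        "--cli-input-json", cli_input_json_arg(request_path),
    ]


# Print the command for inspection; the output depends on the working directory.
print(shlex.join(create_cluster_command("create_cluster.json")))
```

    To execute the command from Python, the list can be passed to subprocess.run, assuming the AWS CLI is installed and credentials are configured.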
