Using the SageMaker HyperPod console UI

Create your first SageMaker HyperPod cluster using the SageMaker HyperPod console UI.

Create your first SageMaker HyperPod cluster with Slurm

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the SageMaker console UI. Following the tutorial, you'll create a HyperPod cluster with three Slurm nodes, one in each of the instance groups my-controller-group, my-login-group, and worker-group-1.

  1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.

  2. Choose HyperPod Clusters in the left navigation pane.

  3. On the SageMaker HyperPod Clusters page, choose Create cluster.

  4. In Step 1: Cluster settings, specify a name for the new cluster. Skip the Tags section.

  5. In Step 2: Instance groups, add instance groups. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. For lifecycle configuration scripts to run on the instance groups during cluster creation, you can start with the sample lifecycle scripts provided in the Awsome Distributed Training GitHub repository.

    1. For Instance group name, specify a name for the instance group. For this tutorial, create three instance groups named my-controller-group, my-login-group, and worker-group-1.

    2. For Select instance type, choose the instance for the instance group. For this tutorial, select ml.c5.xlarge for my-controller-group, ml.m5.4xlarge for my-login-group, and ml.trn1.32xlarge for worker-group-1.

      Ensure that you choose an instance type with sufficient quotas in your account, or request additional quotas by following the instructions at SageMaker HyperPod quotas.
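
      As a reference, the following is a minimal sketch for checking your applied SageMaker quotas from the AWS CLI. The "cluster usage" filter string is an assumption about how the HyperPod quota names appear and may need adjusting for your account.

      # List applied SageMaker service quotas, keeping only entries whose names mention
      # "cluster usage" (an assumed naming pattern; adjust the filter if needed).
      aws service-quotas list-service-quotas \
          --service-code sagemaker \
          --query "Quotas[?contains(QuotaName, 'cluster usage')].[QuotaName, Value]" \
          --output table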

    3. For Quantity, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter 1 for all three groups.

    4. For S3 path to lifecycle script files, enter the Amazon S3 path in which your lifecycle scripts are stored. If you don't have lifecycle scripts, go through the following substeps to use the base lifecycle scripts provided by the SageMaker HyperPod service team.

      1. Clone the Awsome Distributed Training GitHub repository.

        git clone https://github.com/aws-samples/awsome-distributed-training/
      2. Under 1.architectures/5.sagemaker_hyperpods/LifecycleScripts/base-config, you can find a set of base lifecycle scripts. To learn more about the lifecycle scripts, see also Prepare lifecycle scripts for setting up Slurm on SageMaker HyperPod.

      3. Write a Slurm configuration file and save it as provisioning_params.json. In the file, specify basic Slurm configuration parameters to properly assign Slurm nodes to the SageMaker HyperPod cluster instance groups. For example, provisioning_params.json should look similar to the following, based on the HyperPod cluster instance groups configured in the previous steps 5a, 5b, and 5c.

        { "version": "1.0.0", "workload_manager": "slurm", "controller_group": "my-controller-group", "login_group": "my-login-group", "worker_groups": [ { "instance_group_name": "worker-group-1", "partition_name": "partition-1" } ] }
      4. Upload the scripts to your Amazon S3 bucket. If you don't have a bucket yet, create one and use a path in the following format: s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src. You can create the bucket using the Amazon S3 console; a command line sketch follows the note below.

        Note

        You must prefix sagemaker- to the S3 bucket path, because the IAM role for SageMaker HyperPod with AmazonSageMakerClusterInstanceRolePolicy only allows principals to access S3 buckets with this specific prefix.
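
        For reference, the following is a minimal command line sketch for creating the bucket and uploading the base lifecycle scripts together with provisioning_params.json. The bucket name sagemaker-my-hyperpod-bucket and the lifecycle-scripts/src prefix are placeholders chosen for this tutorial, not required names.

        # Placeholder bucket and prefix; replace them with your own values.
        # The bucket name must start with "sagemaker-" (see the note above).
        BUCKET=s3://sagemaker-my-hyperpod-bucket
        PREFIX=lifecycle-scripts/src

        # Create the bucket (an alternative to using the Amazon S3 console).
        aws s3 mb "$BUCKET"

        # Upload the base lifecycle scripts from the cloned repository.
        aws s3 sync awsome-distributed-training/1.architectures/5.sagemaker_hyperpods/LifecycleScripts/base-config "$BUCKET/$PREFIX"

        # Upload the Slurm provisioning configuration written in the previous substep.
        aws s3 cp provisioning_params.json "$BUCKET/$PREFIX/provisioning_params.json"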

    5. For Directory path to your on-create lifecycle script, enter the file name of the lifecycle script under S3 path to lifecycle script files.

    6. For IAM role, choose the IAM role you created using the AmazonSageMakerClusterInstanceRolePolicy from the section IAM role for SageMaker HyperPod.
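
      As a quick sanity check, you can list the managed policies attached to your role from the AWS CLI. The role name my-hyperpod-cluster-role below is a placeholder; substitute the role you created.

      # Placeholder role name; replace it with the IAM role you created for SageMaker HyperPod.
      # The output should include AmazonSageMakerClusterInstanceRolePolicy.
      aws iam list-attached-role-policies --role-name my-hyperpod-cluster-role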

    7. Under Advanced configuration, you can set up the following optional configurations.

      1. (Optional) For Threads per core, specify 1 to disable multi-threading or 2 to enable it. To find out which instance types support multi-threading, see the reference table of CPU cores and threads per CPU core per instance type in the Amazon Elastic Compute Cloud User Guide.

      2. (Optional) For Additional instance storage configs, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is /opt/sagemaker. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify that the EBS volume is mounted correctly by running the df -h command, as shown in the example below. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the Amazon EBS volumes section in the Amazon Elastic Block Store User Guide.
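
        For example, after you connect to a node, a quick check that the additional volume is mounted might look like the following. The /opt/sagemaker path is the default mount path mentioned above.

        # Run on a cluster node; the additional EBS volume should be mounted at /opt/sagemaker.
        df -h /opt/sagemaker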

  6. In Step 3: Advanced configuration, set up network settings for traffic within the cluster and for traffic in and out of the cluster. Select your own VPC if you already have one that you want to give SageMaker access to. If you don't have one but want to create a new VPC, follow the instructions at Create a VPC in the Amazon Virtual Private Cloud User Guide. You can leave the setting as No VPC to use the default SageMaker VPC.

  7. In Step 4: Review and create, review the configuration you set in steps 1 through 3, and then submit the cluster creation request.

  8. The new cluster should appear under Clusters in the main pane of the SageMaker HyperPod console. You can check its status under the Status column.

  9. After the status of the cluster changes to InService, you can start logging in to the cluster nodes. You can also check the status from the AWS CLI, as sketched below. To access the cluster nodes and start running ML workloads, see Run jobs on SageMaker HyperPod clusters.
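
    If you prefer to watch the status from the command line, the following is a minimal sketch using the AWS CLI. The cluster name my-hyperpod-cluster is a placeholder for the name you specified in Step 1.

    # Placeholder cluster name; replace it with the name you specified in Step 1.
    # The returned status should change to InService once the cluster is ready.
    aws sagemaker describe-cluster \
        --cluster-name my-hyperpod-cluster \
        --query ClusterStatus \
        --output text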

Delete the cluster and clean resources

After you have successfully tested creating a SageMaker HyperPod cluster, it continues running in the InService state until you delete the cluster. We recommend that you delete any clusters created using on-demand SageMaker instances when they are not in use to avoid incurring continued service charges based on on-demand pricing. In this tutorial, you created a cluster that consists of three instance groups running on on-demand instances, so make sure you delete the cluster by following the instructions at Delete a SageMaker HyperPod cluster.

However, if you have created a cluster with reserved compute capacity, the status of the cluster does not affect service billing.

To clean up the lifecycle scripts from the S3 bucket used for this tutorial, go to the S3 bucket you used during cluster creation and remove the files entirely.
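
If you prefer to clean up from the command line, the following is a minimal sketch. The cluster and bucket names are the placeholders used earlier in this tutorial; substitute your own values.

    # Delete the HyperPod cluster (placeholder name; use the name from Step 1).
    aws sagemaker delete-cluster --cluster-name my-hyperpod-cluster

    # Remove the lifecycle scripts uploaded for this tutorial (placeholder bucket and prefix).
    aws s3 rm s3://sagemaker-my-hyperpod-bucket/lifecycle-scripts/src --recursive

    # Optionally delete the bucket itself once it is empty.
    aws s3 rb s3://sagemaker-my-hyperpod-bucket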

If you have tested running any workloads on the cluster, check whether you uploaded any data or your jobs saved any artifacts to other S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all artifacts and data from that storage or file system.