Prerequisites for using SageMaker HyperPod - Amazon SageMaker

Prerequisites for using SageMaker HyperPod

The following sections walk you through prerequisites before getting started with SageMaker HyperPod.

SageMaker HyperPod quotas

You can create SageMaker HyperPod clusters up to the limits set by the quotas for cluster usage in your AWS account.

Important

To learn more about SageMaker HyperPod pricing, see SageMaker HyperPod pricing and Amazon SageMaker Pricing.

View Amazon SageMaker HyperPod quotas using the AWS Management Console

Look up the default and applied values of the quotas (also referred to as limits) for cluster usage. These are the quotas that apply to SageMaker HyperPod.

  1. Open the Service Quotas console.

  2. In the left navigation pane, choose AWS services.

  3. From the AWS services list, search for and select Amazon SageMaker.

  4. In the Service quotas list, you can see the service quota name, applied value (if it's available), AWS default quota, and whether the quota value is adjustable.

  5. In the search bar, type cluster usage. This shows quotas for cluster usage, applied quotas, and the default quotas.
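
The same lookup can be scripted. The following is a minimal sketch, not an official procedure. It assumes a boto3 Service Quotas client (for example, `boto3.client("service-quotas")`) and assumes that filtering quota names on the phrase "cluster usage" matches what the console search returns. The function name and the returned keys are illustrative.

```python
def list_cluster_usage_quotas(client, service_code="sagemaker"):
    """Return quotas whose name mentions 'cluster usage'.

    `client` is assumed to be a boto3 Service Quotas client. It is passed
    in as a parameter so the function can also be run against a stub.
    """
    matches = []
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode=service_code):
        for quota in page.get("Quotas", []):
            # Same text filter as typing "cluster usage" in the console search bar.
            if "cluster usage" in quota.get("QuotaName", "").lower():
                matches.append(
                    {
                        "name": quota["QuotaName"],
                        "code": quota["QuotaCode"],
                        "applied_value": quota.get("Value"),
                        "adjustable": quota.get("Adjustable"),
                    }
                )
    return matches
```

To run it against your account, pass a real client: `list_cluster_usage_quotas(boto3.client("service-quotas"))`. The `list_service_quotas` operation reports applied values. AWS default values come from the separate `list_aws_default_service_quotas` operation.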

Request a quota increase for Amazon SageMaker HyperPod using the AWS Management Console

Increase your quotas at the account or resource level.

  1. To increase the quota of instances for cluster usage, select the quota that you want to increase.

  2. If the quota is adjustable, you can request a quota increase at either the account level or resource level based on the value listed in the Adjustability column.

  3. For Increase quota value, enter the new value. The new value must be greater than the current value.

  4. Choose Request.

  5. To view any pending or recently resolved requests in the console, navigate to the Request history tab from the service's details page, or choose Dashboard from the navigation pane. For pending requests, choose the status of the request to open the request receipt. The initial status of a request is Pending. After the status changes to Quota requested, you see the case number with AWS Support. Choose the case number to open the ticket for your request.

To learn more about requesting a quota increase in general, see Requesting a Quota Increase in the AWS Service Quotas User Guide.
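
As a rough programmatic counterpart to the console steps above, the sketch below submits an increase request through the Service Quotas API. It assumes a boto3 Service Quotas client and a quota code that you looked up beforehand. The function name and the returned keys are illustrative, and the check on the desired value mirrors the console rule that the new value must be greater than the current one.

```python
def request_cluster_quota_increase(client, quota_code, current_value,
                                   desired_value, service_code="sagemaker"):
    """Request a quota increase and return the request ID, status, and case ID.

    `client` is assumed to be a boto3 Service Quotas client, for example
    boto3.client("service-quotas").
    """
    # The new value must be greater than the current value.
    if desired_value <= current_value:
        raise ValueError(
            f"desired value {desired_value} must be greater than "
            f"current value {current_value}"
        )
    response = client.request_service_quota_increase(
        ServiceCode=service_code,
        QuotaCode=quota_code,
        DesiredValue=float(desired_value),
    )
    requested = response["RequestedQuota"]
    return {
        "request_id": requested.get("Id"),
        "status": requested.get("Status"),
        # The case ID may be absent until AWS Support opens a case.
        "case_id": requested.get("CaseId"),
    }
```

The quota code passed as `quota_code` is specific to each instance type, so look it up first rather than hard-coding a value.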

Setting up SageMaker HyperPod with Amazon VPC

To set up a SageMaker HyperPod cluster with your Amazon VPC, check the following items.

Note

Setting up your own VPC is required for orchestrating with Amazon EKS, and you must use the same VPC for the HyperPod cluster. For orchestrating with Slurm, setting up your own VPC is optional.

  • If you want to use your own VPC to connect SageMaker HyperPod with AWS resources in your VPC, you need to provide the VPC name, ID, AWS Region, subnet ID, and security group ID when you create a SageMaker HyperPod cluster. If you want to create a new VPC, see Create a default VPC or Create a VPC in the Amazon Virtual Private Cloud User Guide.

  • Create all your resources in the same AWS Region and Availability Zone, and configure security group rules to allow connections between the resources in your VPC. For example, assume that you create a VPC in us-west-2. You should create a subnet in this VPC in Availability Zone us-west-2a, and create a security group that allows all incoming (inbound) traffic from inside the security group and all outbound traffic.

  • You also need to ensure that your VPC has a connection to Amazon Simple Storage Service (Amazon S3). If you configure a VPC, SageMaker HyperPod instance groups don't have access to the internet, and therefore can't connect to Amazon S3 for accessing or storing files such as lifecycle scripts, training data, and model artifacts. To establish a connection with Amazon S3 while using a VPC, you should create a VPC endpoint. By creating a VPC endpoint, you can allow the SageMaker HyperPod instance groups to access the Amazon S3 buckets within the same VPC. We recommend that you also create a custom policy that only allows requests from your private VPC to access your Amazon S3 buckets. For more information, see Endpoints for Amazon S3 in the AWS PrivateLink Guide.

  • If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. To learn more, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
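
The items above can be sketched with boto3-style calls. This is an illustration under stated assumptions, not a complete setup. It assumes an EC2 client (for example, `boto3.client("ec2")`), a gateway-type endpoint for Amazon S3, and a bucket policy that denies requests not arriving through that endpoint. All helper names, and the IDs you pass in, are hypothetical.

```python
import json


def self_referencing_rule(security_group_id):
    """IpPermissions entry: all protocols, with the group itself as the peer."""
    return [{"IpProtocol": "-1",
             "UserIdGroupPairs": [{"GroupId": security_group_id}]}]


def allow_traffic_within_group(ec2_client, security_group_id):
    """Allow all inbound and outbound traffic to and from the group itself."""
    permissions = self_referencing_rule(security_group_id)
    ec2_client.authorize_security_group_ingress(
        GroupId=security_group_id, IpPermissions=permissions)
    ec2_client.authorize_security_group_egress(
        GroupId=security_group_id, IpPermissions=permissions)


def s3_gateway_endpoint_params(vpc_id, region, route_table_ids):
    """Keyword arguments for ec2_client.create_vpc_endpoint(**params)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def vpc_only_bucket_policy(bucket_name, vpc_endpoint_id):
    """Bucket policy JSON that denies access from outside the VPC endpoint.

    Assumption: denying everything that does not come through the endpoint
    is acceptable for this bucket. Note that this also blocks access from
    the console and from any other network path.
    """
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowOnlyFromVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket_name}",
                         f"arn:aws:s3:::{bucket_name}/*"],
            "Condition": {
                "StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}
            },
        }],
    }
    return json.dumps(policy, indent=2)
```

Building the request parameters in small functions keeps them easy to inspect before anything is sent to AWS. Pass the endpoint parameters to `create_vpc_endpoint`, and apply the policy JSON with the S3 client's `put_bucket_policy`.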

Setting up AWS Systems Manager and Run As for cluster user access control

The SageMaker HyperPod DLAMI comes with AWS Systems Manager (SSM) out of the box to help you manage access to your SageMaker HyperPod cluster instance groups. This section describes how to create operating system (OS) users in your SageMaker HyperPod clusters and associate them with IAM users and roles. This is useful for authenticating SSM sessions using the credentials of the OS user account.

Enabling Run As in your AWS account

As an AWS account admin or a cloud administrator, you can manage access to SageMaker HyperPod clusters at an IAM role or user level by using the Run As feature in SSM. With this feature, you can start each SSM session using the OS user associated with the IAM role or user.

To enable Run As in your AWS account, follow the steps in Turn on Run As support for Linux and macOS managed nodes. If you already created OS users in your cluster, make sure that you associate them with IAM roles or users by tagging them as guided in Option 2 of step 5 under To turn on Run As support for Linux and macOS managed nodes.
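
The tagging step can be sketched as follows. Run As support in Session Manager reads the `SSMSessionRunAs` tag on the IAM role or user and takes its value as the OS user name. The sketch assumes a boto3 IAM client (for example, `boto3.client("iam")`). The helper names are illustrative, and the whitespace check is a convenience added here, not an AWS rule.

```python
RUN_AS_TAG_KEY = "SSMSessionRunAs"


def run_as_tags(os_user_name):
    """Build the IAM tag list that maps an IAM identity to an OS user."""
    # Local sanity check only (an assumption of this sketch).
    if not os_user_name or any(ch.isspace() for ch in os_user_name):
        raise ValueError("OS user name must be non-empty without whitespace")
    return [{"Key": RUN_AS_TAG_KEY, "Value": os_user_name}]


def associate_role_with_os_user(iam_client, role_name, os_user_name):
    """Tag an IAM role so that its SSM sessions start as `os_user_name`."""
    iam_client.tag_role(RoleName=role_name, Tags=run_as_tags(os_user_name))


def associate_user_with_os_user(iam_client, iam_user_name, os_user_name):
    """Tag an IAM user so that its SSM sessions start as `os_user_name`."""
    iam_client.tag_user(UserName=iam_user_name, Tags=run_as_tags(os_user_name))
```

The OS user named in the tag must already exist on the cluster nodes. Tagging the IAM identity does not create the OS user.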

(Slurm) Setting up Linux users using an Amazon FSx file system attached to SageMaker HyperPod as a shared space

To complete setting up cluster users to access a HyperPod cluster through SSM and a shared space, you need to configure a script for adding users while preparing lifecycle configuration scripts for creating a HyperPod cluster. In the GitHub repository introduced in the section Start with base lifecycle scripts provided by HyperPod, there is a script named add_users.sh that reads user data from shared_users.txt. Note that you'll need to upload the two files as part of preparing and uploading lifecycle scripts to an Amazon S3 bucket, which you'll learn about in the section Tutorial for getting started with SageMaker HyperPod and the section Set up a multi-user environment through the Amazon FSx shared space.
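
This page does not describe the layout of shared_users.txt. The sketch below assumes one comma-separated line per user in the form `username,uid,home_directory`, with home directories under a shared `/fsx` mount. Treat both the format and the mount path as assumptions, and check them against the add_users.sh script that you actually use. The function name is illustrative.

```python
def render_shared_users(users, home_root="/fsx"):
    """Render (username, uid) pairs as lines of 'username,uid,home'.

    Assumed format: one user per line, fields separated by commas, home
    directory placed under `home_root`. Verify against add_users.sh.
    """
    lines = []
    seen_names, seen_uids = set(), set()
    for name, uid in users:
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid user name: {name!r}")
        if name in seen_names or uid in seen_uids:
            raise ValueError(f"duplicate user name or UID: {name!r}, {uid!r}")
        seen_names.add(name)
        seen_uids.add(uid)
        lines.append(f"{name},{int(uid)},{home_root}/{name}")
    # End with a newline so a line-by-line shell reader sees the last entry.
    return "\n".join(lines) + "\n"
```

For example, `render_shared_users([("alice", 2001), ("bob", 2002)])` produces two lines, `alice,2001,/fsx/alice` and `bob,2002,/fsx/bob`. Write these to shared_users.txt before uploading the lifecycle scripts.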

(Optional) Setting up SageMaker HyperPod with Amazon FSx for Lustre

To start using SageMaker HyperPod and mapping data paths between the cluster and your FSx for Lustre file system, select one of the AWS Regions supported by SageMaker HyperPod. After choosing the AWS Region you prefer, you should also determine which Availability Zone (AZ) to use. If you use SageMaker HyperPod compute nodes in AZs different from the AZs where your FSx for Lustre file system is set up within the same AWS Region, there might be communication and network overhead. We recommend that you use the same physical AZ as the one for the SageMaker HyperPod service account to avoid any cross-AZ traffic between SageMaker HyperPod clusters and your FSx for Lustre file system. Also, make sure that you have configured the file system with your VPC. If you want to use Amazon FSx as the main file system for storage, you must configure SageMaker HyperPod clusters with a VPC.
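
Because AZ names such as us-west-2a can map to different physical zones in different AWS accounts, a same-AZ check is more reliable when it compares AZ IDs (for example, usw2-az1). The following minimal sketch assumes a boto3 EC2 client and that you already know the subnet used by the cluster and the subnet of the FSx for Lustre file system. The function names are illustrative.

```python
def subnet_az_ids(ec2_client, subnet_ids):
    """Map each subnet ID to its physical Availability Zone ID."""
    # De-duplicate so that the same subnet is not requested twice.
    unique_ids = sorted(set(subnet_ids))
    response = ec2_client.describe_subnets(SubnetIds=unique_ids)
    return {s["SubnetId"]: s["AvailabilityZoneId"] for s in response["Subnets"]}


def shares_physical_az(ec2_client, cluster_subnet_id, fsx_subnet_id):
    """Return True if both subnets are in the same physical AZ."""
    az_ids = subnet_az_ids(ec2_client, [cluster_subnet_id, fsx_subnet_id])
    return az_ids[cluster_subnet_id] == az_ids[fsx_subnet_id]
```

A result of False indicates that traffic between the cluster and the file system would cross AZs, which is the overhead described above.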