Prerequisites for using SageMaker HyperPod
The following sections walk you through prerequisites before getting started with SageMaker HyperPod.
SageMaker HyperPod quotas
You can create SageMaker HyperPod clusters within the quotas for cluster usage in your AWS account.
Important
To learn more about SageMaker HyperPod pricing, see SageMaker HyperPod pricing and Amazon SageMaker Pricing.
View Amazon SageMaker HyperPod quotas using the AWS Management Console
Look up the default and applied values of a quota, also referred to as a limit, for cluster usage, which is used for SageMaker HyperPod.
-
Open the Service Quotas console.
-
In the left navigation pane, choose AWS services.
-
From the AWS services list, search for and select Amazon SageMaker.
-
In the Service quotas list, you can see the service quota name, applied value (if it's available), AWS default quota, and whether the quota value is adjustable.
-
In the search bar, type cluster usage. This shows quotas for cluster usage, applied quotas, and the default quotas.
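The same lookup can be sketched programmatically. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; it uses the Service Quotas `list_service_quotas` API with the `sagemaker` service code. The helper names `filter_quotas` and `list_sagemaker_quotas` are hypothetical and not part of any AWS SDK.

```python
from __future__ import annotations

from typing import Iterable


def filter_quotas(quotas: Iterable[dict], keyword: str = "cluster usage") -> list[dict]:
    """Keep quotas whose QuotaName contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [q for q in quotas if needle in q.get("QuotaName", "").lower()]


def list_sagemaker_quotas(region_name: str) -> list[dict]:
    """Fetch the applied Amazon SageMaker quotas for one AWS Region.

    Requires boto3 and AWS credentials. boto3 is imported lazily so that
    filter_quotas can be used on its own.
    """
    import boto3

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas: list[dict] = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas


# Example usage (makes real AWS calls):
# for quota in filter_quotas(list_sagemaker_quotas("us-west-2")):
#     print(quota["QuotaName"], quota["Value"], quota["Adjustable"])
```

Filtering on the client side mirrors typing cluster usage in the console search bar. Each returned entry also carries a `QuotaCode`, which is needed to request an increase.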
Request Amazon SageMaker HyperPod quota increases using the AWS Management Console
Increase your quotas at the account or resource level.
-
To increase the quota of instances for cluster usage, select the quota that you want to increase.
-
If the quota is adjustable, you can request a quota increase at either the account level or resource level based on the value listed in the Adjustability column.
-
For Increase quota value, enter the new value. The new value must be greater than the current value.
-
Choose Request.
-
To view any pending or recently resolved requests in the console, navigate to the Request history tab from the service's details page, or choose Dashboard from the navigation pane. For pending requests, choose the status of the request to open the request receipt. The initial status of a request is Pending. After the status changes to Quota requested, you see the case number with AWS Support. Choose the case number to open the ticket for your request.
To learn more about requesting a quota increase in general, see Requesting a Quota Increase in the AWS Service Quotas User Guide.
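A quota increase can also be requested through the Service Quotas API. The following is a minimal sketch, assuming a boto3 `service-quotas` client is passed in; the function names and the quota code in the usage comment are hypothetical placeholders. It enforces the same rule as the console: the new value must be greater than the current value.

```python
from __future__ import annotations


def build_increase_request(quota_code: str, current_value: float, desired_value: float) -> dict:
    """Return keyword arguments for request_service_quota_increase.

    Raises ValueError if the desired value does not exceed the current one.
    """
    if desired_value <= current_value:
        raise ValueError("desired_value must be greater than current_value")
    return {
        "ServiceCode": "sagemaker",
        "QuotaCode": quota_code,
        "DesiredValue": float(desired_value),
    }


def request_increase(client, quota_code: str, current_value: float, desired_value: float) -> str:
    """Submit the request and return its initial status string.

    `client` is expected to be boto3.client("service-quotas").
    """
    kwargs = build_increase_request(quota_code, current_value, desired_value)
    response = client.request_service_quota_increase(**kwargs)
    return response["RequestedQuota"]["Status"]


# Example usage (makes a real AWS call; "L-XXXXXXXX" is a placeholder quota code):
# import boto3
# client = boto3.client("service-quotas", region_name="us-west-2")
# print(request_increase(client, "L-XXXXXXXX", current_value=0, desired_value=4))
```

Separating the validation from the API call keeps the check testable without AWS access.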
Setting up SageMaker HyperPod with Amazon VPC
To set up a SageMaker HyperPod cluster with your Amazon VPC, check the following items.
Note
Setting up your own VPC is required for orchestrating with Amazon EKS, and you must use the same VPC for the HyperPod cluster. For orchestrating with Slurm, setting up your own VPC is optional.
-
If you want to use your own VPC to connect SageMaker HyperPod with AWS resources in your VPC, you need to provide the VPC name, ID, AWS Region, subnet ID, and security group ID when you create a SageMaker HyperPod cluster. If you want to create a new VPC, see Create a default VPC or Create a VPC in the Amazon Virtual Private Cloud User Guide.
-
Create all your resources in the same AWS Region and Availability Zone, and configure security group rules to allow connections between the resources in your VPC. For example, assume that you create a VPC in us-west-2. You should create a subnet in this VPC in Availability Zone us-west-2a, and create a security group that allows all incoming (inbound) traffic from inside the security group and all outbound traffic.
-
You also need to ensure that your VPC has a connection to Amazon Simple Storage Service (Amazon S3). If you configure a VPC, SageMaker HyperPod instance groups don't have access to the internet, and therefore can't connect to Amazon S3 for accessing or storing files such as lifecycle scripts, training data, and model artifacts. To establish a connection with Amazon S3 while using a VPC, you should create a VPC endpoint. By creating a VPC endpoint, you can allow the SageMaker HyperPod instance groups to access the Amazon S3 buckets within the same VPC. We recommend that you also create a custom policy that only allows requests from your private VPC to access your Amazon S3 buckets. For more information, see Endpoints for Amazon S3 in the AWS PrivateLink Guide.
-
If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. To learn more, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
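The checklist above can be sketched as request parameters for the Amazon EC2 API. This is a minimal sketch, not a complete network setup: it assumes an existing VPC, security group, and route table, and every function name and resource ID below is a hypothetical placeholder. The security group rule follows the EFA requirement (all traffic to and from the group itself), the endpoint is an Amazon S3 gateway endpoint, and the bucket policy uses the `aws:SourceVpce` condition key.

```python
from __future__ import annotations


def self_referencing_permissions(security_group_id: str) -> list[dict]:
    """IpPermissions that allow all protocols to and from members of the same group."""
    return [{"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": security_group_id}]}]


def s3_gateway_endpoint_params(vpc_id: str, region: str, route_table_ids: list[str]) -> dict:
    """Keyword arguments for ec2.create_vpc_endpoint (S3 gateway endpoint)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def vpce_only_bucket_policy(bucket: str, vpc_endpoint_id: str) -> dict:
    """Bucket policy that denies S3 requests not arriving through the endpoint.

    Caution: a Deny like this also blocks access from outside the VPC,
    including the console, so review it before attaching it to a bucket.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessFromVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}},
            }
        ],
    }


def apply_network_setup(ec2, security_group_id, vpc_id, region, route_table_ids):
    """Apply the rules with a boto3 EC2 client; returns the create_vpc_endpoint response."""
    permissions = self_referencing_permissions(security_group_id)
    ec2.authorize_security_group_ingress(GroupId=security_group_id, IpPermissions=permissions)
    ec2.authorize_security_group_egress(GroupId=security_group_id, IpPermissions=permissions)
    return ec2.create_vpc_endpoint(**s3_gateway_endpoint_params(vpc_id, region, route_table_ids))
```

The policy dictionary can be serialized with `json.dumps` and attached to a bucket with the S3 `put_bucket_policy` call once the endpoint ID is known.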
Setting up AWS Systems Manager and Run As for cluster user access control
SageMaker HyperPod DLAMI comes with AWS Systems Manager (SSM).
Enabling Run As in your AWS account
As an AWS account admin or a cloud administrator, you can manage access to SageMaker HyperPod clusters at an IAM role or user level by using the Run As feature in SSM. With this feature, you can start each SSM session using the OS user associated with the IAM role or user.
To enable Run As in your AWS account, follow the steps in Turn on Run As support for Linux and macOS managed nodes. If you already created OS users in your cluster, make sure that you associate them with IAM roles or users by tagging them as guided in Option 2 of step 5 under To turn on Run As support for Linux and macOS managed nodes.
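The tagging step can be sketched with the IAM API. This is a minimal sketch, assuming the OS user already exists on the cluster nodes and that Run As support is turned on; it uses the `SSMSessionRunAs` tag key that Session Manager reads. The function names, role name, and user name are hypothetical.

```python
from __future__ import annotations

# Tag key that Session Manager reads to pick the OS user for a session.
RUN_AS_TAG_KEY = "SSMSessionRunAs"


def run_as_tags(os_user: str) -> list[dict]:
    """Build the IAM tag list that maps an IAM role or user to an OS user."""
    if not os_user or any(ch.isspace() for ch in os_user):
        raise ValueError("os_user must be a non-empty name without whitespace")
    return [{"Key": RUN_AS_TAG_KEY, "Value": os_user}]


def tag_role_for_run_as(iam_client, role_name: str, os_user: str) -> None:
    """Tag an IAM role; `iam_client` is expected to be boto3.client("iam")."""
    iam_client.tag_role(RoleName=role_name, Tags=run_as_tags(os_user))


# Example usage (makes a real AWS call; names are placeholders):
# import boto3
# tag_role_for_run_as(boto3.client("iam"), "MyClusterUserRole", "user1")
```

For an IAM user instead of a role, the same tag list can be passed to the IAM `tag_user` call.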
(Slurm) Setting up Linux users using an Amazon FSx file system attached to SageMaker HyperPod as a shared space
To complete setting up cluster users to access a HyperPod cluster through SSM and a shared space, you need to configure a script for adding users while preparing lifecycle configuration scripts for creating a HyperPod cluster. In the GitHub repository introduced in the section Start with base lifecycle scripts provided by HyperPod, there is a script named add_users.sh that reads user data from shared_users.txt. Note that you'll need to upload the two files as part of preparing and uploading lifecycle scripts to an Amazon S3 bucket, which you'll learn in the section Tutorial for getting started with SageMaker HyperPod and the section Set up a multi-user environment through the Amazon FSx shared space.
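As an illustration, the user file could be generated rather than edited by hand. The following sketch assumes that each line of shared_users.txt has the form username,uid,home directory, with home directories under the shared /fsx mount. That line format is an assumption not stated in this section, so verify it against the add_users.sh script in the repository before relying on it. The function name `render_shared_users` is hypothetical.

```python
from __future__ import annotations


def render_shared_users(users: list[tuple[str, int]], home_root: str = "/fsx") -> str:
    """Render lines of the ASSUMED form 'username,uid,home_root/username'.

    Rejects duplicate user names or UIDs so the generated file stays consistent.
    """
    seen_names: set[str] = set()
    seen_uids: set[int] = set()
    lines = []
    for name, uid in users:
        if name in seen_names or uid in seen_uids:
            raise ValueError(f"duplicate user name or UID: {name},{uid}")
        seen_names.add(name)
        seen_uids.add(uid)
        lines.append(f"{name},{uid},{home_root.rstrip('/')}/{name}")
    return "\n".join(lines) + "\n"


# Example usage: write the file next to add_users.sh before uploading to Amazon S3.
# with open("shared_users.txt", "w") as f:
#     f.write(render_shared_users([("user1", 2001), ("user2", 2002)]))
```

Keeping UIDs fixed in one file means each user gets the same UID on every node, which matters when home directories live on a shared file system.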
(Optional) Setting up SageMaker HyperPod with Amazon FSx for Lustre
To start using SageMaker HyperPod and mapping data paths between the cluster and your FSx for Lustre file system, select one of the AWS Regions supported by SageMaker HyperPod. After choosing the AWS Region, also determine which Availability Zone (AZ) to use. If your SageMaker HyperPod compute nodes are in a different AZ from your FSx for Lustre file system within the same AWS Region, there might be communication and network overhead. We recommend that you use the same physical AZ as the one for the SageMaker HyperPod service account to avoid any cross-AZ traffic between SageMaker HyperPod clusters and your FSx for Lustre file system. Also, make sure that you have configured the file system with your VPC. If you want to use Amazon FSx as the main file system for storage, you must configure SageMaker HyperPod clusters with a VPC.
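A physical AZ is identified by its AZ ID (for example, usw2-az1), because an AZ name such as us-west-2a can map to different physical zones in different AWS accounts. The following minimal sketch checks whether a set of subnets, for example the cluster subnet and the FSx for Lustre subnet, share one AZ ID. It assumes subnet records shaped like the Amazon EC2 `describe_subnets` response; the function names and subnet IDs are hypothetical.

```python
from __future__ import annotations


def physical_az_ids(subnets: list[dict]) -> dict[str, str]:
    """Map each subnet ID to its AZ ID (the physical zone identifier)."""
    return {s["SubnetId"]: s["AvailabilityZoneId"] for s in subnets}


def share_physical_az(subnets: list[dict]) -> bool:
    """Return True when all given subnets are in the same physical AZ."""
    return len({s["AvailabilityZoneId"] for s in subnets}) == 1


def fetch_subnets(ec2, subnet_ids: list[str]) -> list[dict]:
    """Look up subnets with a boto3 EC2 client (makes a real AWS call)."""
    return ec2.describe_subnets(SubnetIds=subnet_ids)["Subnets"]


# Example usage (subnet IDs are placeholders):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-west-2")
# subnets = fetch_subnets(ec2, ["subnet-cluster-placeholder", "subnet-fsx-placeholder"])
# print(physical_az_ids(subnets), share_physical_az(subnets))
```

Comparing AZ IDs rather than AZ names avoids a false match when the subnets belong to different accounts.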