Prerequisites for using SageMaker HyperPod
The following sections walk you through prerequisites before getting started with SageMaker HyperPod.
SageMaker HyperPod quotas
You can create SageMaker HyperPod clusters up to the quotas for cluster usage granted in your AWS account.
Important
To learn more about SageMaker HyperPod pricing, see SageMaker HyperPod pricing and Amazon SageMaker AI Pricing.
View Amazon SageMaker HyperPod quotas using the AWS Management Console
Look up the default and applied values of the quotas for cluster usage (also referred to as limits) that SageMaker HyperPod uses.
1. Open the Service Quotas console.
2. In the left navigation pane, choose AWS services.
3. From the AWS services list, search for and select Amazon SageMaker AI.
4. In the Service quotas list, you can see the service quota name, applied value (if it's available), AWS default quota, and whether the quota value is adjustable.
5. In the search bar, type cluster usage. This filters the list to show the quotas for cluster usage, their applied values, and their default values.
Request an Amazon SageMaker HyperPod quota increase using the AWS Management Console
Increase your quotas at the account or resource level.
1. To increase the quota of instances for cluster usage, select the quota that you want to increase.
2. If the quota is adjustable, you can request a quota increase at either the account level or the resource level, based on the value listed in the Adjustability column.
3. For Increase quota value, enter the new value. The new value must be greater than the current value.
4. Choose Request.
5. To view any pending or recently resolved requests in the console, navigate to the Request history tab from the service's details page, or choose Dashboard from the navigation pane. For pending requests, choose the status of the request to open the request receipt. The initial status of a request is Pending. After the status changes to Quota requested, you see the case number with AWS Support. Choose the case number to open the ticket for your request.
To learn more about requesting a quota increase in general, see Requesting a Quota Increase in the AWS Service Quotas User Guide.
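The console lookup above can also be done programmatically through the Service Quotas ListServiceQuotas API. The sketch below filters entries of the shape that API returns for the sagemaker service code; the quota names and values here are made-up examples for illustration, not real defaults from any account.

```python
# Illustrative sketch: filter Service Quotas entries for HyperPod cluster usage.
# The dicts below mimic the shape of entries returned by the Service Quotas
# ListServiceQuotas API; names and values are placeholders, not real defaults.

def find_cluster_usage_quotas(quotas):
    """Return quotas whose name mentions 'cluster usage' (case-insensitive)."""
    return [q for q in quotas if "cluster usage" in q["QuotaName"].lower()]

sample_quotas = [
    {"QuotaName": "ml.p4d.24xlarge for cluster usage", "Value": 0.0, "Adjustable": True},
    {"QuotaName": "ml.t3.medium for cluster usage", "Value": 1.0, "Adjustable": True},
    {"QuotaName": "Maximum number of training jobs", "Value": 100.0, "Adjustable": True},
]

matches = find_cluster_usage_quotas(sample_quotas)
for q in matches:
    print(f"{q['QuotaName']}: applied value {q['Value']}")
```

In practice you would feed this filter the `Quotas` list returned by a real ListServiceQuotas call instead of the sample data.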
Setting up SageMaker HyperPod with a custom Amazon VPC
To set up a SageMaker HyperPod cluster with a custom Amazon VPC, review the following prerequisites.
Note
VPC configuration is mandatory for Amazon EKS orchestration. For Slurm orchestration, VPC setup is optional.
- Validate Elastic Network Interface (ENI) capacity in your AWS account before creating a SageMaker HyperPod cluster with a custom VPC. The ENI limit is controlled by Amazon EC2 and varies by AWS Region. SageMaker HyperPod cannot automatically request quota increases.
  To verify your current ENI quota:
  1. Open the Service Quotas console.
  2. In the Manage quotas section, use the AWS services drop-down list to search for VPC.
  3. Choose Amazon Virtual Private Cloud (Amazon VPC) to view its quotas.
  4. Look for the service quota Network interfaces per Region (quota code L-DF5E4CA3).
  If your current ENI limit is insufficient for your SageMaker HyperPod cluster needs, request a quota increase. Ensuring adequate ENI capacity beforehand helps prevent cluster deployment failures.
- When using a custom VPC to connect a SageMaker HyperPod cluster with AWS resources, provide the VPC name, ID, AWS Region, subnet IDs, and security group IDs during cluster creation.
  Note
  When your Amazon VPC and subnets support IPv6 in the VpcConfig of the cluster, or at the instance group level using the OverrideVpcConfig attribute of ClusterInstanceGroupSpecification, network communications differ based on the cluster orchestration platform:
  - Slurm-orchestrated clusters automatically configure nodes with dual IPv6 and IPv4 addresses, allowing immediate IPv6 network communications. No additional configuration is required beyond the VpcConfig IPv6 settings.
  - In EKS-orchestrated clusters, nodes receive dual-stack addressing, but pods can only use IPv6 when the Amazon EKS cluster is explicitly IPv6-enabled. You must create a new IPv6 Amazon EKS cluster; existing IPv4 Amazon EKS clusters cannot be converted to IPv6. For information about deploying an IPv6 Amazon EKS cluster, see Amazon EKS IPv6 Cluster Deployment.
  Additional resources for IPv6 configuration:
  - For information about adding IPv6 support to your VPC, see IPv6 Support for VPC.
  - For information about creating a new IPv6-compatible VPC, see Amazon VPC Creation Guide.
  - To configure SageMaker HyperPod with a custom Amazon VPC, see Custom Amazon VPC setup for SageMaker HyperPod.
- Make sure that all resources are deployed in the same AWS Region as the SageMaker HyperPod cluster. Configure security group rules to allow inter-resource communication within the VPC. For example, when creating a VPC in us-west-2, provision subnets across one or more Availability Zones (such as us-west-2a or us-west-2b), and create a security group allowing intra-group traffic.
  Note
  SageMaker HyperPod supports multi-Availability Zone deployment. For more information, see Setting up SageMaker HyperPod clusters across multiple AZs.
- Establish Amazon Simple Storage Service (Amazon S3) connectivity for VPC-deployed SageMaker HyperPod instance groups by creating a VPC endpoint. Without internet access, instance groups cannot store or retrieve lifecycle scripts, training data, or model artifacts. We recommend that you create a custom IAM policy restricting Amazon S3 bucket access to the private VPC. For more information, see Endpoints for Amazon S3 in the AWS PrivateLink Guide.
- For HyperPod clusters using Elastic Fabric Adapter (EFA)-enabled instances, configure the security group to allow all inbound and outbound traffic to and from the security group itself. In particular, avoid using 0.0.0.0/0 for outbound rules, as this may cause EFA health check failures. For more information about EFA security group preparation guidelines, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
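Putting the networking prerequisites above together, the following sketch assembles the VpcConfig parameter supplied at cluster creation and a self-referencing security group rule of the kind EFA needs. The subnet and security group IDs are placeholders; the VpcConfig shape (Subnets, SecurityGroupIds) follows the SageMaker API, and the ingress-rule dict follows the EC2 AuthorizeSecurityGroupIngress request shape.

```python
# Sketch: assemble the VpcConfig for cluster creation and a self-referencing
# security group rule suitable for EFA traffic. All IDs are placeholders.

def build_vpc_config(subnet_ids, security_group_ids):
    """Build the VpcConfig dict passed when creating a HyperPod cluster."""
    if not subnet_ids or not security_group_ids:
        raise ValueError("At least one subnet and one security group are required")
    return {"Subnets": list(subnet_ids), "SecurityGroupIds": list(security_group_ids)}

def self_referencing_rule(security_group_id):
    """EC2 ingress rule allowing all traffic from the security group itself,
    as recommended for EFA-enabled instances (avoid 0.0.0.0/0 here)."""
    return {
        "IpProtocol": "-1",  # all protocols
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }

vpc_config = build_vpc_config(["subnet-0123456789abcdef0"], ["sg-0123456789abcdef0"])
rule = self_referencing_rule("sg-0123456789abcdef0")
```

A matching egress rule referencing the same group would complete the EFA-friendly configuration described above.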
Setting up SageMaker HyperPod clusters across multiple AZs
You can configure your SageMaker HyperPod clusters across multiple Availability Zones (AZs) to improve reliability and availability.
Note
Elastic Fabric Adapter (EFA) traffic cannot cross AZs or VPCs. This does not apply to normal IP traffic from the ENA device of an EFA interface. For more information, see EFA limitations.
- Default behavior
  HyperPod deploys all cluster instances in a single Availability Zone. The VPC configuration determines the deployment AZ:
  - For Slurm-orchestrated clusters, VPC configuration is optional. When no VPC configuration is provided, HyperPod defaults to one subnet from the platform VPC.
  - For EKS-orchestrated clusters, VPC configuration is required.
  - For both Slurm and EKS orchestrators, when VpcConfig is provided, HyperPod selects a subnet from the provided VpcConfig's subnet list. All instance groups inherit the subnet's AZ.
  Note
  Once you create a cluster, you cannot modify its VpcConfig settings.
  To learn more about configuring VPCs for HyperPod clusters, see the preceding section, Setting up SageMaker HyperPod with a custom Amazon VPC.
- Multi-AZ configuration
  You can set up your HyperPod cluster across multiple AZs when creating a cluster or when adding a new instance group to an existing cluster. To configure multi-AZ deployments, you can override the default VPC settings of the cluster by specifying different subnets and security groups, potentially across different Availability Zones, for individual instance groups within your cluster.
  SageMaker HyperPod API users can use the OverrideVpcConfig property within the ClusterInstanceGroupSpecification when working with the CreateCluster or UpdateCluster APIs.
  The OverrideVpcConfig field:
  - Cannot be modified after the instance group is created.
  - Is optional. If not specified, the cluster-level VpcConfig is used as the default.
  - For Slurm-orchestrated clusters, can only be specified when a cluster-level VpcConfig is provided. If no VpcConfig is specified at the cluster level, OverrideVpcConfig cannot be used for any instance group.
  - Contains two required fields:
    - Subnets: accepts between 1 and 16 subnet IDs.
    - SecurityGroupIds: accepts between 1 and 5 security group IDs.
- For more information about creating or updating a SageMaker HyperPod cluster using the SageMaker HyperPod console UI or the AWS CLI, see the following:
  - Slurm orchestration: See Operating Slurm-orchestrated HyperPod clusters.
  - EKS orchestration: See Operating EKS-orchestrated HyperPod clusters.
Note
When running workloads across multiple AZs, be aware that network communication between AZs introduces additional latency. Consider this impact when designing latency-sensitive applications.
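The OverrideVpcConfig constraints described above can be sketched as a small validation helper. This is illustrative only: the field names follow the ClusterInstanceGroupSpecification shape discussed in this section, and the instance group name, instance type, and IDs are placeholders.

```python
# Sketch: validate an OverrideVpcConfig per the limits stated above
# (1-16 subnets, 1-5 security groups; Slurm requires a cluster-level
# VpcConfig). All IDs and names below are placeholders.

def validate_override_vpc_config(override, cluster_vpc_config=None, orchestrator="Slurm"):
    if orchestrator == "Slurm" and cluster_vpc_config is None:
        raise ValueError("OverrideVpcConfig requires a cluster-level VpcConfig on Slurm")
    subnets = override.get("Subnets", [])
    security_groups = override.get("SecurityGroupIds", [])
    if not 1 <= len(subnets) <= 16:
        raise ValueError("Subnets must contain between 1 and 16 subnet IDs")
    if not 1 <= len(security_groups) <= 5:
        raise ValueError("SecurityGroupIds must contain between 1 and 5 security group IDs")
    return override

# A hypothetical instance group placing workers in a different AZ's subnet.
instance_group = {
    "InstanceGroupName": "worker-group-az2",  # placeholder name
    "InstanceType": "ml.t3.medium",
    "InstanceCount": 2,
    "OverrideVpcConfig": validate_override_vpc_config(
        {"Subnets": ["subnet-aaa111"], "SecurityGroupIds": ["sg-bbb222"]},
        cluster_vpc_config={"Subnets": ["subnet-ccc333"], "SecurityGroupIds": ["sg-ddd444"]},
    ),
}
```

A dict like `instance_group` would go in the InstanceGroups list of a CreateCluster or UpdateCluster request; remember that once the group is created, its OverrideVpcConfig cannot be modified.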
Setting up AWS Systems Manager and Run As for cluster user access control
SageMaker HyperPod DLAMI comes with AWS Systems Manager (SSM) set up for managing user access to HyperPod cluster nodes.
Note
Granting users access to HyperPod cluster nodes allows them to install and operate user-managed software on the nodes. Ensure that you maintain the principle of least-privilege permissions for users.
Enabling Run As in your AWS account
As an AWS account admin or a cloud administrator, you can manage access to SageMaker HyperPod clusters at an IAM role or user level by using the Run As feature in SSM. With this feature, you can start each SSM session using the OS user associated with the IAM role or user.
To enable Run As in your AWS account, follow the steps in Turn on Run As support for Linux and macOS managed nodes. If you already created OS users in your cluster, make sure that you associate them with IAM roles or users by tagging them as guided in Option 2 of step 5 under To turn on Run As support for Linux and macOS managed nodes.
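To associate an OS user with an IAM principal as described above, Systems Manager reads the SSMSessionRunAs tag on the IAM user or role. The sketch below builds the tag list you would pass to the IAM TagUser or TagRole API; the OS user name is a placeholder.

```python
# Sketch: tag payload associating an IAM principal with an OS user for
# SSM Run As. "SSMSessionRunAs" is the tag key Systems Manager reads;
# "hyperpod-user" is a placeholder OS user name.

def run_as_tag(os_user):
    return [{"Key": "SSMSessionRunAs", "Value": os_user}]

tags = run_as_tag("hyperpod-user")
# Pass `tags` to iam.tag_user(UserName=..., Tags=tags) or
# iam.tag_role(RoleName=..., Tags=tags).
```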
(Optional) Setting up SageMaker HyperPod with Amazon FSx for Lustre
To start using SageMaker HyperPod and mapping data paths between the cluster and your FSx for Lustre file system, select one of the AWS Regions supported by SageMaker HyperPod. After choosing the AWS Region you prefer, you should also determine which Availability Zone (AZ) to use.
If you use SageMaker HyperPod compute nodes in AZs different from the AZs where your FSx for Lustre file system is set up within the same AWS Region, there might be communication and network overhead. We recommend that you use the same physical AZ as the one for the SageMaker HyperPod service account to avoid cross-AZ traffic between SageMaker HyperPod clusters and your FSx for Lustre file system. Also, make sure that your FSx for Lustre file system is configured with your VPC. If you want to use Amazon FSx as the main file system for storage, you must configure SageMaker HyperPod clusters with your VPC.