This section describes how to set up the Amazon SageMaker HyperPod task governance EKS add-on. Setup includes granting permissions that allow you to set task prioritization, compute allocation for teams, idle compute sharing, and task preemption for teams.
If you run into issues during setup, see Troubleshoot for known solutions.
Kueue Settings
HyperPod task governance EKS add-on installs Kueue
EKS HyperPod task governance add-on version | Version of Kueue that is installed as part of the add-on | Version of kube-rbac-proxy that is installed as part of the add-on |
---|---|---|
v1.0.0 | v0.8.1 | v0.18.1 |
HyperPod task governance leverages Kueue for Kubernetes-native job queueing, scheduling, and quota management. Kueue is installed with the HyperPod task governance EKS add-on. When installed, HyperPod creates and modifies SageMaker AI-managed Kubernetes resources such as KueueManagerConfig, ClusterQueues, LocalQueues, WorkloadPriorityClasses, ResourceFlavors, and ValidatingAdmissionPolicies. While Kubernetes administrators have the flexibility to modify the state of these resources, any changes made to a SageMaker AI-managed resource may be updated and overwritten by the service.
The following information outlines the configuration settings utilized by the HyperPod task governance add-on for setting up Kueue.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
health:
  healthProbeBindAddress: :8081
metrics:
  bindAddress: :8080
  enableClusterQueueResources: true
webhook:
  port: 9443
manageJobsWithoutQueueName: false
leaderElection:
  leaderElect: true
  resourceName: c1f6bfd2.kueue.x-k8s.io
controller:
  groupKindConcurrency:
    Job.batch: 5
    Pod: 5
    Workload.kueue.x-k8s.io: 5
    LocalQueue.kueue.x-k8s.io: 1
    ClusterQueue.kueue.x-k8s.io: 1
    ResourceFlavor.kueue.x-k8s.io: 1
clientConnection:
  qps: 50
  burst: 100
integrations:
  frameworks:
  - "batch/job"
  - "kubeflow.org/mpijob"
  - "ray.io/rayjob"
  - "ray.io/raycluster"
  - "jobset.x-k8s.io/jobset"
  - "kubeflow.org/mxjob"
  - "kubeflow.org/paddlejob"
  - "kubeflow.org/pytorchjob"
  - "kubeflow.org/tfjob"
  - "kubeflow.org/xgboostjob"
  - "pod"
  podOptions:
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: [ kube-system, kueue-system ]
fairSharing:
  enable: true
  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
resources:
  excludeResourcePrefixes: []
For more information about each configuration entry, see Configuration in the Kueue documentation.
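To illustrate the effect of the podOptions namespaceSelector entry in the configuration above, the following Python sketch evaluates a NotIn match expression against a namespace's labels. It is a simplified model of Kubernetes label-selector semantics, not Kueue's implementation. The helper name `namespace_selected` and the namespace `team-a` are hypothetical.

```python
# Illustrative only: a simplified model of how a label selector with a NotIn
# expression decides whether a namespace is selected for the "pod" integration.

# Values copied from integrations.podOptions.namespaceSelector above.
MATCH_EXPRESSIONS = [
    {
        "key": "kubernetes.io/metadata.name",
        "operator": "NotIn",
        "values": ["kube-system", "kueue-system"],
    }
]


def namespace_selected(labels, expressions=MATCH_EXPRESSIONS):
    """Return True if the namespace labels satisfy every match expression."""
    for expr in expressions:
        value = labels.get(expr["key"])
        if expr["operator"] == "NotIn":
            # NotIn matches when the label is absent or its value is not listed.
            if value is not None and value in expr["values"]:
                return False
        elif expr["operator"] == "In":
            # In matches only when the label is present with a listed value.
            if value not in expr["values"]:
                return False
        else:
            raise ValueError(f"unsupported operator: {expr['operator']}")
    return True


# "team-a" is a hypothetical workload namespace used for illustration.
for name in ("kube-system", "kueue-system", "team-a"):
    labels = {"kubernetes.io/metadata.name": name}
    print(name, namespace_selected(labels))
```

Under this model, the kube-system and kueue-system namespaces are excluded, and every other namespace is selected.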
HyperPod task governance prerequisites
- If you have not already done so, see IAM users for cluster admin for the example minimum permission policy for HyperPod cluster administrators. This policy includes permissions to run the SageMaker HyperPod core APIs, manage SageMaker HyperPod clusters within your AWS account, and perform the tasks in SageMaker HyperPod operation.
- Your cluster must run Kubernetes version 1.30 or later. For instructions, see Update existing clusters to the new Kubernetes version.
- If you already have Kueue installed in your cluster, uninstall Kueue before installing the EKS add-on.
- A HyperPod node must already exist in the EKS cluster before you install the HyperPod task governance add-on.
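The Kubernetes version prerequisite can be checked programmatically. The following is a minimal Python sketch, assuming you already have the cluster version as a string such as "1.30" (for example, the `cluster.version` field returned by `aws eks describe-cluster`). The function names are hypothetical. The versions are compared as integer tuples because a plain string comparison would order "1.9" and "1.30" incorrectly.

```python
# Minimum Kubernetes version required by the prerequisites above.
MINIMUM_VERSION = (1, 30)


def parse_k8s_version(version):
    """Parse a version such as '1.30' or 'v1.30.2' into (major, minor)."""
    major, minor = version.strip().lstrip("v").split(".")[:2]
    return int(major), int(minor)


def meets_minimum(version, minimum=MINIMUM_VERSION):
    """Return True if the version is at least the required minimum."""
    return parse_k8s_version(version) >= minimum


# Example version strings, chosen for illustration.
for candidate in ("1.29", "1.30", "1.31"):
    print(candidate, meets_minimum(candidate))
```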
HyperPod task governance setup
The following describes how to set up HyperPod task governance using the SageMaker HyperPod console.
If you granted permissions to manage the Amazon CloudWatch Observability EKS add-on and view the HyperPod cluster dashboard through the SageMaker AI console during the HyperPod Amazon CloudWatch Observability EKS add-on setup, you already have all of the following permissions. If you have not set this up, use the sample policy below to grant permissions to manage the HyperPod task governance add-on and view the HyperPod cluster dashboard through the SageMaker AI console.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "eks:DescribeCluster",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
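If you maintain your own policy instead of using the sample as-is, it can help to confirm that it still covers every action listed above. The following Python sketch is one rough way to do that, under simplifying assumptions: it only inspects Allow statements and their Action entries, and it ignores Resource, Condition, NotAction, and explicit Deny statements, so it is not a substitute for IAM's own policy evaluation. The function names are hypothetical.

```python
import fnmatch

# Actions taken from the sample policy above.
REQUIRED_ACTIONS = [
    "eks:ListAddons",
    "eks:CreateAddon",
    "eks:UpdateAddon",
    "eks:DescribeAddon",
    "eks:DescribeAddonVersions",
    "sagemaker:DescribeCluster",
    "sagemaker:DescribeClusterNode",
    "sagemaker:ListClusterNodes",
    "sagemaker:ListClusters",
    "eks:DescribeCluster",
    "eks:AccessKubernetesApi",
]


def allowed_patterns(policy):
    """Collect lowercased action patterns from all Allow statements."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be in a list
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be a bare string
            actions = [actions]
        patterns.extend(action.lower() for action in actions)
    return patterns


def missing_actions(policy, required=REQUIRED_ACTIONS):
    """Return required actions not matched by any allowed pattern."""
    patterns = allowed_patterns(policy)
    return [
        action
        for action in required
        # Action names are compared case-insensitively; "*" acts as a wildcard.
        if not any(fnmatch.fnmatchcase(action.lower(), p) for p in patterns)
    ]
```

For example, a policy that allows only `eks:*` and `sagemaker:Describe*` would be reported as missing `sagemaker:ListClusterNodes` and `sagemaker:ListClusters`.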
Navigate to the Dashboard tab in the SageMaker HyperPod console to install the Amazon SageMaker HyperPod task governance add-on.
If the installation was successful, you can view the Policies tab in the SageMaker HyperPod console. You can also use the following example describe-addon AWS CLI command to check the add-on status. Replace region and cluster-name with your own values.
aws eks describe-addon --region region \
    --cluster-name cluster-name \
    --addon-name amazon-sagemaker-hyperpod-taskgovernance
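To check the result in a script, you can read the status field from the command's JSON output. The following Python sketch assumes the output has the usual describe-addon shape, with an `addon` object that contains a `status` field, and treats `ACTIVE` as a successful installation. The sample output is abbreviated with placeholder values, and the function names are hypothetical.

```python
import json

# Abbreviated, illustrative output; a real response contains more fields.
SAMPLE_OUTPUT = """
{
    "addon": {
        "addonName": "amazon-sagemaker-hyperpod-taskgovernance",
        "clusterName": "cluster-name",
        "status": "ACTIVE"
    }
}
"""


def addon_status(describe_output):
    """Return the add-on status from describe-addon JSON output."""
    return json.loads(describe_output)["addon"]["status"]


def addon_is_active(describe_output):
    """Return True if the add-on reports the ACTIVE status."""
    return addon_status(describe_output) == "ACTIVE"


print(addon_status(SAMPLE_OUTPUT), addon_is_active(SAMPLE_OUTPUT))
```

Any other status, such as `CREATING` or `CREATE_FAILED`, means that the add-on is not ready yet or that the installation needs attention.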