Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS

The following topics provide procedures and examples for accessing compute nodes and running ML workloads on provisioned SageMaker HyperPod clusters orchestrated by Amazon EKS. The ways you can run ML workloads depend on how you have set up the environment on your HyperPod cluster.

Tip

For hands-on experience and guidance on how to set up and use a SageMaker HyperPod cluster orchestrated by Amazon EKS, we recommend taking the Amazon EKS Support in SageMaker HyperPod workshop.

Data scientists can train foundation models using the EKS cluster set as the orchestrator for the SageMaker HyperPod cluster. They use the SageMaker HyperPod CLI and native kubectl commands to find available SageMaker HyperPod clusters, submit training jobs (Pods), and manage their workloads. The SageMaker HyperPod CLI supports job submission through a training job schema file, and provides capabilities for listing, describing, canceling, and executing jobs. Scientists can also use Kubeflow Training Operator, Kueue (a Kubernetes tool for job queuing), and SageMaker-managed MLflow to manage ML experiments and training runs.
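
As an illustration of the kubectl path described above, the following is a minimal sketch of a script that generates a Kubeflow Training Operator PyTorchJob manifest labeled for a Kueue queue. The function name, job name, image URI, queue name, and GPU count are hypothetical placeholders, not values from this guide, and the sketch assumes that the Training Operator and Kueue are already installed on the cluster. A real job typically needs more fields, such as a command, volumes, and a namespace.

```python
import json


def build_pytorch_job(name, image, workers, queue_name, gpus_per_worker=1):
    """Return a PyTorchJob manifest as a plain dict (hypothetical helper)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {
            "name": name,
            # Kueue reads this label to decide which LocalQueue admits the job.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            "pytorchReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    # The Training Operator expects the
                                    # training container to be named "pytorch".
                                    "name": "pytorch",
                                    "image": image,
                                    "resources": {
                                        "limits": {
                                            "nvidia.com/gpu": gpus_per_worker
                                        }
                                    },
                                }
                            ]
                        }
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # All argument values below are placeholders; replace them with your own.
    manifest = build_pytorch_job(
        name="example-training-job",
        image="<your-training-image-uri>",
        workers=2,
        queue_name="<your-local-queue>",
    )
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, you could redirect the script's output to a file such as job.json and submit it with kubectl apply -f job.json.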
