Run a training job on HyperPod Slurm


SageMaker HyperPod recipes support submitting a training job to a GPU or Trainium Slurm cluster. Before you submit the training job, update the cluster configuration using one of the following methods:

  • Modify slurm.yaml

  • Override it through the command line

After you've updated the cluster configuration, install the environment.

Configure the cluster

To submit a training job to a Slurm cluster, specify the Slurm-specific configuration. Modify slurm.yaml to configure the Slurm cluster. The following is an example of a Slurm cluster configuration. You can modify this file for your own training needs:

    job_name_prefix: 'sagemaker-'
    slurm_create_submission_file_only: False
    stderr_to_stdout: True
    srun_args:
      # - "--no-container-mount-home"
    slurm_docker_cfg:
      docker_args:
        # - "--runtime=nvidia"
      post_launch_commands:
    container_mounts:
      - "/fsx:/fsx"

  1. job_name_prefix: Specify a job name prefix to easily identify your submissions to the Slurm cluster.

  2. slurm_create_submission_file_only: Set to True to create the submission file without submitting the job. Use this as a dry run to help you debug.

  3. stderr_to_stdout: Set to True to redirect standard error (stderr) to standard output (stdout).

  4. srun_args: Customize additional srun configurations, such as excluding specific compute nodes. For more information, see the srun documentation.

  5. slurm_docker_cfg: The SageMaker HyperPod recipe launcher launches a Docker container to run your training job. You can specify additional Docker arguments within this parameter.

  6. container_mounts: Specify the volumes to mount into the recipe launcher's container so that your training jobs can access the files in those volumes.
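
As an illustration of how these parameters fit together, the following is a minimal sketch of a customized slurm.yaml for a dry run. It assumes the same keys as the example configuration. The job name prefix, the node name passed to --exclude, and the /data mount are hypothetical placeholders, so replace them with values from your own cluster.

```yaml
# Sketch only: the values marked "hypothetical" are illustrative, not defaults.
job_name_prefix: 'my-team-'                # hypothetical prefix for identifying your jobs
slurm_create_submission_file_only: True    # dry run setting, as described in item 2
stderr_to_stdout: True                     # redirect stderr to stdout
srun_args:
  - "--exclude=node-001"                   # hypothetical node name; --exclude is a standard srun option
slurm_docker_cfg:
  docker_args:
    - "--runtime=nvidia"                   # uncommented from the example configuration
  post_launch_commands:
container_mounts:
  - "/fsx:/fsx"
  - "/data:/data"                          # hypothetical additional volume
```

After you have checked the result of the dry run, set slurm_create_submission_file_only back to False before you submit the job.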
