Amazon SageMaker HyperPod release notes

See the following release notes to track the latest updates for Amazon SageMaker HyperPod.

SageMaker HyperPod release notes: June 20, 2024

New features

  • Added the ability to attach additional storage to SageMaker HyperPod cluster instances. You can configure the supplementary storage per instance group when you create or update a cluster, either through the SageMaker HyperPod console or through the CreateCluster and UpdateCluster APIs. The additional EBS volume is attached to each instance in the cluster and mounted at /opt/sagemaker. To learn how to use it in your SageMaker HyperPod cluster, see the updated SageMaker HyperPod documentation.

    You need to update the HyperPod cluster software to use this capability. After patching the HyperPod cluster software, you can use the capability on existing SageMaker HyperPod clusters created before June 20, 2024 by adding new instance groups. The capability is fully available on SageMaker HyperPod clusters created after June 20, 2024.
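As a rough illustration of the instance group configuration described above, the sketch below writes an instance group definition that requests an extra EBS volume through the InstanceStorageConfigs field. The helper name write_instance_groups, the group name, instance type, bucket, role ARN, script name, and volume size are all placeholder assumptions; replace them with your own values.

```shell
#!/usr/bin/env bash
# Sketch: request an additional EBS volume for one instance group.
# Every name, ARN, S3 path, and size below is a placeholder.

write_instance_groups() {
  # $1: output file, $2: size of the additional EBS volume in GB
  cat > "$1" <<EOF
[
  {
    "InstanceGroupName": "worker-group-1",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 2,
    "LifeCycleConfig": {
      "SourceS3Uri": "s3://your-bucket/lifecycle-scripts/",
      "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "arn:aws:iam::111122223333:role/your-hyperpod-role",
    "InstanceStorageConfigs": [
      { "EbsVolumeConfig": { "VolumeSizeInGB": $2 } }
    ]
  }
]
EOF
}

write_instance_groups instance-groups.json 500

# Pass the file when creating the cluster (uncomment to run in your account):
# aws sagemaker create-cluster \
#   --cluster-name your-cluster-name \
#   --instance-groups file://instance-groups.json
```

Once the instances are running, df -h /opt/sagemaker on a node should show the additional volume at the mount point named in the release note.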

Upgrade steps

  • Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.

    Important

    Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.

    aws sagemaker update-cluster-software --cluster-name your-cluster-name

    Note

    Use the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.
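The upgrade step above can be wrapped in a small script that starts the update and then polls the cluster status until it settles. This is a minimal sketch, not an official procedure: the function names and the polling interval are hypothetical, and it assumes that InService and Failed are the states to stop on and that the status leaves InService as soon as the update request is accepted.

```shell
#!/usr/bin/env bash
# Sketch: start a HyperPod software update and wait for it to finish.
# Back up the instance root volume before calling update_and_wait.

cluster_status() {
  # Print the ClusterStatus field of the cluster named in $1.
  aws sagemaker describe-cluster --cluster-name "$1" \
    --query 'ClusterStatus' --output text
}

update_and_wait() {
  # $1: cluster name, $2: seconds between polls (assumed default: 60)
  local name="$1" interval="${2:-60}" status
  aws sagemaker update-cluster-software --cluster-name "$name" || return 1
  while true; do
    status="$(cluster_status "$name")" || return 1
    echo "status: $status"
    case "$status" in
      InService) return 0 ;;  # update finished
      Failed)    return 1 ;;  # update did not complete
    esac
    sleep "$interval"
  done
}

# Usage:
# update_and_wait your-cluster-name
```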

SageMaker HyperPod release notes: April 24, 2024

Bug fixes

  • Fixed a bug with the ThreadsPerCore parameter in the ClusterInstanceGroupSpecification API. With the fix, the CreateCluster and UpdateCluster APIs correctly accept and apply the value that you pass through ThreadsPerCore. The fix applies to HyperPod clusters created after April 24, 2024. If this bug affected you and you want the fix applied, you need to create a new cluster. Make sure that you back up your work and restore it on the new cluster by following the instructions at Use the backup script provided by SageMaker HyperPod.
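One way to check that a ThreadsPerCore setting took effect is to read the value that lscpu reports on a cluster node. The helper below is a hypothetical sketch and assumes the standard "Thread(s) per core" line in lscpu output.

```shell
#!/usr/bin/env bash
# Sketch: extract the threads-per-core value from lscpu output.

threads_per_core() {
  # Reads lscpu output on stdin and prints the "Thread(s) per core" value.
  awk -F: '/^Thread\(s\) per core/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Usage on a cluster node:
# lscpu | threads_per_core
```

If the instance group was created with ThreadsPerCore set to 1, the printed value is expected to be 1.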

SageMaker HyperPod release notes: March 27, 2024

HyperPod software patch

The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.

  • In this release of the HyperPod DLAMI, Slurm is built with the REST service (slurmrestd) and with JSON, YAML, and JWT support.

  • Upgraded Slurm to v23.11.3
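For orientation, a request to the Slurm REST service with JWT authentication might look like the sketch below. It rests on several assumptions that the release note does not state: that slurmrestd is running and reachable on localhost port 6820, that JWT authentication is configured in your Slurm setup, and that the v0.0.40 OpenAPI version is enabled. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: call slurmrestd with a JWT obtained from scontrol.

slurm_rest_url() {
  # $1: host:port of slurmrestd, $2: OpenAPI version, $3: endpoint
  echo "http://$1/slurm/$2/$3"
}

slurm_rest_get() {
  # Expects SLURM_JWT in the environment (see the usage lines below).
  curl -s \
    -H "X-SLURM-USER-NAME: ${USER}" \
    -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
    "$(slurm_rest_url "$1" "$2" "$3")"
}

# Usage on a node where slurmrestd is assumed to listen on port 6820:
# export $(scontrol token)
# slurm_rest_get localhost:6820 v0.0.40 ping
```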

Upgrade steps

  • Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.

    Important

    Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.

    aws sagemaker update-cluster-software --cluster-name your-cluster-name

    Note

    Use the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.

Improvements

  • Increased auto-resume service timeout to 60 minutes.

  • Improved instance replacement process to not restart the Slurm controller.

  • Improved error messages from running lifecycle scripts, such as download errors and instance health check errors on instance start-up.

Bug fixes

  • Fixed a bug with chrony service that caused an issue with time synchronization.

  • Fixed a bug with parsing slurm.conf.

  • Fixed an issue with NVIDIA go-dcgm library.

SageMaker HyperPod release notes: March 14, 2024

HyperPod software patch

The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.

  • Upgraded Slurm to v23.11.1

  • Added OpenPMIx v4.2.6 for enabling Slurm with PMIx.

  • Built upon the AWS Deep Learning Base GPU AMI (Ubuntu 20.04) released on 2023-10-26

  • A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI

    • Slurm: v23.11.1

    • OpenPMIx: v4.2.6

    • Munge: v0.5.15

    • aws-neuronx-dkms: v2.*

    • aws-neuronx-collectives: v2.*

    • aws-neuronx-runtime-lib: v2.*

    • aws-neuronx-tools: v2.*

    • SageMaker HyperPod software packages to support features such as cluster health check and auto-resume

Upgrade steps

  • Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.

    Important

    Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.

    aws sagemaker update-cluster-software --cluster-name your-cluster-name

    Note

    Use the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.

SageMaker HyperPod release notes: February 15, 2024

New features

  • Added a new UpdateClusterSoftware API for SageMaker HyperPod security patching. When security patches become available, we recommend that you update the existing SageMaker HyperPod clusters in your account by running aws sagemaker update-cluster-software --cluster-name your-cluster-name. To stay informed about future security patches, check this Amazon SageMaker HyperPod release notes page regularly. To learn how the UpdateClusterSoftware API works, see Update the SageMaker HyperPod platform software of a cluster.
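To apply a patch to every cluster in an account, the command above can be combined with aws sagemaker list-clusters. The following is a sketch under stated assumptions: the function names and the dry-run default are hypothetical, and it targets only the Region configured for the AWS CLI. By default it prints the commands instead of running them, so you can review the list and back up each cluster first.

```shell
#!/usr/bin/env bash
# Sketch: update the software of every HyperPod cluster in the current Region.

list_cluster_names() {
  # Print one cluster name per line.
  aws sagemaker list-clusters \
    --query 'ClusterSummaries[].ClusterName' --output text | tr '\t' '\n'
}

update_all_clusters() {
  # $1: "run" to call the API; any other value only prints the commands.
  local mode="${1:-dry-run}" name
  list_cluster_names | while read -r name; do
    [ -n "$name" ] || continue
    if [ "$mode" = "run" ]; then
      aws sagemaker update-cluster-software --cluster-name "$name" </dev/null
    else
      echo "aws sagemaker update-cluster-software --cluster-name $name"
    fi
  done
}

# Usage:
# update_all_clusters        # print the commands only
# update_all_clusters run    # call UpdateClusterSoftware for each cluster
```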

SageMaker HyperPod release notes: November 29, 2023

New features

  • Launched Amazon SageMaker HyperPod at AWS re:Invent 2023.

HyperPod software patch

The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.

  • Built upon the AWS Deep Learning Base GPU AMI (Ubuntu 20.04) released on 2023-10-18

  • A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI

    • Slurm: v23.02.3

    • Munge: v0.5.15

    • aws-neuronx-dkms: v2.*

    • aws-neuronx-collectives: v2.*

    • aws-neuronx-runtime-lib: v2.*

    • aws-neuronx-tools: v2.*

    • SageMaker HyperPod software packages to support features such as cluster health check and auto-resume