SageMaker HyperPod AMI releases for Amazon EKS
The following release notes track the latest updates for Amazon SageMaker HyperPod AMI releases for Amazon EKS orchestration. Each release note includes a summarized list of packages pre-installed or pre-configured in the SageMaker HyperPod DLAMIs for Amazon EKS support. Each DLAMI is built on Amazon Linux 2 (AL2) and supports a specific Kubernetes version. For HyperPod DLAMI releases for Slurm orchestration, see SageMaker HyperPod AMI releases for Slurm. For information about Amazon SageMaker HyperPod feature releases, see Amazon SageMaker HyperPod release notes.
SageMaker HyperPod AMI releases for Amazon EKS: August 25, 2025
SageMaker HyperPod DLAMI for Amazon EKS support
This release includes the following updates:
SageMaker HyperPod AMI releases for Amazon EKS: August 12, 2025
The AMI includes the following:
-
Supported AWS Service: Amazon EC2
-
Operating System: Amazon Linux 2023
-
Compute Architecture: ARM64
-
Latest available version is installed for the following packages:
-
Linux Kernel: 6.12
-
FSx Lustre
-
Docker
-
AWS CLI v2 at
/usr/bin/aws
-
NVIDIA DCGM
-
Nvidia container toolkit:
-
Version command: nvidia-container-cli -V
-
Nvidia-docker2:
-
Version command: nvidia-docker version
-
Nvidia-IMEX: v570.172.08-1
-
NVIDIA Driver: 570.158.01
-
NVIDIA CUDA 12.4, 12.5, 12.6, 12.8 stack:
-
CUDA, NCCL and cuDNN installation directories:
/usr/local/cuda-xx.x/
-
Example:
/usr/local/cuda-12.8/
-
Compiled NCCL Version:
-
For CUDA directory of 12.4, compiled NCCL Version 2.22.3+CUDA12.4
-
For CUDA directory of 12.5, compiled NCCL Version 2.22.3+CUDA12.5
-
For CUDA directory of 12.6, compiled NCCL Version 2.24.3+CUDA12.6
-
For CUDA directory of 12.8, compiled NCCL Version 2.27.5+CUDA12.8
-
Default CUDA: 12.8
-
Path /usr/local/cuda points to CUDA 12.8
-
Updated the following environment variables:
-
LD_LIBRARY_PATH to include /usr/local/cuda-12.8/lib:/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8:/usr/local/cuda-12.8/targets/sbsa-linux/lib:/usr/local/cuda-12.8/nvvm/lib64:/usr/local/cuda-12.8/extras/CUPTI/lib64
-
PATH to include /usr/local/cuda-12.8/bin/:/usr/local/cuda-12.8/include/
-
For any different CUDA version, update LD_LIBRARY_PATH accordingly.
-
EFA installer: 1.42.0
-
Nvidia GDRCopy: 2.5.1
-
AWS OFI NCCL plugin comes with EFA installer
-
Paths /opt/amazon/ofi-nccl/lib and /opt/amazon/ofi-nccl/efa are added to LD_LIBRARY_PATH.
-
AWS CLI v2 at
/usr/local/bin/aws
-
EBS volume type: gp3
-
Python:
/usr/bin/python3.9
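The note above about updating LD_LIBRARY_PATH for a different CUDA version can be sketched as a small shell helper. This is a minimal sketch, not part of the AMI: the function names cuda_lib_path, cuda_bin_path, and use_cuda are hypothetical, and it assumes that every installed CUDA version follows the same directory layout that the list above gives for CUDA 12.8. Only the bin directory is added to PATH here.

```shell
# Hypothetical helpers; the layout mirrors the CUDA 12.8 entries listed above
# and is assumed to hold for the other installed versions.
cuda_lib_path() {
  root="/usr/local/cuda-$1"
  printf '%s\n' "$root/lib:$root/lib64:$root:$root/targets/sbsa-linux/lib:$root/nvvm/lib64:$root/extras/CUPTI/lib64"
}

cuda_bin_path() {
  printf '%s\n' "/usr/local/cuda-$1/bin"
}

# Prepend the paths for the requested version to the current environment.
# Fails without changing anything if the version directory does not exist.
use_cuda() {
  version=$1
  if [ ! -d "/usr/local/cuda-$version" ]; then
    echo "CUDA $version is not installed under /usr/local" >&2
    return 1
  fi
  LD_LIBRARY_PATH="$(cuda_lib_path "$version")${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
  PATH="$(cuda_bin_path "$version"):$PATH"
  export LD_LIBRARY_PATH PATH
}
```

For example, running use_cuda 12.6 in a lifecycle script or job wrapper would place the CUDA 12.6 directories ahead of the 12.8 defaults for that shell session.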
SageMaker HyperPod AMI releases for Amazon EKS: August 6, 2025
SageMaker HyperPod DLAMI for Amazon EKS support
The AMIs include the following updates:
Important
-
Deep Learning Base OSS Nvidia Driver AMI (Amazon Linux 2) Version 70.3
-
Deep Learning Base Proprietary Nvidia Driver AMI (Amazon Linux 2) Version 68.4
-
Latest CUDA 12.8 support
-
Upgraded NVIDIA driver from 570.158.01 to 570.172.08 to fix CVEs listed in the NVIDIA Security Bulletin for July
SageMaker HyperPod AMI releases for Amazon EKS: July 31, 2025
Amazon SageMaker HyperPod now supports a new AMI for Amazon EKS clusters that updates the base operating system to Amazon Linux 2023. This release provides several improvements over Amazon Linux 2 (AL2). HyperPod releases new AMIs regularly, and we recommend that you run all of your HyperPod clusters on the latest and most secure versions of AMIs to address vulnerabilities and phase out outdated software and libraries.
Key upgrades
-
Operating System: Amazon Linux 2023 (updated from Amazon Linux 2, or AL2)
-
Package Manager: DNF is the default package management tool, replacing YUM used in AL2
-
Networking Service: systemd-networkd manages network interfaces, replacing ISC dhclient used in AL2
-
Linux Kernel: Version 6.1, updated from the kernel used in AL2
-
Glibc: Version 2.34, updated from the version in AL2
-
GCC: Version 11.5.0, updated from the version in AL2
-
NFS: Version 1:2.6.1, updated from version 1:1.3.4 in AL2
-
NVIDIA Driver: Version 570.172.08, a newer driver version
-
Python: Version 3.9, replacing Python 2.7 used in AL2
-
NVME: Version 1.11.1, a newer version of the NVMe driver
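Because the operating system and package manager both change, lifecycle scripts that must run on AL2 and AL2023 nodes may need to branch on the OS. The following is a minimal sketch under stated assumptions: is_al2023 and pkg_manager are hypothetical helper names, and the check assumes that the os-release file on AL2023 reports ID as amzn and VERSION_ID as 2023.

```shell
# Hypothetical helper: succeed when an os-release file describes Amazon Linux 2023.
# The file path defaults to /etc/os-release.
is_al2023() {
  os_release=${1:-/etc/os-release}
  [ -r "$os_release" ] || return 1
  # Source the file in a subshell so its variables do not leak into the caller.
  (
    . "$os_release"
    [ "${ID:-}" = "amzn" ] && [ "${VERSION_ID:-}" = "2023" ]
  )
}

# Print the package manager named in the key upgrades: dnf on AL2023, yum otherwise.
pkg_manager() {
  if is_al2023 "${1:-/etc/os-release}"; then
    echo dnf
  else
    echo yum
  fi
}
```

A lifecycle script could then call "$(pkg_manager)" install -y followed by a package name instead of hard-coding either tool.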
Before you upgrade
There are a few important things to know before upgrading. With AL2023, several packages have been added, upgraded or removed compared to AL2. We strongly recommend that you test your applications with AL2023 before upgrading your clusters. For a comprehensive list of all package changes in AL2023, see Package changes in Amazon Linux 2023.
The following are some of the significant changes between AL2 and AL2023:
-
Python 3.10: The most significant update, apart from the operating system, is the Python version upgrade. After upgrading, clusters have Python 3.10 as default. While some Python 3.8 distributed training workloads might be compatible with Python 3.10, we strongly recommend that you test your specific workloads separately. If migration to Python 3.10 proves challenging but you still want to upgrade your cluster for other new features, you can install an older Python version by using the command yum install python-xx.x with lifecycle scripts before running any workloads. Ensure you test both your existing lifecycle scripts and application code for compatibility.
-
NVIDIA runtime enforcement: AL2023 strictly enforces the NVIDIA container runtime requirements, causing containers with hard-coded NVIDIA environment variables (such as NVIDIA_VISIBLE_DEVICES: "all") to fail on CPU-only nodes, whereas AL2 ignored these settings when no GPU drivers were present. You can override the enforcement by setting NVIDIA_VISIBLE_DEVICES: "void" in your pod specification or by using CPU-only images.
-
cgroup v2: AL2023 features the next generation of unified control group hierarchy (cgroup v2). cgroup v2 is used for container runtimes and is also used by systemd. While AL2023 still includes code that can make the system run using cgroup v1, this isn't a recommended configuration.
-
Amazon VPC CNI and eksctl versions: AL2023 also requires your Amazon VPC CNI version to be 1.16.2 or greater and your eksctl version to be 0.176.0 or greater.
-
EFA on FSx for Lustre: You can now use EFA on FSx for Lustre, which enables you to achieve application performance comparable to on-premises AI/ML or HPC (high performance computing) clusters, while benefiting from the scalability, flexibility and elasticity of cloud computing.
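The minimum versions for the Amazon VPC CNI (1.16.2) and eksctl (0.176.0) listed above can be checked with a small version comparison before upgrading. This is a sketch, not an official tool: version_ge is a hypothetical helper, and it assumes plain dotted numeric versions, optionally prefixed with v and followed by a build suffix after a hyphen.

```shell
# Hypothetical helper: succeed when dotted version $1 is greater than or equal to $2.
# A leading "v" and anything after the first "-" (such as a build suffix) are ignored.
version_ge() {
  have=${1#v}
  want=${2#v}
  have=${have%%-*}
  want=${want%%-*}
  while [ -n "$have" ] || [ -n "$want" ]; do
    h=${have%%.*}
    w=${want%%.*}
    # A missing field counts as 0, so 1.16 compares as 1.16.0.
    h=${h:-0}
    w=${w:-0}
    [ "$h" -gt "$w" ] && return 0
    [ "$h" -lt "$w" ] && return 1
    # Drop the field just compared; stop when no dot remains.
    case $have in *.*) have=${have#*.} ;; *) have= ;; esac
    case $want in *.*) want=${want#*.} ;; *) want= ;; esac
  done
  return 0
}
```

As a usage example, version_ge "$(eksctl version)" 0.176.0 || echo "upgrade eksctl first" would flag an outdated client, assuming eksctl version prints a bare version string. The running VPC CNI version would need to be read from the aws-node daemonset image tag in the same way.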
Additionally, upgrading to AL2023 requires at minimum version
1.0.643.0_1.0.192.0
of Health Monitoring Agent. Complete the
following procedure to update the Health Monitoring Agent:
-
If you use HyperPod lifecycle scripts from the GitHub repository awsome-distributed-training, make sure to pull the latest version. Earlier versions are not compatible with AL2023. The new lifecycle script ensures that containerd uses the additional mounted storage for pulling in container images in AL2023.
-
Pull in the latest version of the HyperPod CLI git repository.
-
Update dependencies with the following command:
helm dependencies update helm_chart/HyperPodHelmChart
-
As mentioned in step 4 of the HyperPodHelmChart README, run the following command to upgrade the version of dependencies running on the cluster:
helm upgrade dependencies helm_chart/HyperPodHelmChart --namespace kube-system
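The two Helm steps in this procedure can be wrapped in a short script so that they can be reviewed before they touch the cluster. This is a minimal sketch: run and upgrade_hyperpod_dependencies are hypothetical names, DRY_RUN is an illustrative variable rather than a Helm feature, and the script assumes it is run from the root of the HyperPod CLI repository.

```shell
# Hypothetical wrapper: with DRY_RUN=1, print each command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Refresh the chart dependencies, then upgrade the release named "dependencies"
# in the kube-system namespace, stopping if the first step fails.
upgrade_hyperpod_dependencies() {
  chart=${1:-helm_chart/HyperPodHelmChart}
  run helm dependencies update "$chart" || return 1
  run helm upgrade dependencies "$chart" --namespace kube-system
}
```

Running DRY_RUN=1 first prints the exact commands. Unsetting it and calling upgrade_hyperpod_dependencies again applies them.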
Workloads that have been tested on upgraded EKS clusters
The following are some use cases where the upgrade has been tested:
-
Backwards compatibility: Popular distributed training jobs involving PyTorch should be backwards compatible on the new AMI. However, since your workloads may depend on specific Python or Linux libraries, we recommend first testing on a smaller scale or subset of nodes before upgrading your larger clusters.
-
Accelerator testing: Jobs across various instance types, utilizing both NVIDIA accelerators (for the P and G instance families) and AWS Neuron accelerators (for Trn instances) have been tested.
How to upgrade your AMI and associated workloads
You can upgrade to the new AMI using one of the following methods:
-
Use the create-cluster API to create a new cluster with the latest AMI.
-
Use the update-cluster-software API to upgrade your existing cluster. Note that this option re-runs any lifecycle scripts.
The cluster is unavailable during the update process. We recommend planning for this downtime and restarting the training workload from an existing checkpoint after the upgrade completes. As a best practice, we recommend that you perform testing on a smaller cluster before upgrading your larger clusters.
If the update command fails, first identify the cause of the failure. For lifecycle script failures, make the necessary corrections to your scripts and retry. For any other issues that cannot be resolved, contact AWS Support.
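Since the cluster is unavailable while the update runs, it can help to script the update and wait for the cluster to return to service before restarting workloads. The following sketch uses the AWS CLI commands aws sagemaker update-cluster-software and aws sagemaker describe-cluster. The function names are hypothetical, and the sketch assumes that the ClusterStatus field reports InService on success and Failed on failure, treating every other status as still in progress.

```shell
# Hypothetical helper: poll the cluster status until it is InService or Failed.
# $1 is the cluster name, $2 the polling interval in seconds (default 60).
wait_for_cluster() {
  cluster_name=$1
  interval=${2:-60}
  while :; do
    status=$(aws sagemaker describe-cluster --cluster-name "$cluster_name" \
      --query 'ClusterStatus' --output text) || return 1
    case $status in
      InService)
        echo "cluster $cluster_name is InService"
        return 0
        ;;
      Failed)
        echo "cluster $cluster_name reported Failed" >&2
        return 1
        ;;
      *)
        # Any other status is treated as "still updating" in this sketch.
        sleep "$interval"
        ;;
    esac
  done
}

# Start the software update, then block until the cluster settles.
update_cluster_and_wait() {
  aws sagemaker update-cluster-software --cluster-name "$1" >/dev/null || return 1
  wait_for_cluster "$1" "${2:-60}"
}
```

A job launcher could call update_cluster_and_wait with the cluster name and resume training from the latest checkpoint only when it returns successfully.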
Troubleshooting
Use the following section to help with troubleshooting any issues you encounter when upgrading to AL2023.
How do I fix errors such as "nvml error: driver not loaded: unknown" on CPU-only cluster nodes?
If containers that worked on CPU-only AL2 Amazon EKS nodes now fail on AL2023, your container image may have hard-coded NVIDIA environment variables. You can check for hard-coded environment variables with the following command:
docker inspect image:tag | grep -i nvidia
AL2023 strictly enforces these requirements whereas AL2 was more lenient on CPU-only nodes. One solution is to override the AL2023 enforcement by setting certain NVIDIA environment variables in your Amazon EKS pod specification, as shown in the following example:
containers:
  - name: your-container
    image: your-image:tag
    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: "void"
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: ""
Another alternative is to use CPU-only container images (such as pytorch/pytorch:latest-cpu) or build custom images without NVIDIA dependencies.
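The docker inspect check above matches any occurrence of "nvidia" in the image metadata, including labels. A narrower variant, sketched below, lists only the NVIDIA_* environment variables baked into the image. The function name image_nvidia_env is hypothetical, and the sketch assumes the image is available to the local Docker daemon.

```shell
# Hypothetical helper: print the NVIDIA_* variables from an image's configured
# environment. The exit status is non-zero when none are found.
image_nvidia_env() {
  docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' "$1" \
    | grep '^NVIDIA_'
}
```

For example, image_nvidia_env your-image:tag prints lines such as NVIDIA_VISIBLE_DEVICES=all when the image hard-codes that variable, which indicates that the pod specification override or a CPU-only image is needed.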
SageMaker HyperPod AMI releases for Amazon EKS: July 15, 2025
SageMaker HyperPod DLAMI for Amazon EKS support
The AMIs include the following updates:
SageMaker HyperPod AMI releases for Amazon EKS: June 09, 2025
SageMaker HyperPod DLAMI for Amazon EKS support
SageMaker HyperPod AMI releases for Amazon EKS: May 22, 2025
AMI general updates
SageMaker HyperPod DLAMI for Amazon EKS support
SageMaker HyperPod AMI releases for Amazon EKS: May 07, 2025
SageMaker HyperPod AMI releases for Amazon EKS: April 28, 2025
Improvements for K8s
-
Upgraded NVIDIA driver from version 550.144.03 to 550.163.01. This upgrade is to address Common Vulnerabilities and Exposures (CVEs) present in the NVIDIA GPU Display Security Bulletin for April 2025.
SageMaker HyperPod DLAMI for Amazon EKS support
SageMaker HyperPod AMI releases for Amazon EKS: April 18, 2025
AMI general updates
-
New SageMaker HyperPod AMI for Amazon EKS 1.32.1.
SageMaker HyperPod DLAMI for Amazon EKS support
The AMIs include the following:
SageMaker HyperPod AMI releases for Amazon EKS: February 18, 2025
Improvements for K8s
-
Upgraded Nvidia container toolkit from version 1.17.3 to version 1.17.4.
-
Fixed the issue where customers were unable to connect to nodes after a reboot.
-
Upgraded Elastic Fabric Adapter (EFA) version from 1.37.0 to 1.38.0.
-
The EFA installer now includes the AWS OFI NCCL plugin, which is located in the /opt/amazon/ofi-nccl directory instead of the original /opt/aws-ofi-nccl/ path. If you need to update your LD_LIBRARY_PATH environment variable, make sure to modify the path to point to the new /opt/amazon/ofi-nccl location for the OFI NCCL plugin.
-
Removed the emacs package from these DLAMIs. You can install emacs from GNU Emacs.
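For job scripts that still carry the old OFI NCCL plugin location in LD_LIBRARY_PATH, the path change described above can be applied with a simple substitution. This is a minimal sketch: migrate_ofi_nccl_path is a hypothetical helper, and it assumes that only the /opt/aws-ofi-nccl prefix changed and that the subdirectories beneath it keep their names.

```shell
# Hypothetical helper: rewrite every /opt/aws-ofi-nccl prefix in a
# colon-separated path list to the new /opt/amazon/ofi-nccl location.
migrate_ofi_nccl_path() {
  printf '%s\n' "$1" | sed 's#/opt/aws-ofi-nccl#/opt/amazon/ofi-nccl#g'
}
```

A script could apply it with export LD_LIBRARY_PATH="$(migrate_ofi_nccl_path "$LD_LIBRARY_PATH")". Entries that do not reference the old prefix are left unchanged.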
SageMaker HyperPod DLAMI for Amazon EKS support
SageMaker HyperPod AMI releases for Amazon EKS: January 22, 2025
AMI general updates
-
New SageMaker HyperPod AMI for Amazon EKS 1.31.2.
SageMaker HyperPod DLAMI for Amazon EKS support
The AMIs include the following:
SageMaker HyperPod AMI releases for Amazon EKS: December 21, 2024
SageMaker HyperPod DLAMI for Amazon EKS support
The AMIs include the following:
SageMaker HyperPod AMI releases for Amazon EKS: December 13, 2024
SageMaker HyperPod DLAMI for Amazon EKS upgrade
-
Updated SSM Agent to version 3.3.1311.0.
SageMaker HyperPod AMI releases for Amazon EKS: November 24, 2024
AMI general updates
-
Released in MEL (Melbourne) Region.
-
Updated SageMaker HyperPod base DLAMI to the following versions:
-
Kubernetes: 2024-11-01.
SageMaker HyperPod AMI releases for Amazon EKS: November 15, 2024
SageMaker HyperPod DLAMI for Amazon EKS support
The AMIs include the following:
SageMaker HyperPod AMI releases for Amazon EKS: November 11, 2024
AMI general updates
-
Updated SageMaker HyperPod DLAMI with Amazon EKS versions 1.28.13, 1.29.8, 1.30.4.
SageMaker HyperPod AMI releases for Amazon EKS: October 21, 2024
AMI general updates
-
Updated SageMaker HyperPod base DLAMI to the following versions:
-
Amazon EKS: 1.28.11, 1.29.6, 1.30.2.
SageMaker HyperPod AMI releases for Amazon EKS: September 10, 2024
SageMaker HyperPod DLAMI for Amazon EKS support
The AMIs include the following: