SageMaker AI Model Deployment upgrades the GPU drivers on the ML instances for the Real-time, Batch, and Asynchronous Inference options over time to give customers access to improvements from the driver providers. The following table shows the GPU driver and CUDA versions supported for each inference option. Because different driver versions can change how your model interacts with the GPUs, the strategies below can help you understand how your application works with different driver versions.
Current versions and supported instance families
Amazon SageMaker AI Inference supports the following drivers and instance families:
Service | GPU | Driver version | CUDA version | Instance types |
---|---|---|---|---|
Real-time | NVIDIA | 470 | CUDA 11.4 | ml.p2.*, ml.p3.*, ml.p4d.*, ml.p4de.*, ml.g4dn.*, ml.g5.* |
Real-time | NVIDIA | 535 | CUDA 12.2 | ml.p5.*, ml.g6.* |
Real-time | NVIDIA | 550 | CUDA 12.4 | ml.p5e.*, ml.p5en.* |
Asynchronous Inference | NVIDIA | 470 | CUDA 11.4 | ml.p2.*, ml.p3.*, ml.p4d.*, ml.p4de.*, ml.g4dn.*, ml.g5.* |
Asynchronous Inference | NVIDIA | 535 | CUDA 12.2 | ml.p5.*, ml.g6.* |
Asynchronous Inference | NVIDIA | 550 | CUDA 12.4 | ml.p5e.*, ml.p5en.* |
Batch | NVIDIA | 470 | CUDA 11.4 | ml.p2.*, ml.p3.*, ml.p4d.*, ml.p4de.*, ml.g4dn.*, ml.g5.* |
Troubleshoot your model container with GPU capabilities
If you encounter an issue when running your GPU workload, see the following guidance:
Run the nvidia-smi (NVIDIA System Management Interface) command from within the Docker container. If the NVIDIA System Management Interface detects a GPU detection error or an NVIDIA initialization error, it returns the following error message:
Failed to initialize NVML: Driver/library version mismatch
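You can run the same check locally before deploying. The following is a minimal sketch; it assumes a GPU host with Docker and the NVIDIA Container Toolkit installed, and <image> stands for your model container image.

# Run nvidia-smi inside the model container to confirm GPU visibility and the
# driver/CUDA pairing the container sees.
# Assumes Docker with the NVIDIA Container Toolkit; replace <image> with your image.
docker run --rm --gpus all --entrypoint nvidia-smi <image>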
Based on your use case, follow these best practices to resolve the failure or error:
Follow the best practice recommendation described in the "If you bring your own (BYO) model containers" dropdown.
Follow the best practice recommendation described in the "If you use a CUDA compatibility layer" dropdown.
Refer to the NVIDIA System Management Interface page for more information.
If your GPU instance uses an NVIDIA driver version that is not compatible with the CUDA version in the Docker container, then deploying an endpoint fails with the following error message:
Failure reason CannotStartContainerError. Please ensure the model container for variant <variant_name> starts correctly when invoked with 'docker run <image> serve'
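To reproduce this failure outside of SageMaker, you can start the container locally the same way the error message describes. This is a minimal sketch; it assumes a GPU host with Docker and the NVIDIA Container Toolkit installed, and <image> stands for your model container image.

# Start the container the way SageMaker invokes it (per the error message above)
# and watch the logs for driver/CUDA compatibility errors.
# Assumes Docker with the NVIDIA Container Toolkit; replace <image> with your image.
docker run --rm --gpus all <image> serve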
Based on your use case, follow these best practices to resolve the failure or error:
Follow the best practice recommendation described in the "The driver my container depends on is greater than the version on the ML GPU instances" dropdown.
Follow the best practice recommendation described in the "If you use a CUDA compatibility layer" dropdown.
Best practices for working with mismatched driver versions
The following provides information on how to update your GPU driver:
The driver my container depends on is lower than the version on the ML GPU instances

No action is required. NVIDIA provides backwards compatibility.
The driver my container depends on is greater than the version on the ML GPU instances

If it is a minor version difference, no action is required. NVIDIA provides minor version forward compatibility.

If it is a major version difference, the CUDA Compatibility Package needs to be installed. Refer to CUDA Compatibility Package. A sketch of installing the package follows the Important note below.
Important
The CUDA Compatibility Package is not backwards compatible, so it needs to be disabled if the driver version on the instance is greater than the CUDA Compatibility Package version.
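The following is a minimal sketch of installing the CUDA Compatibility Package into a container image. The package name cuda-compat-12-4 is an example for a CUDA 12.4 container and assumes a Debian/Ubuntu base image with the NVIDIA CUDA apt repository already configured; match the package version to the CUDA version your container uses.

# Example install step (typically run from a Dockerfile RUN instruction).
# Assumes the NVIDIA CUDA apt repository is configured in the image;
# cuda-compat-12-4 is an example package name for a CUDA 12.4 container.
apt-get update && apt-get install -y cuda-compat-12-4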
If you bring your own (BYO) model containers

Ensure that no NVIDIA driver packages are bundled in the image, because they could conflict with the NVIDIA driver version on the host.
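One way to check for bundled driver packages is to list NVIDIA-related packages inside the image. The following is a minimal sketch for Debian/Ubuntu-based images; <image> stands for your model container image, and the grep pattern is only illustrative.

# List NVIDIA driver-related packages baked into the image (Debian/Ubuntu-based).
# Replace <image> with your container image; adapt for other package managers.
docker run --rm --entrypoint sh <image> -c "dpkg -l | grep -Ei 'nvidia-driver|libnvidia' || echo 'no NVIDIA driver packages found'"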
If you use a CUDA compatibility layer

To verify whether the platform NVIDIA driver version supports the CUDA Compatibility Package version installed in the model container, see the CUDA documentation. The CUDA Compatibility Package can be enabled dynamically when the container starts (for example, from the container's ENTRYPOINT script).
The following script demonstrates how to dynamically enable or disable the CUDA Compatibility Package based on the NVIDIA driver version detected on the host where your model container is deployed. When SageMaker releases a newer NVIDIA driver version, the installed CUDA Compatibility Package is turned off automatically if the CUDA application is supported natively on the new driver.
#!/bin/bash

# Return 0 (true) if version $1 is strictly lower than version $2.
verlt() {
    [ "$1" = "$2" ] && return 1 || [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ]
}

if [ -f /usr/local/cuda/compat/libcuda.so.1 ]; then
    # The compat package ships libcuda.so.1 as a symlink to libcuda.so.<driver_version>;
    # dropping the first two dot-separated fields leaves the maximum driver version it replaces.
    CUDA_COMPAT_MAX_DRIVER_VERSION=$(readlink /usr/local/cuda/compat/libcuda.so.1 | cut -d'.' -f 3-)
    echo "CUDA compat package should be installed for NVIDIA driver smaller than ${CUDA_COMPAT_MAX_DRIVER_VERSION}"
    # Read the NVIDIA kernel module (driver) version currently loaded on the host.
    NVIDIA_DRIVER_VERSION=$(sed -n 's/^NVRM.*Kernel Module *\([0-9.]*\).*$/\1/p' /proc/driver/nvidia/version 2>/dev/null || true)
    echo "Current installed NVIDIA driver version is ${NVIDIA_DRIVER_VERSION}"
    if verlt $NVIDIA_DRIVER_VERSION $CUDA_COMPAT_MAX_DRIVER_VERSION; then
        # Host driver is older than what the compat package targets: use the compat libraries.
        echo "Adding CUDA compat to LD_LIBRARY_PATH"
        export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH
        echo $LD_LIBRARY_PATH
    else
        # Host driver already satisfies the CUDA application: skip the compat libraries.
        echo "Skipping CUDA compat setup as newer NVIDIA driver is installed"
    fi
else
    echo "Skipping CUDA compat setup as package not found"
fi
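The detection logic above can be sourced from the container's ENTRYPOINT before the model server starts, so that the LD_LIBRARY_PATH export applies to the serving process. The following wrapper is a minimal sketch; the script path /usr/local/bin/cuda_compat_setup.sh and the serve command are placeholders for your own container layout.

#!/bin/bash
# Hypothetical ENTRYPOINT wrapper: source the compat detection script shown above
# so its LD_LIBRARY_PATH export is inherited, then hand off to the model server.
# The script path and the 'serve' command are assumptions for illustration.
source /usr/local/bin/cuda_compat_setup.sh
exec serve "$@"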