Deep learning containers for large model inference

SageMaker maintains deep learning containers (DLCs) with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. With these DLCs, you can use third-party libraries such as DeepSpeed, Accelerate, and FasterTransformer to partition model parameters with model parallelism techniques and leverage the memory of multiple GPUs for inference. The following list shows the DLCs available with SageMaker for large model inference (LMI). We recommend that you start with these DLCs for LMI on SageMaker; they include components, libraries, and drivers that have been optimized and tested for use on SageMaker. A minimal deployment sketch follows the list.

  • 763104351884.dkr.ecr.region.amazonaws.com/djl-inference:0.23.0-fastertransformer5.3.0-cu118
    Libraries: PyTorch 2.0.1, DJL Serving 0.23.0, FasterTransformer 5.3.0, Hugging Face Accelerate 0.20.3, Hugging Face Transformers 4.30.1
    Tuning parameters: https://github.com/deepjavalibrary/djl-serving/blob/0.23.0-dlc/serving/docs/configurations_large_model_inference_containers.md#fastertransformer-doc

  • 763104351884.dkr.ecr.region.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118
    Libraries: PyTorch 2.0.1, DJL Serving 0.23.0, DeepSpeed 0.9.5, Hugging Face Accelerate 0.20.3, Hugging Face Transformers 4.30.2
    Tuning parameters: https://github.com/deepjavalibrary/djl-serving/blob/0.23.0-dlc/serving/docs/configurations_large_model_inference_containers.md#deepspeed-doc

  • 763104351884.dkr.ecr.region.amazonaws.com/djl-inference:0.23.0-neuronx-sdk2.12.0
    Libraries: PyTorch 1.13.1, DJL Serving 0.23.0, TransformersNeuronX 0.4.60, AWS Neuron SDK 2.12.0, NeuronX Distributed 0.1.0
    Tuning parameters: https://github.com/deepjavalibrary/djl-serving/blob/0.23.0-dlc/serving/docs/configurations_large_model_inference_containers.md#transformers-neuronx-doc

  • 763104351884.dkr.ecr.region.amazonaws.com/djl-inference:0.21.0-fastertransformer5.3.0-cu117
    Libraries: PyTorch 1.12.1, DJL Serving 0.21.0, FasterTransformer 5.3.0, Hugging Face Accelerate 0.15.0, Hugging Face Transformers 4.25.1
    Tuning parameters: https://github.com/deepjavalibrary/djl-serving/blob/0.21.0-dlc/serving/docs/configurations.md

  • 763104351884.dkr.ecr.region.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117
    Libraries: PyTorch 1.12.1, DJL Serving 0.21.0, DeepSpeed 0.8.0, Hugging Face Accelerate 0.16.0, Hugging Face Transformers 4.26.0
    Tuning parameters: https://github.com/deepjavalibrary/djl-serving/blob/0.21.0-dlc/serving/docs/configurations.md

  • 763104351884.dkr.ecr.region.amazonaws.com/djl-inference:0.20.0-deepspeed0.7.5-cu116
    Libraries: PyTorch 1.12.1, DJL Serving 0.20.0, DeepSpeed 0.7.5, Hugging Face Accelerate 0.13.2, Hugging Face Transformers 4.23.1, Hugging Face Diffusers 0.7.2
    Tuning parameters: https://github.com/deepjavalibrary/djl-serving/blob/0.20.0-dlc/serving/docs/configurations.md

  • 763104351884.dkr.ecr.region.amazonaws.com/djl-inference:0.19.0-deepspeed0.7.3-cu113
    Libraries: PyTorch 1.12.1, DJL Serving 0.19.0, DeepSpeed 0.7.3, Hugging Face Accelerate 0.13.2, Hugging Face Transformers 4.22.1
    Tuning parameters: https://github.com/deepjavalibrary/djl-serving/blob/0.19.0-dlc/serving/docs/configurations.md

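To make the list above concrete, the following is a minimal sketch of deploying one of these DLCs with the SageMaker Python SDK. The S3 path and instance type are illustrative assumptions, not fixed values; substitute your own model artifacts and configuration.

    # Minimal sketch: deploy an LMI DLC with the SageMaker Python SDK.
    import sagemaker
    from sagemaker.model import Model

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()
    region = session.boto_region_name

    # DLC URI from the list above, with the region placeholder filled in
    image_uri = (
        f"763104351884.dkr.ecr.{region}.amazonaws.com/"
        "djl-inference:0.23.0-deepspeed0.9.5-cu118"
    )

    # model_data points to an archive containing the model artifacts and a
    # DJL Serving configuration (for example, a serving.properties file)
    model = Model(
        image_uri=image_uri,
        model_data="s3://amzn-s3-demo-bucket/my-llm/model.tar.gz",  # hypothetical path
        role=role,
        sagemaker_session=session,
    )

    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.12xlarge",  # multi-GPU instance for model parallelism
    )
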
In addition to PyTorch, the LMI DLCs include libraries that facilitate large model inference. SageMaker supports the following categories of libraries.

  • Model zoo – Model zoos provide simple API access to pre-trained models. SageMaker provides the following model zoos:

    • Hugging Face Transformers is a popular library for pre-trained deep learning models that use a transformer architecture such as GPT, OPT, and BLOOM.

    • Hugging Face Diffusers is a library with pre-trained deep learning models that use a diffusion technique such as Stable Diffusion.

  • Model parallelism and inference optimization libraries – These libraries handle model parallel inference by partitioning a model artifact so that its constituent parameters can be spread across multiple GPUs. SageMaker supports the following model parallelism and inference optimization libraries:

    • DeepSpeed Inference is an open-source inference optimization library. It includes model partitioning schemes that enable model parallelism for supported models, including many transformer models. It also has optimized kernels for popular models such as OPT, GPT, and BLOOM that can significantly improve inference latency. The version of DeepSpeed in the LMI DLCs is optimized and tested to work on SageMaker, and includes several enhancements, such as support for BF16 precision models.

    • Hugging Face Accelerate is an open-source model parallel inference library. It supports model parallelism for most models in the Hugging Face Transformers library (see the loading sketch after this list).

    • FasterTransformer is an open-source library from NVIDIA that provides an accelerated engine for efficiently running transformer-based neural network inference. It is designed to handle large models that require multiple GPUs and nodes in a distributed manner. The library includes an optimized version of the transformer block, comprising both the encoder and decoder parts, so you can run inference for full encoder-decoder architectures like T5, as well as encoder-only models like BERT and decoder-only models like GPT.

  • Model server – Model servers handle an inference request end to end. They accept requests, invoke pre-processing and post-processing scripts, and respond to users. Model servers that are compatible with model parallelism also organize workers and threads across multiple devices. SageMaker supports the following model servers:

    • DJL Serving is an open-source, high-performance model server powered by DJL. It takes multiple deep learning models or workflows and makes them available through an HTTP endpoint. Versions 0.19 and above are supported by SageMaker and work with Amazon EC2 instances with multiple GPUs to facilitate LMI with model parallelism (see the serving.properties sketch after this list).
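
As an illustration of the Accelerate and Transformers bullets above, the following minimal sketch loads a pre-trained checkpoint sharded across the available GPUs. The model ID and prompt are illustrative assumptions, not recommendations.

    # Sketch: shard a pre-trained causal LM across available GPUs using
    # Hugging Face Accelerate through the Transformers loading API.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "facebook/opt-6.7b"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",   # Accelerate places layers across GPUs (and CPU if needed)
        torch_dtype="auto",  # keep the precision stored in the checkpoint
    )

    inputs = tokenizer("Large model inference on SageMaker", return_tensors="pt")
    outputs = model.generate(**inputs.to(model.device), max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))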

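To see how a model server and a partitioning library fit together, consider the serving.properties file that DJL Serving reads from the model artifact. The following is a minimal sketch, assuming DeepSpeed as the engine and an illustrative model ID; the full set of keys is covered by the tuning parameter links in the list of DLCs above.

    # Hypothetical serving.properties; the values are illustrative.
    # engine selects the inference backend, and
    # option.tensor_parallel_degree controls how many GPUs
    # the model is partitioned across.
    engine=DeepSpeed
    option.model_id=facebook/opt-6.7b
    option.tensor_parallel_degree=4
    option.dtype=fp16
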
Supported instance types

AWS LMI DLCs support the p4d, p3, g5, and g4dn instance types.