Model parallelism and large model inference

State-of-the-art deep learning models for applications such as natural language processing (NLP) are large, typically with tens to hundreds of billions of parameters. Larger models are often more accurate, which makes them attractive to machine learning practitioners. However, these models are often too large to fit in the memory of a single accelerator or GPU, making it difficult to achieve low-latency inference. You can avoid this memory bottleneck by using model parallelism techniques to partition a model across multiple accelerators or GPUs.
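To make the idea concrete, the following is a minimal sketch of manual model parallelism in PyTorch: the layers of a toy network are placed on two GPUs so that neither device has to hold all of the weights. This hand-rolled split is for illustration only, under the assumption of a host with at least two GPUs; the SageMaker tooling described below automates partitioning for real large models.

```python
# Minimal sketch of naive model parallelism in PyTorch (illustrative only).
# The network's layers are split across two GPUs, and the intermediate
# activation is moved between devices during the forward pass.
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        # Second half lives on GPU 1, so no single device holds all the weights.
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Transfer the activation to the second device before continuing.
        return self.part2(x.to("cuda:1"))

model = TwoDeviceModel()
out = model(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 1024]), resident on cuda:1
```

Real large-model serving stacks use more sophisticated schemes, such as tensor parallelism (sharding individual weight matrices) and pipeline parallelism (overlapping micro-batches across the partitions), rather than this sequential hand-off.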

Amazon SageMaker includes specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference (LMI). In the following sections, you can find resources to get started with LMI on SageMaker.
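As an orientation before those sections, here is a hedged sketch of what deploying a large model with an LMI container can look like using the SageMaker Python SDK. The container image URI, environment variable names, model ID, and instance type below are placeholders and assumptions for this sketch; consult the LMI container documentation linked in the following sections for the currently supported images and configuration options.

```python
# Hedged sketch: hosting a large model on a SageMaker endpoint with an LMI
# deep learning container. Values marked as assumptions must be replaced
# with ones from the current LMI documentation.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Assumption: a placeholder LMI container image URI. Look up the current
# image for your AWS Region and LMI version before using this.
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:<tag>"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        # Assumed illustrative LMI settings: which model to serve and how
        # many GPUs to partition it across (tensor parallel degree).
        "HF_MODEL_ID": "tiiuae/falcon-7b",
        "TENSOR_PARALLEL_DEGREE": "4",
    },
    sagemaker_session=session,
)

# Deploy on a multi-GPU instance so the partitioned model fits in the
# instance's aggregate GPU memory (instance type is an assumption).
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)
```

The key point the sketch illustrates is that the partitioning happens inside the container: you request a multi-GPU instance and pass configuration such as the tensor parallel degree, and the LMI serving stack shards the model across the available accelerators at load time.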