Model parallelism and large model inference
State-of-the-art deep learning models for applications such as natural language processing (NLP) are large, typically with tens or hundreds of billions of parameters. Larger models are often more accurate, which makes them attractive to machine learning practitioners. However, these models are often too large to fit on a single accelerator or GPU device, making it difficult to achieve low-latency inference. You can avoid this memory bottleneck by using model parallelism techniques to partition a model across multiple accelerators or GPUs.
Amazon SageMaker includes specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference (LMI). In the following sections, you can find resources to get started with LMI on SageMaker.
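As a quick orientation before the topics below, the following is a minimal sketch of what deploying a large model with an LMI deep learning container can look like using the SageMaker Python SDK. The image URI, S3 path, role ARN, endpoint name, and environment settings shown here are illustrative placeholders, not official values; see the sections that follow for the supported containers and configuration options.

```python
# Minimal sketch: deploy a model with a SageMaker LMI deep learning container.
# All identifiers below (image URI, S3 location, role ARN, env settings,
# endpoint name) are placeholder assumptions for illustration only.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder role ARN

model = Model(
    # Placeholder LMI DLC image URI; look up the current image for your Region.
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:latest",
    # Placeholder S3 prefix containing the model artifacts and serving configuration.
    model_data="s3://amzn-s3-demo-bucket/large-model/",
    role=role,
    env={
        # Illustrative setting: partition the model across 4 GPUs on the instance.
        "OPTION_TENSOR_PARALLEL_DEGREE": "4",
    },
    sagemaker_session=session,
)

# Deploy to a multi-GPU instance so the partitioned model fits in aggregate GPU memory.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # example multi-GPU instance; size to your model
    endpoint_name="my-lmi-endpoint",
)
```

The key idea is that the container handles model partitioning at load time, driven by configuration such as the tensor parallel degree, so the endpoint definition itself stays a standard SageMaker model-and-deploy workflow.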
Topics
- Deep learning containers for large model inference
- SageMaker endpoint parameters for large model inference
- Large model inference tutorials
- Configurations and settings
- Choosing instance types for large model inference
- Deploying uncompressed models
- Large model inference FAQs
- Large model inference troubleshooting
- Release notes for large model inference deep learning containers