Large model inference FAQs

Refer to the following FAQ items for answers to commonly asked questions about large model inference (LMI) with SageMaker.

Q: Should I use a large model inference (LMI) deep learning container (DLC)?

A: Large model inference deep learning containers (LMI DLCs) include tested versions of popular model parallelization and optimization libraries for hosting models with tens or hundreds of billions of parameters. Use an LMI DLC if your deep learning model requires multiple accelerators for hosting, or if you want to accelerate inference for a popular, supported model such as GPT, Bloom, or OPT.
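For reference, the following is a minimal sketch of deploying a model with an LMI DLC through the SageMaker Python SDK. The IAM role, S3 path, container version, and instance type are placeholder assumptions; substitute values from your own account and model.

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.model import Model

    session = sagemaker.Session()
    region = session.boto_region_name
    role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

    # Retrieve an LMI DLC image URI (here, the DJL DeepSpeed container).
    image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=region,
        version="0.23.0",  # assumed version; pick one available in your Region
    )

    # model_data points to a tarball that contains serving.properties
    # (and an optional custom model.py entry point).
    model = Model(
        image_uri=image_uri,
        model_data="s3://amzn-s3-demo-bucket/lmi/model.tar.gz",  # placeholder
        role=role,
        sagemaker_session=session,
    )

    # Choose a multi-GPU instance so the model can be sharded across accelerators.
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.12xlarge",
    )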

Q: What should I do if I still can't host my large model?

A: If you continue to have problems hosting your large model after reviewing the documentation, contact AWS Support.

Q: Can I use LMI DLCs outside of SageMaker?

A: AWS LMI DLCs are tested on and designed for SageMaker, but you can also use them on Amazon EC2 with supported instance types.

Q: Can I host a model that isn't in the Hugging Face format?

A: AWS LMI DLCs include a convenience function that loads models in the Hugging Face format. You can bring a model in another format, such as a Megatron checkpoint, but you need to write the logic to load that format or to convert it to the Hugging Face format.
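The following model.py is a hypothetical sketch of where such custom loading logic would live when using the DJL Python engine, assuming the built-in djl_python module and a Hugging Face checkpoint as the end result; the model ID and generation settings are placeholders.

    # model.py -- hypothetical custom entry point for the DJL Python engine.
    # Replace load_model with your own logic if your checkpoint is not in the
    # Hugging Face format (for example, convert a Megatron checkpoint here).
    from djl_python import Input, Output
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = None
    tokenizer = None

    def load_model(properties):
        # Properties come from serving.properties (for example, option.model_id).
        model_id = properties.get("model_id", "gpt2")  # placeholder default
        tok = AutoTokenizer.from_pretrained(model_id)
        mdl = AutoModelForCausalLM.from_pretrained(model_id)
        return mdl, tok

    def handle(inputs: Input) -> Output:
        global model, tokenizer
        if model is None:
            model, tokenizer = load_model(inputs.get_properties())
        if inputs.is_empty():
            # Warm-up request sent by the serving framework.
            return None
        text = inputs.get_as_string()
        input_ids = tokenizer(text, return_tensors="pt").input_ids
        output_ids = model.generate(input_ids, max_new_tokens=64)
        return Output().add(tokenizer.decode(output_ids[0], skip_special_tokens=True))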

Q: How do I switch to a different engine or handler?

A: If you're using the DJL handler (as in the tutorial example), switch the default engine and handler in the serving.properties file. For example, modify the engine and option.entryPoint settings to switch from DeepSpeed to Hugging Face. In general, you can switch the engine and keep the same tensor parallel degree, as sketched below.
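The two serving.properties snippets below illustrate such a switch, assuming the built-in djl_python handlers; the model ID and tensor parallel degree are placeholder values.

    # serving.properties with the DeepSpeed engine
    engine=DeepSpeed
    option.entryPoint=djl_python.deepspeed
    option.model_id=EleutherAI/gpt-j-6b
    option.tensor_parallel_degree=4

    # serving.properties switched to Hugging Face Accelerate (Python engine)
    engine=Python
    option.entryPoint=djl_python.huggingface
    option.model_id=EleutherAI/gpt-j-6b
    option.tensor_parallel_degree=4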