MLSUS-14: Deploy multiple models behind a single endpoint - Machine Learning Lens

Host multiple models behind a single endpoint to improve endpoint utilization. Sharing endpoint resources is more sustainable and less expensive than deploying a single model behind one endpoint.

Implementation plan

Amazon SageMaker provides three methods to deploy multiple models to a single endpoint:

  • Host multiple models in one container behind one endpoint (MLCOST-24) - SageMaker multi-model endpoints (MME) are served using a single container. This feature is ideal when you have a large number of similar models that you can serve through a shared serving container and don’t need to access all of the models at the same time. This can help cut inference costs and reduce carbon emissions by up to 90%.

  • Host multiple models which use different containers behind one endpoint (MLCOST-24) – SageMaker multi-container endpoints (MCE) support deploying up to 15 containers that use different models or frameworks on a single endpoint, and invoking them independently or in sequence for low-latency inference and cost savings. The models can be completely heterogeneous, each with its own independent serving stack.

  • Use SageMaker inference pipelines - An inference pipeline is an Amazon SageMaker model composed of a linear sequence of containers deployed behind a single endpoint. You can use an inference pipeline to combine preprocessing, prediction, and post-processing data science tasks. The output from one container is passed as input to the next. When defining the containers for a pipeline model, you also specify the order in which they are run.
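From the client's perspective, the first two options above differ mainly in how a request addresses the endpoint: a multi-model endpoint routes on a `TargetModel` parameter, a multi-container endpoint routes on `TargetContainerHostname`, and an inference pipeline is invoked like a plain endpoint. As a rough sketch, the hypothetical helper below (the name `build_invoke_args` is our own) assembles the keyword arguments you would pass to the SageMaker Runtime `invoke_endpoint` API:

```python
def build_invoke_args(endpoint_name, payload,
                      target_model=None, target_container=None):
    """Assemble keyword arguments for a SageMaker Runtime
    invoke_endpoint call (illustrative helper, not part of any SDK).

    - target_model: model artifact to load/serve on a multi-model
      endpoint (MME), e.g. "model-a.tar.gz".
    - target_container: container hostname to invoke on a
      multi-container endpoint (MCE).
    Omit both for a single-model endpoint or an inference pipeline,
    which are invoked like any plain endpoint.
    """
    args = {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": payload,
    }
    if target_model is not None:
        args["TargetModel"] = target_model                   # MME routing
    if target_container is not None:
        args["TargetContainerHostname"] = target_container   # MCE routing
    return args


# Usage (assumes AWS credentials and an already-deployed endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     **build_invoke_args("my-endpoint", b'{"x": 1}',
#                         target_model="model-a.tar.gz"))
```

The endpoint name and model artifact names here are placeholders; substitute the ones from your own deployment.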
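The data flow of an inference pipeline, where each container's output feeds the next in the declared order, can be simulated locally with plain functions. This is an illustrative stand-in for SageMaker's serial container invocation, not the real API:

```python
def run_pipeline(containers, request):
    """Simulate inference-pipeline data flow: each stage's output
    is passed as input to the next, in the declared order."""
    data = request
    for handler in containers:
        data = handler(data)
    return data


# Example stages standing in for containers:
# preprocessing -> prediction -> post-processing.
preprocess = lambda xs: [x / 10 for x in xs]     # scale raw features
predict = lambda xs: sum(xs)                     # toy "model"
postprocess = lambda y: {"score": y}             # wrap as a response

result = run_pipeline([preprocess, predict, postprocess], [10, 20, 30])
# result == {"score": 6.0}
```

Reordering the list changes the pipeline, mirroring how container order is part of a pipeline model's definition.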
