Bring your own dependencies for multi-model endpoints on CPU backed instances Bring your own dependencies for multi-model endpoints on GPU backed instances Use the SageMaker AI Inference Toolkit

Build Your Own Container for SageMaker AI Multi-Model Endpoints

Refer to the following sections for bringing your own container and dependencies to multi-model endpoints.

Topics

Bring your own dependencies for multi-model endpoints on CPU backed instances
Bring your own dependencies for multi-model endpoints on GPU backed instances
Use the SageMaker AI Inference Toolkit
Custom Containers Contract for Multi-Model Endpoints

Bring your own dependencies for multi-model endpoints on CPU backed instances

If none of the pre-built container images serve your needs, you can build your own container for use with CPU backed multi-model endpoints.

Custom Amazon Elastic Container Registry (Amazon ECR) images deployed in Amazon SageMaker AI are expected to adhere to the basic contract described in Custom Inference Code with Hosting Services that govern how SageMaker AI interacts with a Docker container that runs your own inference code. For a container to be capable of loading and serving multiple models concurrently, there are additional APIs and behaviors that must be followed. This additional contract includes new APIs to load, list, get, and unload models, and a different API to invoke models. There are also different behaviors for error scenarios that the APIs need to abide by. To indicate that the container complies with the additional requirements, you can add the following command to your Docker file:


LABEL com.amazonaws.sagemaker.capabilities.multi-models=true

SageMaker AI also injects an environment variable into the container


SAGEMAKER_MULTI_MODEL=true

If you are creating a multi-model endpoint for a serial inference pipline, your Docker file must have the required labels for both multi-models and serial inference pipelines. For more information about serial information pipelines, see Run Real-time Predictions with an Inference Pipeline.

To help you implement these requirements for a custom container, two libraries are available:

Multi Model Server is an open source framework for serving machine learning models that can be installed in containers to provide the front end that fulfills the requirements for the new multi-model endpoint container APIs. It provides the HTTP front end and model management capabilities required by multi-model endpoints to host multiple models within a single container, load models into and unload models out of the container dynamically, and performs inference on a specified loaded model. It also provides a pluggable backend that supports a pluggable custom backend handler where you can implement your own algorithm.
SageMaker AI Inference Toolkit is a library that bootstraps Multi Model Server with a configuration and settings that make it compatible with SageMaker AI multi-model endpoints. It also allows you to tweak important performance parameters, such as the number of workers per model, depending on the needs of your scenario.

Bring your own dependencies for multi-model endpoints on GPU backed instances

The bring your own container (BYOC) capability on multi-model endpoints with GPU backed instances is not currently supported by the Multi Model Server and SageMaker AI Inference Toolkit libraries.

For creating multi-model endpoints with GPU backed instances, you can use the SageMaker AI supported NVIDIA Triton Inference Server. with the NVIDIA Triton Inference Containers. To bring your own dependencies, you can build your own container with the SageMaker AI supported NVIDIA Triton Inference Server as the base image to your Docker file:


FROM 301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.07-py3

Important

Containers with the Triton Inference Server are the only supported containers you can use for GPU backed multi-model endpoints.

Use the SageMaker AI Inference Toolkit

Note

The SageMaker AI Inference Toolkit is only supported for CPU backed multi-model endpoints. The SageMaker AI Inference Toolkit is not currently not supported for GPU backed multi-model endpoints.

Pre-built containers that support multi-model endpoints are listed in Supported algorithms, frameworks, and instances for multi-model endpoints. If you want to use any other framework or algorithm, you need to build a container. The easiest way to do this is to use the SageMaker AI Inference Toolkit to extend an existing pre-built container. The SageMaker AI inference toolkit is an implementation for the multi-model server (MMS) that creates endpoints that can be deployed in SageMaker AI. For a sample notebook that shows how to set up and deploy a custom container that supports multi-model endpoints in SageMaker AI, see the Multi-Model Endpoint BYOC Sample Notebook.

Note

The SageMaker AI inference toolkit supports only Python model handlers. If you want to implement your handler in any other language, you must build your own container that implements the additional multi-model endpoint APIs. For information, see Custom Containers Contract for Multi-Model Endpoints.

To extend a container by using the SageMaker AI inference toolkit

Create a model handler. MMS expects a model handler, which is a Python file that implements functions to pre-process, get preditions from the model, and process the output in a model handler. For an example of a model handler, see model_handler.py from the sample notebook.

Import the inference toolkit and use its model_server.start_model_server function to start MMS. The following example is from the dockerd-entrypoint.py file from the sample notebook. Notice that the call to model_server.start_model_server passes the model handler described in the previous step:


import subprocess
import sys
import shlex
import os
from retrying import retry
from subprocess import CalledProcessError
from sagemaker_inference import model_server

def _retry_if_error(exception):
    return isinstance(exception, CalledProcessError or OSError)

@retry(stop_max_delay=1000 * 50,
       retry_on_exception=_retry_if_error)
def _start_mms():
    # by default the number of workers per model is 1, but we can configure it through the
    # environment variable below if desired.
    # os.environ['SAGEMAKER_MODEL_SERVER_WORKERS'] = '2'
    model_server.start_model_server(handler_service='/home/model-server/model_handler.py:handle')

def main():
    if sys.argv[1] == 'serve':
        _start_mms()
    else:
        subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))

    # prevent docker exit
    subprocess.call(['tail', '-f', '/dev/null'])
    
main()

In your Dockerfile, copy the model handler from the first step and specify the Python file from the previous step as the entrypoint in your Dockerfile. The following lines are from the Dockerfile used in the sample notebook:


# Copy the default custom service file to handle incoming data and inference requests
COPY model_handler.py /home/model-server/model_handler.py

# Define an entrypoint script for the docker image
ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]

Build and register your container. The following shell script from the sample notebook builds the container and uploads it to an Amazon Elastic Container Registry repository in your AWS account:


%%sh

# The name of our algorithm
algorithm_name=demo-sagemaker-multimodel

cd container

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -q -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

You can now use this container to deploy multi-model endpoints in SageMaker AI.

Topics

Custom Containers Contract for Multi-Model Endpoints

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Add or Remove Models

API Container Contract