Custom Inference Code with Hosting Services
This section explains how Amazon SageMaker AI interacts with a Docker container that runs your own inference code for hosting services. Use this information to write inference code and create a Docker image.
How SageMaker AI Runs Your Inference Image
To configure a container to run as an executable, use an ENTRYPOINT
instruction in a Dockerfile. Note the following:
-
For model inference, SageMaker AI runs the container as:
docker run image serve
SageMaker AI overrides default CMD statements in a container by specifying the serve argument after the image name. The serve argument overrides arguments that you provide with the CMD command in the Dockerfile.
-
SageMaker AI expects all containers to run with root users. Create your container so that it uses only root users. When SageMaker AI runs your container, users that do not have root-level access can cause permissions issues.
-
We recommend that you use the
exec form of the ENTRYPOINT instruction:
ENTRYPOINT ["executable", "param1", "param2"]
For example:
ENTRYPOINT ["python", "k_means_inference.py"]
The exec form of the ENTRYPOINT instruction starts the executable directly, not as a child of /bin/sh. This enables it to receive signals like SIGTERM and SIGKILL from the SageMaker API operations, which is a requirement (see the sketch after this list).
For example, when you use the CreateEndpoint API to create an endpoint, SageMaker AI provisions the number of ML compute instances required by the endpoint configuration, which you specify in the request. SageMaker AI runs the Docker container on those instances.
If you reduce the number of instances backing the endpoint (by calling the UpdateEndpointWeightsAndCapacities API), SageMaker AI runs a command to stop the Docker container on the instances that are being terminated. The command sends the SIGTERM signal, then it sends the SIGKILL signal 30 seconds later.
If you update the endpoint (by calling the UpdateEndpoint API), SageMaker AI launches another set of ML compute instances and runs the Docker containers that contain your inference code on them. Then it runs a command to stop the previous Docker containers. To stop a Docker container, the command sends the SIGTERM signal, then it sends the SIGKILL signal 30 seconds later.
-
SageMaker AI uses the container definition that you provided in your
CreateModel request to set environment variables and the DNS hostname for the container as follows:
-
It sets environment variables using the
ContainerDefinition.Environment string-to-string map.
-
It sets the DNS hostname using the
ContainerDefinition.ContainerHostname.
-
If you plan to use GPU devices for model inferences (by specifying GPU-based ML compute instances in your
CreateEndpointConfig request), make sure that your containers are nvidia-docker compatible. Don't bundle NVIDIA drivers with the image. For more information about nvidia-docker, see NVIDIA/nvidia-docker.
-
You can't use the
tini initializer as your entry point in SageMaker AI containers because it gets confused by the train and serve arguments.
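Because the exec form of ENTRYPOINT starts your process directly, that process can receive the SIGTERM that SageMaker AI sends before SIGKILL. The following is a minimal sketch of such an entry point; the file name serve.py and the shutdown logic are illustrative placeholders, not part of the SageMaker AI contract.
# serve.py -- illustrative entry point, started by an exec-form ENTRYPOINT
# such as: ENTRYPOINT ["python", "serve.py"]
# SageMaker AI appends the serve argument, so it is available in sys.argv.
import signal
import sys

def shutdown(signum, frame):
    # SageMaker AI sends SIGTERM and, 30 seconds later, SIGKILL.
    # Finish in-flight work and exit promptly.
    print("Received SIGTERM, shutting down", flush=True)
    sys.exit(0)

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, shutdown)
    # Start your model server here (for example, an HTTP server on port 8080).
    signal.pause()  # placeholder: block until a signal arrives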
How SageMaker AI Loads Your Model Artifacts
In your CreateModel API request, you can use either the
ModelDataUrl or S3DataSource parameter to identify the S3
location where model artifacts are stored. SageMaker AI copies your model artifacts from the S3
location to the /opt/ml/model directory for use by your inference
code. Your container has read-only access to /opt/ml/model. Do not
write to this directory.
The ModelDataUrl must point to a tar.gz file. Otherwise, SageMaker AI won't
download the file.
If you trained your model in SageMaker AI, the model artifacts are saved as a single compressed tar file in Amazon S3. If you trained your model outside SageMaker AI, you need to create this single compressed tar file and save it in an S3 location. SageMaker AI decompresses this tar file into the /opt/ml/model directory before your container starts.
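If you create the artifacts yourself, the following is a minimal sketch of packaging and uploading them with Python; the file names, bucket, and key are placeholders for your own values.
# Illustrative sketch: package artifacts trained outside SageMaker AI into a
# single compressed tar file and upload it to Amazon S3. Names are placeholders.
import tarfile
import boto3

with tarfile.open("model.tar.gz", "w:gz") as tar:
    # Add files at the archive root so they land directly in /opt/ml/model.
    tar.add("model.pkl", arcname="model.pkl")

# Reference this S3 location in ModelDataUrl when you call CreateModel.
boto3.client("s3").upload_file("model.tar.gz", "amzn-s3-demo-bucket", "models/model.tar.gz")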
For deploying large models, we recommend that you follow Deploying uncompressed models.
How Your Container Should Respond to Inference Requests
To obtain inferences, the client application sends a POST request to the SageMaker AI endpoint. SageMaker AI passes the request to the container, and returns the inference result from the container to the client.
For more information about the inference requests that your container will receive, see the following actions in the Amazon SageMaker AI API Reference:
-
InvokeEndpoint
-
InvokeEndpointWithResponseStream
Requirements for inference containers
To respond to inference requests, your container must meet the following requirements:
-
SageMaker AI strips all
POST headers except those supported by InvokeEndpoint. SageMaker AI might add additional headers. Inference containers must be able to safely ignore these additional headers.
-
To receive inference requests, the container must have a web server listening on port 8080 and must accept
POST requests to the /invocations and /ping endpoints.
-
A customer's model containers must accept socket connection requests within 250 ms.
-
A customer's model containers must respond to requests within 60 seconds. The model itself can have a maximum processing time of 60 seconds before responding to
/invocations. If your model is going to take 50-60 seconds of processing time, the SDK socket timeout should be set to 70 seconds (see the sketch after this list).
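As one example of raising that timeout with the AWS SDK for Python (Boto3), you could configure the client as shown below; the endpoint name and payload are placeholders.
# Illustrative sketch: set the client-side socket timeout to 70 seconds for a
# model that needs 50-60 seconds to respond. Endpoint name and payload are placeholders.
import boto3
from botocore.config import Config

runtime = boto3.client("sagemaker-runtime", config=Config(read_timeout=70))
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",
    ContentType="application/json",
    Body=b'{"instances": [[1.0, 2.0]]}',
)
print(response["Body"].read())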
Example invocation functions
The following examples demonstrate how the code in your container can process inference requests. These examples handle requests that client applications send by using the InvokeEndpoint action.
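The sketch below shows one possible shape for such code, using Flask as the web server. Flask is an assumption for illustration only, and the model loading and scoring logic are placeholders; any web server that meets the requirements above works.
# Illustrative Flask sketch of an inference server.
import json
from flask import Flask, Response, request

app = Flask(__name__)
# Placeholder: load your model from /opt/ml/model at startup.

@app.route("/ping", methods=["GET"])
def ping():
    # Return 200 with an empty body once the container is ready.
    return Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = json.loads(request.data)
    prediction = {"result": payload}  # placeholder: call your model here
    return Response(json.dumps(prediction), status=200, mimetype="application/json")

if __name__ == "__main__":
    # SageMaker AI sends requests to port 8080.
    app.run(host="0.0.0.0", port=8080)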
Example invocation functions for streaming requests
The following examples demonstrate how the code in your inference container can process streaming inference requests. These examples handle requests that client applications send by using the InvokeEndpointWithResponseStream action.
When a container handles a streaming inference request, it returns the model's inference as a series of parts incrementally as the model generates them. Client applications start receiving responses immediately when they're available. They don't need to wait for the model to generate the entire response. You can implement streaming to support fast interactive experiences, such as chatbots, virtual assistants, and music generators.
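As a hedged sketch of this pattern, a Flask handler can return a generator so that the response body is sent to the client in parts as the model produces them; the token generator below is a placeholder for real model output.
# Illustrative streaming sketch. The token generator is a placeholder for
# output that your model produces incrementally.
from flask import Flask, Response

app = Flask(__name__)

@app.route("/invocations", methods=["POST"])
def invocations():
    def generate():
        for token in ["Hello", ", ", "world", "!"]:  # placeholder model output
            yield token
    # Returning a generator streams the body to the client part by part.
    return Response(generate(), status=200, mimetype="text/plain")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)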
How Your Container Should Respond to Health Check (Ping) Requests
SageMaker AI launches new inference containers in the following situations:
-
Responding to
CreateEndpoint, UpdateEndpoint, and UpdateEndpointWeightsAndCapacities API calls
-
Security patching
-
Replacing unhealthy instances
Soon after container startup, SageMaker AI starts sending periodic GET requests to the
/ping endpoint.
The simplest requirement on the container is to respond with an HTTP 200 status code
and an empty body. This indicates to SageMaker AI that the container is ready to accept
inference requests at the /invocations endpoint.
If the container does not begin to pass health checks by consistently responding with
200s during the 8 minutes after startup, the new instance launch fails. This causes
CreateEndpoint to fail, leaving the endpoint in a failed state. The
update requested by UpdateEndpoint isn't completed, security patches
aren't applied, and unhealthy instances aren't replaced.
While the minimum bar is for the container to return a static 200, a container
developer can use this functionality to perform deeper checks. The request timeout on
/ping attempts is 2 seconds.
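For example, a /ping handler could verify that the model finished loading before reporting healthy, as in the hedged sketch below; the readiness check shown is a placeholder, and any check must complete well within the 2-second timeout.
# Illustrative sketch of a deeper /ping check. The readiness condition is a
# placeholder; keep it fast because /ping times out after 2 seconds.
from flask import Flask, Response

app = Flask(__name__)
model = None  # placeholder: assigned after loading artifacts from /opt/ml/model

@app.route("/ping", methods=["GET"])
def ping():
    healthy = model is not None  # placeholder readiness check
    return Response(status=200 if healthy else 503)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)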