Use Your Own Inference Code with Hosting Services
This section explains how Amazon SageMaker interacts with a Docker container that runs your own inference code for hosting services. Use this information to write inference code and create a Docker image.
How SageMaker Runs Your Inference Image
To configure a container to run as an executable, use an ENTRYPOINT
instruction in a Dockerfile. Note the following:
- For model inference, SageMaker runs the container as:

  docker run image serve

  SageMaker overrides default CMD statements in a container by specifying the serve argument after the image name. The serve argument overrides arguments that you provide with the CMD command in the Dockerfile.

- We recommend that you use the exec form of the ENTRYPOINT instruction:

  ENTRYPOINT ["executable", "param1", "param2"]

  For example:

  ENTRYPOINT ["python", "k_means_inference.py"]

  The exec form of the ENTRYPOINT instruction starts the executable directly, not as a child of /bin/sh. This enables it to receive signals like SIGTERM and SIGKILL from the SageMaker API operations, which is a requirement. (A minimal entry point sketch that handles these signals follows this list.)

  For example, when you use the CreateEndpoint API to create an endpoint, SageMaker provisions the number of ML compute instances required by the endpoint configuration, which you specify in the request. SageMaker runs the Docker container on those instances.

  If you reduce the number of instances backing the endpoint (by calling the UpdateEndpointWeightsAndCapacities API), SageMaker runs a command to stop the Docker container on the instances that are being terminated. The command sends the SIGTERM signal, then it sends the SIGKILL signal 30 seconds later.

  If you update the endpoint (by calling the UpdateEndpoint API), SageMaker launches another set of ML compute instances and runs the Docker containers that contain your inference code on them. Then it runs a command to stop the previous Docker containers. To stop a Docker container, the command sends the SIGTERM signal, then it sends the SIGKILL signal 30 seconds later.

- SageMaker uses the container definition that you provided in your CreateModel request to set environment variables and the DNS hostname for the container as follows:

  - It sets environment variables using the ContainerDefinition.Environment string-to-string map.
  - It sets the DNS hostname using the ContainerDefinition.ContainerHostname.

- If you plan to use GPU devices for model inferences (by specifying GPU-based ML compute instances in your CreateEndpointConfig request), make sure that your containers are nvidia-docker compatible. Don't bundle NVIDIA drivers with the image. For more information about nvidia-docker, see NVIDIA/nvidia-docker.

- You can't use the tini initializer as your entry point in SageMaker containers because it gets confused by the train and serve arguments.
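The exec-form requirement matters because a process started as a child of /bin/sh never receives the SIGTERM that SageMaker sends before it stops a container. The following is a minimal sketch of an entry point that shuts down cleanly on SIGTERM; the idle loop is a stand-in for whatever server your container actually runs, and the argument check reflects the serve argument that SageMaker appends:

```python
# Entry point sketch: exit cleanly when SageMaker sends SIGTERM.
import signal
import sys
import time

_shutting_down = False

def _handle_sigterm(signum, frame):
    # SageMaker sends SIGTERM first and SIGKILL 30 seconds later,
    # so finish in-flight work quickly and exit.
    global _shutting_down
    _shutting_down = True

signal.signal(signal.SIGTERM, _handle_sigterm)

if __name__ == "__main__":
    # SageMaker runs the container as: docker run image serve
    if len(sys.argv) > 1 and sys.argv[1] != "serve":
        sys.exit(f"unexpected argument: {sys.argv[1]}")
    while not _shutting_down:
        time.sleep(1)  # placeholder for the real request-serving loop
    sys.exit(0)
```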
How SageMaker Loads Your Model Artifacts
In your CreateModel request, the container definition includes the ModelDataUrl parameter, which identifies the S3 location where model artifacts are stored. SageMaker uses this information to determine where to copy the model artifacts from. It copies the artifacts to the /opt/ml/model directory for use by your inference code.

The ModelDataUrl must point to a tar.gz file. Otherwise, SageMaker won't download the file.

If you trained your model in SageMaker, the model artifacts are saved as a single compressed tar file in Amazon S3. If you trained your model outside SageMaker, you need to create this single compressed tar file and save it in an S3 location. SageMaker decompresses this tar file into the /opt/ml/model directory before your container starts.
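For a model trained outside SageMaker, the packaging step is simply a gzipped tar of the artifact files. A minimal sketch, assuming a hypothetical model.joblib artifact and placeholder bucket and key names:

```python
# Package model artifacts as model.tar.gz and upload them to Amazon S3.
import tarfile
import boto3

with tarfile.open("model.tar.gz", "w:gz") as tar:
    # Add artifacts at the archive root; SageMaker extracts the archive
    # into /opt/ml/model before the container starts.
    tar.add("model.joblib", arcname="model.joblib")

boto3.client("s3").upload_file(
    "model.tar.gz", "my-bucket", "models/my-model/model.tar.gz"
)
# Use s3://my-bucket/models/my-model/model.tar.gz as ModelDataUrl in CreateModel.
```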
How Containers Serve Requests
Containers must implement a web server that responds to /invocations and /ping requests on port 8080.
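A minimal sketch of such a server, using Flask (any web framework works; the prediction logic is a placeholder that simply echoes the request body):

```python
# Minimal inference server: /ping and /invocations on port 8080.
import flask

app = flask.Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # A static 200 tells SageMaker the container is ready for requests.
    return flask.Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = flask.request.data  # raw request body from InvokeEndpoint
    result = payload              # placeholder: run your model here
    return flask.Response(response=result, status=200,
                          mimetype="application/octet-stream")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Production containers typically run an app like this behind a production WSGI server (for example, Gunicorn) rather than Flask's built-in development server.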
How Your Container Should Respond to Inference Requests
To obtain inferences, the client application sends a POST request to the SageMaker
endpoint. For more information, see the InvokeEndpoint
API. SageMaker passes the request to the container,
and returns the inference result from the container to the client. Note the
following:
- SageMaker strips all POST headers except those supported by InvokeEndpoint. SageMaker might add additional headers. Inference containers must be able to safely ignore these additional headers.
- To receive inference requests, the container must have a web server listening on port 8080 and must accept POST requests to the /invocations endpoint.
- A customer's model containers must accept socket connection requests within 250 ms.
- A customer's model containers must respond to requests within 60 seconds. The model itself can take a maximum of 60 seconds of processing time before responding to /invocations. If your model is going to take 50-60 seconds of processing time, the SDK socket timeout should be set to 70 seconds.
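On the client side, that 70-second socket timeout can be set through the SDK configuration. A sketch using boto3; the endpoint name and payload are placeholders:

```python
# Call InvokeEndpoint with a read timeout longer than the model's
# worst-case processing time (70 s for a model that takes up to 60 s).
import boto3
from botocore.config import Config

runtime = boto3.client(
    "sagemaker-runtime",
    config=Config(read_timeout=70, retries={"max_attempts": 0}),
)

response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",           # placeholder endpoint name
    ContentType="application/json",
    Body=b'{"instances": [[1.0, 2.0]]}',  # placeholder payload
)
print(response["Body"].read())
```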
How Your Container Should Respond to Health Check (Ping) Requests
SageMaker launches new inference containers in the following situations:
- Responding to CreateEndpoint, UpdateEndpoint, and UpdateEndpointWeightsAndCapacities API calls
- Security patching
- Replacing unhealthy instances
Soon after container startup, SageMaker starts sending periodic GET requests to the
/ping
endpoint.
The simplest requirement on the container is to respond with an HTTP 200 status code
and an empty body. This indicates to SageMaker that the container is ready to accept
inference requests at the /invocations
endpoint.
If the container does not begin to pass health checks by consistently responding with 200s during the 4 minutes after startup, the new instance launch will fail. This will cause
CreateEndpoint
to fail, leaving the endpoint in a failed state. The
update requested by UpdateEndpoint
will not be completed, security patches
will not be applied, and unhealthy instances will not be replaced.
While the minimum bar is for the container to return a static 200, a container
developer can use this functionality to perform deeper checks. The request timeout on
/ping
attempts is 2 seconds.
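For example, a deeper check can report healthy only after model artifacts are actually available to the container. A sketch of such a /ping handler, assuming a Flask server like the one shown earlier; whatever the check does, it must complete well within the 2-second timeout:

```python
# /ping handler that reports healthy only once model artifacts are present.
# The directory check is a placeholder for your framework's real load step.
import os
import flask

app = flask.Flask(__name__)
MODEL_DIR = "/opt/ml/model"
_model_ready = False

@app.route("/ping", methods=["GET"])
def ping():
    global _model_ready
    if not _model_ready:
        # Deeper check: confirm SageMaker extracted artifacts into /opt/ml/model.
        _model_ready = os.path.isdir(MODEL_DIR) and len(os.listdir(MODEL_DIR)) > 0
    return flask.Response(status=200 if _model_ready else 500)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```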