Model deployment with Triton Inference Server
The SageMaker Triton containers help you deploy Triton Inference Server
on the SageMaker Hosting platform to serve trained models in production. They support the
different modes in which SageMaker operates. For a list of the Triton Inference Server
containers available on SageMaker, see
NVIDIA Triton Inference Containers (SM support only).
For end-to-end notebook examples, we recommend taking a look at the
amazon-sagemaker-examples repository.
Hosting modes
The following SageMaker Hosting modes are supported by Triton containers:
- Single model endpoints
  - This is SageMaker's default mode of operation. In this mode, the Triton container can load a single model or a single ensemble model.
  - The name of the model must be passed as a property of the container environment, which is part of the `CreateModel` SageMaker API call. The environment variable used to pass in the model name is `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` (see the boto3 sketch after this list).
- Single model endpoints with ensemble
  - Triton Inference Server supports ensembles, which are pipelines, or DAGs (directed acyclic graphs), of models. While an ensemble technically comprises multiple models, in the default single model endpoint mode, SageMaker can treat the ensemble proper (the meta-model that represents the pipeline) as the main model to load, and can subsequently load the associated models.
  - The ensemble proper's model name must be used to load the model. It must be passed as a property of the container environment, which is part of the `CreateModel` SageMaker API call. The environment variable used to pass in the model name is `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`.
- Multi-model endpoints
  - In this mode, SageMaker can serve multiple models on a single endpoint. You can use this mode by specifying the environment variable `'MultiModel': true` as a property of the container environment, which is part of the `CreateModel` SageMaker API call.
  - By default, no model is loaded when the instance starts. To run an inference request against a particular model, specify the corresponding model's `*.tar.gz` file as an argument to the `TargetModel` property of the `InvokeEndpoint` SageMaker API call.
- Multi-model endpoints with ensemble
  - In this mode, SageMaker functions as described for multi-model endpoints. However, the SageMaker Triton container can load multiple ensemble models, meaning that multiple model pipelines can run on the same instance. SageMaker treats every ensemble as one model, and the ensemble proper of each model can be invoked by specifying the corresponding `*.tar.gz` archive as the `TargetModel`. For better memory management during dynamic memory `LOAD` and `UNLOAD`, we recommend that you keep the ensemble size small.
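As a point of reference, the following is a minimal boto3 sketch of the two API calls mentioned above: `CreateModel` with the `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` environment variable for single model (or ensemble) mode, and `InvokeEndpoint` with `TargetModel` for an existing multi-model endpoint. The image URI, S3 location, role, model and endpoint names, and the `mymodel.tar.gz` archive are placeholders for this sketch, not values defined by SageMaker.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Single model (default) mode: tell the Triton container which model
# (or ensemble proper) in the model repository to load.
sm.create_model(
    ModelName="triton-single-model",  # placeholder
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>",  # placeholder
        "ModelDataUrl": "s3://amzn-s3-demo-bucket/triton/model.tar.gz",  # placeholder
        "Environment": {
            "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "resnet50",  # model name from config.pbtxt
        },
    },
)

# Multi-model endpoint mode: no model is loaded at startup, so each request
# names the *.tar.gz archive to serve via TargetModel.
payload = b'{"inputs": [...]}'  # Triton inference request (see the payload section below)
response = runtime.invoke_endpoint(
    EndpointName="triton-mme-endpoint",  # placeholder multi-model endpoint
    TargetModel="mymodel.tar.gz",        # hypothetical archive under the endpoint's S3 prefix
    ContentType="application/json",
    Body=payload,
)
```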
Inference payload types
Triton supports two methods of sending an inference payload over the
network: `json` and `binary+json` (binary-encoded JSON). In both cases, the JSON
payload includes the datatype, the shape, and the actual inference
request tensor. The request tensor must be a binary tensor.
With the `binary+json` format, you must specify the length of the
request metadata in the header to allow Triton to correctly parse the binary
payload. In the SageMaker Triton container, this is done using a custom
`Content-Type` header:
`application/vnd.sagemaker-triton.binary+json;json-header-size={}`.
This is different from using the `Inference-Header-Content-Length` header
on a stand-alone Triton Inference Server, because custom headers are not allowed in
SageMaker.
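For illustration, the following sketch builds a `binary+json` request by hand and sends it with `InvokeEndpoint`, setting the custom `Content-Type` shown above. The endpoint name and the tensor names, shape, and datatype (`INPUT0`, `OUTPUT0`, FP32) are assumptions for this example and must match your model's `config.pbtxt`.

```python
import json

import boto3
import numpy as np

# Hypothetical request tensor; name, shape, and datatype must match config.pbtxt.
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_bytes = input_data.tobytes()

# JSON request metadata; binary_data_size tells Triton how many raw bytes
# for this input follow the JSON header in the request body.
json_header = json.dumps({
    "inputs": [{
        "name": "INPUT0",
        "shape": list(input_data.shape),
        "datatype": "FP32",
        "parameters": {"binary_data_size": len(input_bytes)},
    }],
    "outputs": [{"name": "OUTPUT0", "parameters": {"binary_data": True}}],
}).encode("utf-8")

# The body is the JSON header followed immediately by the raw tensor bytes.
payload = json_header + input_bytes

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="triton-endpoint",  # placeholder
    ContentType=(
        "application/vnd.sagemaker-triton.binary+json;"
        f"json-header-size={len(json_header)}"
    ),
    Body=payload,
)

# The response Content-Type also carries a json-header-size, which is needed
# to split the JSON metadata from any binary output data in the response body.
print(response["ContentType"])
```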
Using config.pbtxt to set the model config
For Triton Inference Server on SageMaker, each model must include a
`config.pbtxt` file that specifies, at a minimum, the following
configurations for the model (a sample `config.pbtxt` follows this list):
- `name`: While this is optional for models running outside of SageMaker, we recommend that you always provide a name for models that run in Triton on SageMaker.
- `platform` and/or `backend`: Setting a backend is essential to specify the type of the model. Some backends have further classification, such as `tensorflow_savedmodel` or `tensorflow_graphdef`. Such options can be specified as part of the `platform` key in addition to the `backend` key. The most common backends are `tensorrt`, `onnxruntime`, `tensorflow`, `pytorch`, `python`, `dali`, `fil`, and `openvino`.
- `input`: Specify three attributes for the input: `name`, `data_type`, and `dims` (the shape).
- `output`: Specify three attributes for the output: `name`, `data_type`, and `dims` (the shape).
- `max_batch_size`: Set the batch size to a value greater than or equal to 1 that indicates the maximum batch size that Triton should use with the model.

For more details on configuring `config.pbtxt`, see Triton's GitHub repository.

- `instance_group`: Instance groups specify the number and placement of instances for a given model. They have the attributes `count`, `kind`, and `gpus` (used when `kind` is `KIND_GPU`). The `count` attribute is equivalent to the number of workers. For regular model serving, each worker has its own copy of the model. Similarly, in Triton, `count` specifies the number of model copies per device. For example, if the `instance_group` kind is `KIND_CPU`, then the CPU has `count` model copies.

  Note: On a GPU instance, the `instance_group` configuration applies per GPU device. For example, `count` model copies are placed on each GPU device unless you explicitly specify which GPU devices should load the model.

- `dynamic_batching` and `sequence_batching`: Dynamic batching is used for stateless models, and sequence batching is used for stateful models (where you want to route a request to the same model instance every time). Batching schedulers enable a per-model queue, which helps increase throughput, depending on the batching configuration.
- `ensemble`: An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models. It can be configured by specifying `platform` as `ensemble`. The ensemble configuration is just a representation of the model pipeline. On SageMaker, all the models under an ensemble are treated as dependents of the ensemble model and are counted as a single model for SageMaker metrics, such as `LoadedModelCount`.
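To make these settings concrete, here is a minimal sample `config.pbtxt` for a hypothetical ONNX image classification model. The model name, tensor names, shapes, and batching values are illustrative assumptions, not required values.

```
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 8

# With max_batch_size > 0, dims describe a single request item; the batch
# dimension is implicit.
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# One model copy per GPU device (see the note about instance groups above).
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

# Stateless model, so dynamic batching is used to form server-side batches.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```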
Publishing default Triton metrics to Amazon CloudWatch
The NVIDIA Triton Inference Container exposes metrics at port 8002 (configurable)
for the different models and GPUs that are utilized in the Triton Inference Server.
For full details of the default metrics that are available, see the GitHub page for
the Triton Inference Server metrics.
Starting with version v23.07, the SageMaker Triton container supports publishing these metrics to Amazon CloudWatch by specifying a few environment variables. To scrape the Prometheus metrics, the SageMaker Triton container leverages the Amazon CloudWatch agent.
The required environment variables that you must specify to collect metrics are as follows:
| Environment variable | Description | Example value |
| --- | --- | --- |
|  | Specify this option to allow Triton to publish metrics to its Prometheus endpoint. | "true" |
|  | Specify this option to start the pre-checks necessary to publish metrics to Amazon CloudWatch. | "true" |
|  | Specify this option to point to the log group to which metrics are written. | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesTest" |
|  | Specify this option to point to the metric namespace where you want to see and plot the metrics. | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesPublicTest" |
|  | Specify this as 8002, or any other port. If SageMaker has not blocked the specified port, it is used. Otherwise, another non-blocked port is chosen automatically. | "8002" |
When publishing metrics with Triton on SageMaker, keep in mind the following limitations:
- While you can generate custom metrics through the C API and Python backend (v23.05 onwards), these are currently not supported for publishing to Amazon CloudWatch.
- In SageMaker multi-model endpoints (MME) mode, Triton runs in an environment that requires model namespacing to be enabled, because each model (except ensemble models) is treated as if it is in its own model repository. Currently, this creates a limitation for metrics. When model namespacing is enabled, Triton does not distinguish the metrics between two models with the same name that belong to different ensembles. As a workaround, make sure that every model being deployed has a unique name. This also makes it easier to look up your metrics in CloudWatch.
Environment variables
The following table lists the supported environment variables for Triton on SageMaker.
| Environment variable | Description | Type | Possible values |
| --- | --- | --- | --- |
|  | Allows Triton to operate in SageMaker multi-model endpoints mode. | Boolean |  |
| `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` | Specify the model to be loaded in the SageMaker single model (default) mode. For ensemble mode, specify the name of the ensemble proper. | String |  |
|  |  | String |  |
|  | In the SageMaker Triton container, this is set to | Boolean |  |
|  | While on SageMaker, the default port is 8080. You can customize to a different port in multi-container scenarios. | String |  |
|  | This is set by the SageMaker platform when using multi-container mode. | String |  |
|  | While SageMaker doesn't currently support GRPC, if you're using Triton in front of a custom reverse proxy, you may choose to enable GRPC. | Boolean |  |
|  | The default port for GRPC is 8001, but you can change it. | String |  |
|  | You can set the number of default HTTP request handler threads. | String |  |
|  |  | Boolean |  |
|  |  | Boolean |  |
|  |  | Boolean |  |
|  |  | Boolean |  |
|  | Specify the shm size for the Python backend, in bytes. The default value is 16 MB but can be increased. | String |  |
|  | Specify the shm growth size for the Python backend, in bytes. The default value is 1 MB but can be increased to allow greater increments. | String |  |
|  | The default value is | String |  |
|  | Restrict the maximum GPU memory percentage that is used for model loading, allowing the remainder to be used for inference requests. | String |  |
|  |  | Boolean |  |
|  | The default port is 8002. | String |  |
|  |  | Boolean |  |
|  | Required if you've enabled metrics publishing to CloudWatch. | String |  |
|  | Required if you've enabled metrics publishing to CloudWatch. | String |  |
|  | Appends any additional arguments when starting the Triton Server. | String |  |