Model deployment with Triton Inference Server
The SageMaker Triton containers help you deploy Triton Inference Server
on the SageMaker Hosting platform to serve trained models in production. They support the
different modes in which SageMaker operates. For a list of the Triton Inference Server
containers available on SageMaker, see
NVIDIA Triton Inference Containers (SM support only).
For end-to-end notebook examples, we recommend taking a look at the
amazon-sagemaker-examples repository.
Hosting modes
The following SageMaker Hosting modes are supported by Triton containers:
- Single model endpoints
  - This is SageMaker's default mode of operation. In this mode, the Triton container can load a single model or a single ensemble model.
  - The name of the model must be passed as a property of the container environment, which is part of the `CreateModel` SageMaker API call. The environment variable used to pass in the model name is `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` (see the boto3 sketch after this list).
- Single model endpoints with ensemble
  - Triton Inference Server supports ensembles, which are pipelines, or DAGs (directed acyclic graphs), of models. While an ensemble technically comprises multiple models, in the default single model endpoint mode, SageMaker can treat the ensemble proper (the meta-model that represents the pipeline) as the main model to load, and can subsequently load the associated models.
  - The ensemble proper's model name must be used to load the model. It must be passed as a property of the container environment, which is part of the `CreateModel` SageMaker API call. The environment variable used to pass in the model name is `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`.
- Multi-model endpoints
  - In this mode, SageMaker can serve multiple models on a single endpoint. You can use this mode by specifying the environment variable `'MultiModel': true` as a property of the container environment, which is part of the `CreateModel` SageMaker API call.
  - By default, no model is loaded when the instance starts. To run an inference request against a particular model, specify the corresponding model's `*.tar.gz` file as an argument to the `TargetModel` property of the `InvokeEndpoint` SageMaker API call.
- Multi-model endpoints with ensemble
  - In this mode, SageMaker functions as described for multi-model endpoints. However, the SageMaker Triton container can load multiple ensemble models, meaning that multiple model pipelines can run on the same instance. SageMaker treats every ensemble as one model, and the ensemble proper of each model can be invoked by specifying the corresponding `*.tar.gz` archive as the `TargetModel`. For better memory management during dynamic memory `LOAD` and `UNLOAD`, we recommend that you keep the ensemble size small.
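As a point of reference, the following is a minimal boto3 sketch of the two API calls mentioned above: `CreateModel` with the `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` environment variable for single model (or ensemble) mode, and `InvokeEndpoint` with `TargetModel` for an existing multi-model endpoint. The image URI, S3 location, role, model and endpoint names, and the `mymodel.tar.gz` archive are placeholders for this sketch, not values defined by SageMaker.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Single model (default) mode: tell the Triton container which model
# (or ensemble proper) in the model repository to load.
sm.create_model(
    ModelName="triton-single-model",  # placeholder
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>",  # placeholder
        "ModelDataUrl": "s3://amzn-s3-demo-bucket/triton/model.tar.gz",  # placeholder
        "Environment": {
            "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "resnet50",  # model name from config.pbtxt
        },
    },
)

# Multi-model endpoint mode: no model is loaded at startup, so each request
# names the *.tar.gz archive to serve via TargetModel.
payload = b'{"inputs": [...]}'  # Triton inference request (see the payload section below)
response = runtime.invoke_endpoint(
    EndpointName="triton-mme-endpoint",  # placeholder multi-model endpoint
    TargetModel="mymodel.tar.gz",        # hypothetical archive under the endpoint's S3 prefix
    ContentType="application/json",
    Body=payload,
)
```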
Inference payload types
Triton supports two methods of sending an inference payload over the
network: `json` and `binary+json` (binary-encoded JSON). In both cases, the JSON
payload includes the datatype, the shape, and the actual inference
request tensor. The request tensor must be a binary tensor.
With the `binary+json` format, you must specify the length of the
request metadata in the header to allow Triton to correctly parse the binary
payload. In the SageMaker Triton container, this is done using a custom
`Content-Type` header:
`application/vnd.sagemaker-triton.binary+json;json-header-size={}`.
This is different from using the `Inference-Header-Content-Length` header
on a stand-alone Triton Inference Server, because custom headers are not allowed in
SageMaker.
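For illustration, the following sketch builds a `binary+json` request by hand and sends it with `InvokeEndpoint`, setting the custom `Content-Type` shown above. The endpoint name and the tensor names, shape, and datatype (`INPUT0`, `OUTPUT0`, FP32) are assumptions for this example and must match your model's `config.pbtxt`.

```python
import json

import boto3
import numpy as np

# Hypothetical request tensor; name, shape, and datatype must match config.pbtxt.
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_bytes = input_data.tobytes()

# JSON request metadata; binary_data_size tells Triton how many raw bytes
# for this input follow the JSON header in the request body.
json_header = json.dumps({
    "inputs": [{
        "name": "INPUT0",
        "shape": list(input_data.shape),
        "datatype": "FP32",
        "parameters": {"binary_data_size": len(input_bytes)},
    }],
    "outputs": [{"name": "OUTPUT0", "parameters": {"binary_data": True}}],
}).encode("utf-8")

# The body is the JSON header followed immediately by the raw tensor bytes.
payload = json_header + input_bytes

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="triton-endpoint",  # placeholder
    ContentType=(
        "application/vnd.sagemaker-triton.binary+json;"
        f"json-header-size={len(json_header)}"
    ),
    Body=payload,
)

# The response Content-Type also carries a json-header-size, which is needed
# to split the JSON metadata from any binary output data in the response body.
print(response["ContentType"])
```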
Using config.pbtxt to set the model config
For Triton Inference Server on SageMaker, each model must include a
`config.pbtxt` file that specifies, at a minimum, the following
configurations for the model (a sample `config.pbtxt` follows this list):
- `name`: While this is optional for models running outside of SageMaker, we recommend that you always provide a name for models that run in Triton on SageMaker.
- `platform` and/or `backend`: Setting a backend is essential to specify the type of the model. Some backends have further classification, such as `tensorflow_savedmodel` or `tensorflow_graphdef`. Such options can be specified as part of the `platform` key in addition to the `backend` key. The most common backends are `tensorrt`, `onnxruntime`, `tensorflow`, `pytorch`, `python`, `dali`, `fil`, and `openvino`.
- `input`: Specify three attributes for the input: `name`, `data_type`, and `dims` (the shape).
- `output`: Specify three attributes for the output: `name`, `data_type`, and `dims` (the shape).
- `max_batch_size`: Set the batch size to a value greater than or equal to 1 that indicates the maximum batch size that Triton should use with the model.

For more details on configuring `config.pbtxt`, see Triton's GitHub repository.

- `instance_group`: Instance groups specify the number and placement of instances for a given model. They have the attributes `count`, `kind`, and `gpus` (used when `kind` is `KIND_GPU`). The `count` attribute is equivalent to the number of workers. For regular model serving, each worker has its own copy of the model. Similarly, in Triton, `count` specifies the number of model copies per device. For example, if the `instance_group` kind is `KIND_CPU`, then the CPU has `count` model copies.

  Note: On a GPU instance, the `instance_group` configuration applies per GPU device. For example, `count` model copies are placed on each GPU device unless you explicitly specify which GPU devices should load the model.

- `dynamic_batching` and `sequence_batching`: Dynamic batching is used for stateless models, and sequence batching is used for stateful models (where you want to route a request to the same model instance every time). Batching schedulers enable a per-model queue, which helps increase throughput, depending on the batching configuration.
- `ensemble`: An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models. It can be configured by specifying `platform` as `ensemble`. The ensemble configuration is just a representation of the model pipeline. On SageMaker, all the models under an ensemble are treated as dependents of the ensemble model and are counted as a single model for SageMaker metrics, such as `LoadedModelCount`.
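To make these settings concrete, here is a minimal sample `config.pbtxt` for a hypothetical ONNX image classification model. The model name, tensor names, shapes, and batching values are illustrative assumptions, not required values.

```
name: "resnet50"
backend: "onnxruntime"
max_batch_size: 8

# With max_batch_size > 0, dims describe a single request item; the batch
# dimension is implicit.
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# One model copy per GPU device (see the note about instance groups above).
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

# Stateless model, so dynamic batching is used to form server-side batches.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```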
Publishing default Triton metrics to Amazon CloudWatch
The NVIDIA Triton Inference Container exposes metrics at port 8002 (configurable)
for the different models and GPUs that are utilized in the Triton Inference Server.
For full details of the default metrics that are available, see the GitHub page for
the Triton Inference Server metrics.
Starting with version v23.07, the SageMaker Triton container supports publishing these metrics to Amazon CloudWatch by specifying a few environment variables. To scrape the Prometheus metrics, the SageMaker Triton container leverages the Amazon CloudWatch agent.
The required environment variables that you must specify to collect metrics are as follows:
| Environment variable | Description | Example value |
| --- | --- | --- |
|  | Specify this option to allow Triton to publish metrics to its Prometheus endpoint. | "true" |
|  | Specify this option to start the pre-checks necessary to publish metrics to Amazon CloudWatch. | "true" |
|  | Specify this option to point to the log group to which metrics are written. | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesTest" |
|  | Specify this option to point to the metric namespace where you want to see and plot the metrics. | "/aws/SageMaker/Endpoints/TritonMetrics/SageMakerTwoEnsemblesPublicTest" |
|  | Specify this as 8002, or any other port. If SageMaker has not blocked the specified port, it is used. Otherwise, another non-blocked port is chosen automatically. | "8002" |
When publishing metrics with Triton on SageMaker, keep in mind the following limitations:
- While you can generate custom metrics through the C API and Python backend (v23.05 onwards), these are currently not supported for publishing to Amazon CloudWatch.
- In SageMaker multi-model endpoints (MME) mode, Triton runs in an environment that requires model namespacing to be enabled, because each model (except ensemble models) is treated as if it is in its own model repository. Currently, this creates a limitation for metrics. When model namespacing is enabled, Triton does not distinguish the metrics between two models with the same name that belong to different ensembles. As a workaround, make sure that every model being deployed has a unique name. This also makes it easier to look up your metrics in CloudWatch.
Environment variables
The following table lists the supported environment variables for Triton on SageMaker.
| Environment variable | Description | Type | Possible values |
| --- | --- | --- | --- |
|  | Allows Triton to operate in SageMaker multi-model endpoints mode. | Boolean |  |
| `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` | Specify the model to be loaded in the SageMaker single model (default) mode. For ensemble mode, specify the name of the ensemble proper. | String |  |
|  |  | String |  |
|  | In the SageMaker Triton container, this is set to | Boolean |  |
|  | While on SageMaker, the default port is 8080. You can customize to a different port in multi-container scenarios. | String |  |
|  | This is set by the SageMaker platform when using multi-container mode. | String |  |
|  | While SageMaker doesn't currently support GRPC, if you're using Triton in front of a custom reverse proxy, you may choose to enable GRPC. | Boolean |  |
|  | The default port for GRPC is 8001, but you can change it. | String |  |
|  | You can set the number of default HTTP request handler threads. | String |  |
|  |  | Boolean |  |
|  |  | Boolean |  |
|  |  | Boolean |  |
|  |  | Boolean |  |
|  | Specify the shm size for the Python backend, in bytes. The default value is 16 MB but can be increased. | String |  |
|  | Specify the shm growth size for the Python backend, in bytes. The default value is 1 MB but can be increased to allow greater increments. | String |  |
|  | The default value is | String |  |
|  | Restrict the maximum GPU memory percentage that is used for model loading, allowing the remainder to be used for inference requests. | String |  |
|  |  | Boolean |  |
|  | The default port is 8002. | String |  |
|  |  | Boolean |  |
|  | Required if you've enabled metrics publishing to CloudWatch. | String |  |
|  | Required if you've enabled metrics publishing to CloudWatch. | String |  |
|  | Appends any additional arguments when starting the Triton Server. | String |  |