Supported features

Amazon SageMaker AI offers the following four options to deploy models for inference.

Real-time inference for inference workloads with real-time, interactive, low latency requirements.
Batch transform for offline inference with large datasets.
Asynchronous inference for near-real-time inference with large inputs that require longer preprocessing times.
Serverless inference for inference workloads that have idle periods between traffic spurts.

The following table summarizes the core platform features that are supported by each inference option. It does not show features that can be provided by frameworks, custom Docker containers, or through chaining different AWS services.

Feature	Real-time inference	Batch transform	Asynchronous inference	Serverless inference	Docker containers
Autoscaling support	✓	N/A	✓	✓	N/A
GPU support	✓¹	✓¹	✓¹		1P, pre-built, BYOC
Single model	✓	✓	✓	✓	N/A
Multi-model endpoint	✓				k-NN, XGBoost, Linear Learner, RCF, TensorFlow, Apache MXNet, PyTorch, scikit-learn ²
Multi-container endpoint	✓				1P, pre-built, Extend pre-built, BYOC
Serial inference pipeline	✓	✓			1P, pre-built, Extend pre-built, BYOC
Inference Recommender	✓				1P, pre-built, Extend pre-built, BYOC
Private link support	✓	✓	✓		N/A
Data capture/Model monitor support	✓	✓			N/A
DLCs supported	1P, pre-built, Extend pre-built, BYOC	1P, pre-built, Extend pre-built, BYOC	1P, pre-built, Extend pre-built, BYOC	1P, pre-built, Extend pre-built, BYOC	N/A
Protocols supported	HTTP(S)	HTTP(S)	HTTP(S)	HTTP(S)	N/A
Payload size	< 6 MB	≤ 100 MB	≤ 1 GB	≤ 4 MB
HTTP chunked encoding	Framework dependent, 1P not supported	N/A	Framework dependent, 1P not supported	Framework dependent, 1P not supported	N/A
Request timeout	< 60 seconds	Days	< 1 hour	< 60 seconds	N/A
Deployment guardrails: blue/green deployments	✓	N/A	✓		N/A
Deployment guardrails: rolling deployments	✓	N/A	✓		N/A
Shadow testing	✓				N/A
Scale to zero		N/A	✓	✓	N/A
Market place model packages support	✓	✓			N/A
Virtual private cloud support	✓	✓	✓		N/A
Multiple production variants support	✓				N/A
Network isolation	✓		✓		N/A
Model parallel serving support	✓³	✓	✓³		✓³
Volume encryption	✓	✓	✓	✓	N/A
Customer AWS KMS	✓	✓	✓	✓	N/A
d instance support	✓	✓	✓		N/A
inf1 support	✓				✓

With SageMaker AI, you can deploy a single model, or multiple models behind a single inference endpoint for real-time inference. The following table summarizes the core features supported by various hosting options that come with real-time inference.

Feature	Single model endpoints	Multi-model endpoints	Serial inference pipeline	Multi-container endpoints
Autoscaling support	✓	✓	✓	✓
GPU support	✓¹	✓	✓
Single model	✓	✓	✓	✓
Multi-model endpoints		✓	✓	N/A
Multi-container endpoints	✓			N/A
Serial inference pipeline	✓	✓	N/A
Inference Recommender	✓
Private link support	✓	✓	✓	✓
Data capture/Model monitor support	✓	N/A	N/A	N/A
DLCs supported	1P, pre-built, Extend pre-built, BYOC	k-NN, XGBoost, Linear Learner, RCF, TensorFlow, Apache MXNet, PyTorch, scikit-learn ²	1P, pre-built, Extend pre-built, BYOC	1P, pre-built, Extend pre-built, BYOC
Protocols supported	HTTP(S)	HTTP(S)	HTTP(S)	HTTP(S)
Payload size	< 6 MB	< 6 MB	< 6 MB	< 6 MB
Request timeout	< 60 seconds	< 60 seconds	< 60 seconds	< 60 seconds
Deployment guardrails: blue/green deployments	✓	✓	✓	✓
Deployment guardrails: rolling deployments	✓	✓	✓	✓
Shadow testing	✓
Market place model packages support	✓
Virtual private cloud support	✓	✓	✓	✓
Multiple production variants support	✓		✓	✓
Network isolation	✓	✓	✓	✓
Model parallel serving support	✓ ³		✓ ³
Volume encryption	✓	✓	✓	✓
Customer AWS KMS	✓	✓	✓	✓
d instance support	✓	✓	✓	✓
inf1 support	✓

¹ Availability of the Amazon EC2 instance types depends on the AWS Region. For availability of instances specific to AWS, see Amazon SageMaker AI Pricing.

² To use any other framework or algorithm, use the SageMaker AI Inference toolkit to build a container that supports multi-model endpoints.

³ With SageMaker AI, you can deploy large models (up to 500 GB) for inference. You can configure the container health check and download timeout quotas, up to 60 minutes. This will allow you to have more time to download and load your model and associated resources. For more information, see SageMaker AI endpoint parameters for large model inference. You can use SageMaker AI compatible large model Inference containers. You can also use third-party model parallelization libraries, such as Triton with FasterTransformer and DeepSpeed. You have to ensure that they are compatible with SageMaker AI.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Updating containers for the NVIDIA Container Toolkit

Resources