Deploy models for inference - Amazon SageMaker AI

With Amazon SageMaker AI, you can start getting predictions, or inferences, from your trained machine learning models. SageMaker AI provides a broad selection of ML infrastructure and model deployment options to help meet your ML inference needs. With SageMaker AI Inference, you can scale your model deployments, manage models more effectively in production, and reduce operational burden. SageMaker AI provides various inference options, such as real-time endpoints for low-latency inference, serverless endpoints for fully managed infrastructure and auto scaling, and asynchronous endpoints for batches of requests. By choosing the appropriate inference option for your use case, you can ensure efficient model deployment and inference.

Choosing a feature

There are several use cases for deploying ML models with SageMaker AI. This section describes those use cases, as well as the SageMaker AI feature we recommend for each use case.

Use cases

The following are the main use cases for deploying ML models with SageMaker AI.

  • Use case 1: Deploy a machine learning model in a low-code or no-code environment. For beginners or those new to SageMaker AI, you can deploy pre-trained models using Amazon SageMaker JumpStart through the Amazon SageMaker Studio interface, without the need for complex configurations.

  • Use case 2: Use code to deploy machine learning models with more flexibility and control. Experienced ML practitioners can deploy their own models with customized settings for their application needs using the ModelBuilder class in the SageMaker AI Python SDK, which provides fine-grained control over various settings, such as instance types, network isolation, and resource allocation.

  • Use case 3: Deploy machine learning models at scale. For advanced users and organizations who want to manage models at scale in production, use the AWS SDK for Python (Boto3) and AWS CloudFormation along with your desired Infrastructure as Code (IaC) and CI/CD tools to provision resources and automate resource management.

The following describes key considerations and tradeoffs for the SageMaker AI feature recommended for each use case.

Use case 1: Use JumpStart in Studio to accelerate your foundation model deployment.

  • Description: Use the Studio UI to deploy pre-trained models from a catalog to pre-configured inference endpoints. This option is ideal for citizen data scientists, or for anyone who wants to deploy a model without configuring complex settings.

  • Optimized for: Fast and streamlined deployment of popular open source models.

  • Considerations: Lack of customization for container settings and specific application needs.

  • Recommended environment: A SageMaker AI domain.

Use case 2: Deploy models using ModelBuilder from the SageMaker Python SDK.

  • Description: Use the ModelBuilder class from the SageMaker Python SDK to deploy your own model and configure deployment settings. This option is ideal for experienced data scientists, or for anyone who has their own model to deploy and requires fine-grained control.

  • Optimized for: Deploying your own models.

  • Considerations: No UI; requires that you're comfortable developing and maintaining Python code.

  • Recommended environment: A Python development environment configured with your AWS credentials and the SageMaker Python SDK installed, or a SageMaker AI IDE such as SageMaker JupyterLab.

Use case 3: Deploy and manage models at scale with AWS CloudFormation.

  • Description: Use AWS CloudFormation and Infrastructure as Code (IaC) for programmatic control and automation when deploying and managing SageMaker AI models. This option is ideal for advanced users who require consistent and repeatable deployments.

  • Optimized for: Ongoing management of models in production.

  • Considerations: Requires infrastructure management and organizational resources, and familiarity with the AWS SDK for Python (Boto3) or AWS CloudFormation templates.

  • Recommended environment: The AWS CLI, a local development environment, and Infrastructure as Code (IaC) and CI/CD tools.
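For use case 3, a CloudFormation template along these lines could express the same deployment as IaC. This is a minimal sketch; the container image URI, model artifact path, role ARN, and instance type are placeholders you would replace with your own values.

```yaml
# Sketch: a SageMaker real-time endpoint defined as Infrastructure as Code.
# All ARNs, URIs, and names below are illustrative placeholders.
Resources:
  MyModel:
    Type: AWS::SageMaker::Model
    Properties:
      ExecutionRoleArn: arn:aws:iam::111122223333:role/SageMakerExecutionRole
      PrimaryContainer:
        Image: 111122223333.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:latest
        ModelDataUrl: s3://my-bucket/model.tar.gz

  MyEndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      ProductionVariants:
        - ModelName: !GetAtt MyModel.ModelName
          VariantName: AllTraffic
          InstanceType: ml.m5.large
          InitialInstanceCount: 1

  MyEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointConfigName: !GetAtt MyEndpointConfig.EndpointConfigName
```

Because the model, endpoint configuration, and endpoint are declared together, redeploying the stack gives you the consistent, repeatable deployments this use case calls for.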
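For use case 2, a minimal sketch of a ModelBuilder deployment might look like the following. It assumes the SageMaker Python SDK is installed and your AWS credentials are configured; the model identifier, role ARN, sample payloads, and instance type are illustrative placeholders, not recommendations.

```python
def deploy_with_model_builder():
    """Sketch: deploy a model with ModelBuilder (use case 2).

    Assumes the sagemaker package is installed and AWS credentials are
    configured. Nothing here is a required value; all inputs are examples.
    """
    # Imported inside the function so the sketch can be read (and the
    # function defined) without the sagemaker package installed.
    from sagemaker.serve import ModelBuilder, SchemaBuilder

    # SchemaBuilder infers request/response serialization from sample payloads.
    sample_input = {"inputs": "What is the capital of France?"}
    sample_output = [{"generated_text": "Paris"}]

    model_builder = ModelBuilder(
        model="my-model",  # hypothetical: a model object, path, or JumpStart model ID
        schema_builder=SchemaBuilder(sample_input, sample_output),
        role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    )

    model = model_builder.build()  # packages the model and resolves a container
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.xlarge",  # choose per your latency and cost needs
    )
    return predictor
```

The fine-grained control mentioned above comes from these keyword arguments: you choose the instance type, execution role, and serialization behavior rather than accepting catalog defaults.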

Additional options

SageMaker AI provides different options for your inference use cases, giving you choice over the technical breadth and depth of your deployments:

  • Deploying a model to an endpoint. When deploying your model, consider the following options:

    • Real-time inference. Real-time inference is ideal for inference workloads with interactive, low-latency requirements.

    • Deploy models with Amazon SageMaker Serverless Inference. Use Serverless Inference to deploy models without configuring or managing any of the underlying infrastructure. This option is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts.

    • Asynchronous inference. Asynchronous Inference queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1 GB), long processing times (up to one hour), and near real-time latency requirements.

  • Cost optimization. To optimize your inference costs, consider the following options:

    • Model performance optimization with SageMaker Neo. Use SageMaker Neo to compile and optimize your machine learning models for better performance and efficiency on target hardware, such as AWS Inferentia chips, which helps you minimize compute costs.

    • Automatic scaling of Amazon SageMaker AI models. Use autoscaling to dynamically adjust the compute resources for your endpoints based on incoming traffic patterns, which helps you optimize costs by only paying for the resources you're using at a given time.
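As one illustration of the real-time option above, invoking a deployed endpoint is a single runtime API call. This is a sketch assuming Boto3 is installed and credentials are configured; the endpoint name and JSON payload shape are illustrative and depend on how your model serializes requests.

```python
def invoke_realtime_endpoint(endpoint_name: str) -> str:
    """Sketch: send one low-latency request to a real-time endpoint.

    Assumes boto3 is installed and AWS credentials are configured.
    The payload format is an example; match it to your model's schema.
    """
    import json
    import boto3  # imported here so the sketch reads without boto3 installed

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": "example payload"}),  # illustrative body
    )
    # The response body is a stream; decode it to text for this example.
    return response["Body"].read().decode("utf-8")
```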
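The autoscaling option above is configured through Application Auto Scaling rather than SageMaker AI itself. The following sketch registers an endpoint variant as a scalable target and attaches a target-tracking policy; the policy name, capacity bounds, and target value are illustrative assumptions, not recommended settings.

```python
def enable_invocation_autoscaling(endpoint_name: str, variant: str = "AllTraffic"):
    """Sketch: target-tracking autoscaling for a SageMaker endpoint variant.

    Assumes boto3 is installed and AWS credentials are configured.
    Capacity bounds and the target value are examples only.
    """
    import boto3  # imported here so the sketch reads without boto3 installed

    autoscaling = boto3.client("application-autoscaling")
    resource_id = f"endpoint/{endpoint_name}/variant/{variant}"

    # Register the variant's instance count as a scalable target.
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,   # example lower bound
        MaxCapacity=4,   # example upper bound
    )

    # Track invocations per instance, scaling in and out to hold the target.
    autoscaling.put_scaling_policy(
        PolicyName="InvocationsTargetTracking",  # hypothetical policy name
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,  # example invocations per instance
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
```

With a policy like this in place, the endpoint adds instances as traffic rises and removes them as it falls, so you pay only for the capacity in use.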