Select your cookie preferences

We use essential cookies and similar tools that are necessary to provide our site and services. We use performance cookies to collect anonymous statistics, so we can understand how customers use our site and make improvements. Essential cookies cannot be deactivated, but you can choose “Customize” or “Decline” to decline performance cookies.

If you agree, AWS and approved third parties will also use cookies to provide useful site features, remember your preferences, and display relevant content, including relevant advertising. To accept or decline all non-essential cookies, choose “Accept” or “Decline.” To make more detailed choices, choose “Customize.”

Deploy models for inference

Focus mode
Deploy models for inference - Amazon SageMaker AI

With Amazon SageMaker AI, you can start getting predictions, or inferences, from your trained machine learning models. SageMaker AI provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs. With SageMaker AI Inference, you can scale your model deployment, manage models more effectively in production, and reduce operational burden. SageMaker AI provides you with various inference options, such as real-time endpoints for getting low latency inference, serverless endpoints for fully managed infrastructure and auto-scaling, and asynchronous endpoints for batches of requests. By leveraging the appropriate inference option for your use case, you can ensure efficient and model deployment and inference.

Choosing a feature

There are several use cases for deploying ML models with SageMaker AI. This section describes those use cases, as well as the SageMaker AI feature we recommend for each use case.

Use cases

The following are the main uses cases for deploying ML models with SageMaker AI.

  • Use case 1: Deploy a machine learning model in a low-code or no-code environment. For beginners or those new to SageMaker AI, you can deploy pre-trained models using Amazon SageMaker JumpStart through the Amazon SageMaker Studio interface, without the need for complex configurations.

  • Use case 2: Use code to deploy machine learning models with more flexibility and control. Experienced ML practitioners can deploy their own models with customized settings for their application needs using the ModelBuilder class in the SageMaker AI Python SDK, which provides fine-grained control over various settings, such as instance types, network isolation, and resource allocation.

  • Use case 3: Deploy machine learning models at scale. For advanced users and organizations who want to manage models at scale in production, use the AWS SDK for Python (Boto3) and AWS CloudFormation along with your desired Infrastructure as Code (IaC) and CI/CD tools to provision resources and automate resource management.

The following table describes key considerations and tradeoffs for SageMaker AI features corresponding with each use case.

Use case 1 Use case 2 Use case 3
SageMaker AI feature Use JumpStart in Studio to accelerate your foundational model deployment. Deploy models using ModelBuilder from the SageMaker Python SDK. Deploy and manage models at scale with AWS CloudFormation.
Description Use the Studio UI to deploy pre-trained models from a catalog to pre-configured inference endpoints. This option is ideal for citizen data scientists, or for anyone who wants to deploy a model without configuring complex settings. Use the ModelBuilder class from the Amazon SageMaker AI Python SDK to deploy your own model and configure deployment settings. This option is ideal for experienced data scientists, or for anyone who has their own model to deploy and requires fine-grained control. Use AWS CloudFormation and Infrastructure as Code (IaC) for programmatic control and automation for deploying and managing SageMaker AI models. This option is ideal for advanced users who require consistent and repeatable deployments.
Optimized for Fast and streamlined deployments of popular open source models Deploying your own models Ongoing management of models in production
Considerations Lack of customization for container settings and specific application needs No UI, requires that you're comfortable developing and maintaining Python code Requires infrastructure management and organizational resources, and also requires familiarity with the AWS SDK for Python (Boto3) or with AWS CloudFormation templates.
Recommended environment A SageMaker AI domain A Python development environment configured with your AWS credentials and the SageMaker Python SDK installed, or a SageMaker AI IDE such as SageMaker JupyterLab The AWS CLI, a local development environment, and Infrastructure as Code (IaC) and CI/CD tools

Additional options

SageMaker AI provides different options for your inference use cases, giving you choice over the technical breadth and depth of your deployments:

  • Deploying a model to an endpoint. When deploying your model, consider the following options:

    • Real-time inference. Real-time inference is ideal for inference workloads where you have interactive, low latency requirements.

    • Deploy models with Amazon SageMaker Serverless Inference. Use Serverless Inference to deploy models without configuring or managing any of the underlying infrastructure. This option is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts.

    • Asynchronous inference. queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1GB), long processing times (up toAsynchronous Inference one hour), and near real-time latency requirements

  • Cost optimization. To optimize your inference costs, consider the following options:

    • Model performance optimization with SageMaker Neo. Use SageMaker Neo to optimize and run your machine learning models with better performance and efficiency, helping you to minimize compute costs by automatically optimizing models to run in environments like AWS Inferentia chips.

    • Automatic scaling of Amazon SageMaker AI models. Use autoscaling to dynamically adjust the compute resources for your endpoints based on incoming traffic patterns, which helps you optimize costs by only paying for the resources you're using at a given time.

PrivacySite termsCookie preferences
© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.