Deploy models for inference - Amazon SageMaker

Deploy models for inference

With Amazon SageMaker, you can start getting predictions, or inferences, from your trained machine learning models. SageMaker provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs. With SageMaker Inference, you can scale your model deployment, manage models more effectively in production, and reduce operational burden. SageMaker provides you with various inference options, such as real-time endpoints for low-latency inference, serverless endpoints for fully managed infrastructure and auto-scaling, and asynchronous endpoints for batches of requests. By choosing the appropriate inference option for your use case, you can ensure efficient model deployment and inference.
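As an illustration of what getting a prediction looks like once an endpoint exists, the following is a minimal sketch that sends one request through the `invoke_endpoint` operation of the Boto3 `sagemaker-runtime` client. It assumes the model container accepts and returns JSON, which is not true of every model; the helper name `predict`, the endpoint name, and the payload shape are hypothetical.

```python
import json


def predict(runtime_client, endpoint_name, payload):
    """Send one JSON request to a deployed endpoint and decode the JSON reply.

    Assumption: the model container speaks JSON in both directions.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())


# Example usage (requires AWS credentials and an existing endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   result = predict(runtime, "my-endpoint", {"inputs": [1.0, 2.0]})
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before any AWS resources exist.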

Choosing a feature

There are several use cases for deploying ML models with SageMaker. This section describes those use cases, as well as the SageMaker feature we recommend for each use case.

Use cases

The following are the main use cases for deploying ML models with SageMaker.

  • Use case 1: Deploy a machine learning model in a low-code or no-code environment. If you are a beginner or new to SageMaker, you can deploy pre-trained models using Amazon SageMaker JumpStart through the Amazon SageMaker Studio interface, without the need for complex configurations.

  • Use case 2: Use code to deploy machine learning models with more flexibility and control. Experienced ML practitioners can deploy their own models with customized settings for their application needs using the ModelBuilder class in the SageMaker Python SDK, which provides fine-grained control over various settings, such as instance types, network isolation, and resource allocation.

  • Use case 3: Deploy machine learning models at scale. For advanced users and organizations who want to manage models at scale in production, use the AWS SDK for Python (Boto3) and AWS CloudFormation along with your desired Infrastructure as Code (IaC) and CI/CD tools to provision resources and automate resource management.
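To make the code-driven path in use case 3 more concrete, here is a rough sketch of the three Boto3 calls that typically stand behind a real-time endpoint: `create_model`, `create_endpoint_config`, and `create_endpoint`. The function name, the default instance type, the `-config` naming suffix, and all argument values are assumptions chosen for illustration; a production setup would more likely express the same resources in an AWS CloudFormation template or another IaC tool.

```python
def deploy_realtime_endpoint(sm_client, name, image_uri, model_data_url, role_arn,
                             instance_type="ml.m5.large", instance_count=1):
    """Create a model, an endpoint configuration, and an endpoint.

    sm_client is expected to behave like boto3.client("sagemaker").
    All names and defaults here are illustrative assumptions.
    """
    # 1. Register the model: inference container image plus model artifacts.
    sm_client.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )

    # 2. Describe the compute that will host the model.
    config_name = f"{name}-config"
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    )

    # 3. Create the endpoint; provisioning continues asynchronously afterwards.
    sm_client.create_endpoint(EndpointName=name, EndpointConfigName=config_name)
    return name


# Example usage (requires AWS credentials, an IAM role, and model artifacts in S3):
#   import boto3
#   sm = boto3.client("sagemaker")
#   deploy_realtime_endpoint(sm, "demo-model", "<image-uri>",
#                            "s3://<bucket>/model.tar.gz", "<role-arn>")
```

The call returns before the endpoint is ready to serve traffic, so a caller would still need to wait for the endpoint to reach an in-service state before sending requests.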

The following table describes key considerations and tradeoffs for SageMaker features corresponding with each use case.

| | Use case 1 | Use case 2 | Use case 3 |
|---|---|---|---|
| SageMaker feature | Use JumpStart in Studio to accelerate your foundational model deployment. | Deploy models using ModelBuilder from the SageMaker Python SDK. | Deploy and manage models at scale with AWS CloudFormation. |
| Description | Use the Studio UI to deploy pre-trained models from a catalog to pre-configured inference endpoints. This option is ideal for citizen data scientists, or for anyone who wants to deploy a model without configuring complex settings. | Use the ModelBuilder class from the Amazon SageMaker Python SDK to deploy your own model and configure deployment settings. This option is ideal for experienced data scientists, or for anyone who has their own model to deploy and requires fine-grained control. | Use AWS CloudFormation and Infrastructure as Code (IaC) for programmatic control and automation for deploying and managing SageMaker models. This option is ideal for advanced users who require consistent and repeatable deployments. |
| Optimized for | Fast and streamlined deployments of popular open source models | Deploying your own models | Ongoing management of models in production |
| Considerations | Lack of customization for container settings and specific application needs | No UI, requires that you're comfortable developing and maintaining Python code | Requires infrastructure management and organizational resources, and also requires familiarity with the AWS SDK for Python (Boto3) or with AWS CloudFormation templates |
| Recommended environment | A SageMaker domain | A Python development environment configured with your AWS credentials and the SageMaker Python SDK installed, or a SageMaker IDE such as SageMaker JupyterLab | The AWS CLI, a local development environment, and Infrastructure as Code (IaC) and CI/CD tools |

Additional options

SageMaker provides different options for your inference use cases, giving you choice over the technical breadth and depth of your deployments:

  • Deploying a model to an endpoint. When deploying your model, consider the following options:

    • Real-time inference. Real-time inference is ideal for inference workloads where you have interactive, low latency requirements.

    • Deploy models with Amazon SageMaker Serverless Inference. Use Serverless Inference to deploy models without configuring or managing any of the underlying infrastructure. This option is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts.

    • Asynchronous inference. Asynchronous Inference queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1 GB), long processing times (up to one hour), and near real-time latency requirements.

  • Cost optimization. To optimize your inference costs, consider the following options:

    • Optimize model performance using Neo. Use SageMaker Neo to optimize and run your machine learning models with better performance and efficiency, helping you to minimize compute costs by automatically optimizing models to run in environments like AWS Inferentia chips.

    • Automatically Scale Amazon SageMaker Models. Use autoscaling to dynamically adjust the compute resources for your endpoints based on incoming traffic patterns, which helps you optimize costs by only paying for the resources you're using at a given time.
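The three endpoint options listed above differ mainly in the endpoint configuration. The sketch below builds the keyword arguments for the Boto3 `create_endpoint_config` call for each option, using the `ServerlessConfig` and `AsyncInferenceConfig` request fields. The helper name, the option strings, the default instance type, and the memory and concurrency values are all assumptions picked for illustration, not recommended settings.

```python
def endpoint_config_kwargs(option, config_name, model_name, *,
                           instance_type="ml.m5.large", s3_output_path=None):
    """Build create_endpoint_config arguments for one inference option.

    option is one of "real-time", "serverless", or "asynchronous"
    (hypothetical labels used only by this sketch).
    """
    variant = {"VariantName": "AllTraffic", "ModelName": model_name}
    kwargs = {"EndpointConfigName": config_name, "ProductionVariants": [variant]}

    if option == "serverless":
        # No instances to manage; illustrative memory and concurrency values.
        variant["ServerlessConfig"] = {"MemorySizeInMB": 2048, "MaxConcurrency": 5}
    elif option in ("real-time", "asynchronous"):
        # Both options run on instances that you select.
        variant["InstanceType"] = instance_type
        variant["InitialInstanceCount"] = 1
        if option == "asynchronous":
            if not s3_output_path:
                raise ValueError("asynchronous inference needs an S3 output path")
            # Results of queued requests are written to this S3 location.
            kwargs["AsyncInferenceConfig"] = {
                "OutputConfig": {"S3OutputPath": s3_output_path}
            }
    else:
        raise ValueError(f"unknown inference option: {option!r}")
    return kwargs


# Example usage (requires AWS credentials and an existing model):
#   import boto3
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(
#       **endpoint_config_kwargs("serverless", "demo-config", "demo-model"))
```

Keeping the configuration in a plain dictionary makes it straightforward to review or version-control before any resources are created.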