
Deploy models for real-time inference

Real-time inference is ideal for inference workloads with real-time, interactive, low-latency requirements. This section shows how you can use real-time inferencing to obtain predictions interactively from your model.

To deploy the model that produced the best validation metric in an Autopilot experiment, you have several options. For example, when using Autopilot in SageMaker Studio Classic, you can deploy the model automatically or manually. You can also use SageMaker APIs to manually deploy an Autopilot model.

The following tabs show three options for deploying your model. These instructions assume that you have already created a model in Autopilot. If you don't have a model, see Create Regression or Classification Jobs for Tabular Data Using the AutoML API. To see examples for each option, open each tab.

The Autopilot UI contains helpful dropdown menus, toggles, tooltips, and more to help you navigate through model deployment. You can deploy using either of the following procedures: Automatic or Manual.

  • Automatic Deployment: To automatically deploy the best model from an Autopilot experiment to an endpoint

    1. Create an experiment in SageMaker Studio Classic.

    2. Toggle the Auto deploy value to Yes.

      Note

      Automatic deployment will fail if either the default resource quota or your customer quota for endpoint instances in a Region is too limited. In hyperparameter optimization (HPO) mode, you are required to have at least two ml.m5.2xlarge instances. In ensembling mode, you are required to have at least one ml.m5.12xlarge instance. If you encounter a failure related to quotas, you can request a service limit increase for SageMaker endpoint instances. (A sketch for checking your current quotas follows these procedures.)

  • Manual Deployment: To manually deploy the best model from an Autopilot experiment to an endpoint

    1. Create an experiment in SageMaker Studio Classic.

    2. Toggle the Auto deploy value to No.

    3. Select the model that you want to deploy under Model name.

    4. Select the orange Deployment and advanced settings button located on the right of the leaderboard. This opens a new tab.

    5. Configure the endpoint name, instance type, and other optional information.

    6. Select the orange Deploy model to deploy to an endpoint.

    7. Check the progress of the endpoint creation process in the SageMaker console at https://console.aws.amazon.com/sagemaker/ by navigating to the Endpoints section, located under Inference in the navigation panel.

    8. After the endpoint status changes from Creating to InService, return to Studio Classic and invoke the endpoint.

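If deployment fails because of insufficient endpoint-instance quotas (see the note in the automatic deployment procedure), you can inspect your current SageMaker quotas before requesting an increase. The following is a minimal sketch using the boto3 Service Quotas client; the quota-name filter shown is an assumption and may need adjusting for the instance type you plan to use.

    import boto3

    # list SageMaker service quotas and print the ones related to endpoint usage
    quotas_client = boto3.client('service-quotas', region_name='us-west-2')

    paginator = quotas_client.get_paginator('list_service_quotas')
    for page in paginator.paginate(ServiceCode='sagemaker'):
        for quota in page['Quotas']:
            # the exact quota names (for example, "ml.m5.2xlarge for endpoint usage")
            # are assumptions; adjust the filter to the instance type you need
            if 'endpoint usage' in quota['QuotaName']:
                print(f"{quota['QuotaName']}: {quota['Value']}")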

You can also obtain real-time inference by deploying your model using API calls. This section shows the steps of this process using AWS Command Line Interface (AWS CLI) code snippets.

For complete code examples for both AWS CLI commands and AWS SDK for Python (boto3), open the tabs directly following these steps.

  1. Obtain candidate definitions

    Obtain the candidate container definitions from InferenceContainers. These candidate definitions are used to create a SageMaker model.

    The following example uses the DescribeAutoMLJob API to obtain candidate definitions for the best model candidate. See the following AWS CLI command as an example.

    aws sagemaker describe-auto-ml-job --auto-ml-job-name <job-name> --region <region>
  2. List candidates

    The following example uses the ListCandidatesForAutoMLJob API to list all candidates. See the following AWS CLI command as an example.

    aws sagemaker list-candidates-for-auto-ml-job --auto-ml-job-name <job-name> --region <region>
  3. Create a SageMaker model

    Use the container definitions from the previous steps to create a SageMaker model by using the CreateModel API. See the following AWS CLI command as an example.

    aws sagemaker create-model --model-name '<your-custom-model-name>' \
        --containers '[<container-definition1>, <container-definition2>, <container-definition3>]' \
        --execution-role-arn '<execution-role-arn>' \
        --region '<region>'
  4. Create an endpoint configuration

    The following example uses the CreateEndpointConfig API to create an endpoint configuration. See the following AWS CLI command as an example.

    aws sagemaker create-endpoint-config --endpoint-config-name '<your-custom-endpoint-config-name>' \
        --production-variants '<list-of-production-variants>' \
        --region '<region>'
  5. Create the endpoint

    The following AWS CLI example uses the CreateEndpoint API to create the endpoint.

    aws sagemaker create-endpoint --endpoint-name '<your-custom-endpoint-name>' \
        --endpoint-config-name '<endpoint-config-name-you-just-created>' \
        --region '<region>'

    Check the progress of your endpoint deployment by using the DescribeEndpoint API. See the following AWS CLI command as an example.

    aws sagemaker describe-endpoint --endpoint-name '<endpoint-name>' --region <region>

    After the EndpointStatus changes to InService, the endpoint is ready to use for real-time inference.

  6. Invoke the endpoint

    The following command structure invokes the endpoint for real-time inferencing.

    aws sagemaker-runtime invoke-endpoint --endpoint-name '<endpoint-name>' \
        --region '<region>' \
        --body '<your-data>' \
        --content-type '<content-type>' \
        <outfile>

The following tabs contain complete code examples for deploying a model with AWS SDK for Python (boto3) or the AWS CLI.

AWS SDK for Python (boto3)
  1. Obtain the candidate definitions by using the following code example.

    import sagemaker
    import boto3

    session = sagemaker.session.Session()
    sagemaker_client = boto3.client('sagemaker', region_name='us-west-2')

    job_name = 'test-auto-ml-job'
    describe_response = sagemaker_client.describe_auto_ml_job(AutoMLJobName=job_name)

    # extract the best candidate definition from the DescribeAutoMLJob response
    best_candidate = describe_response['BestCandidate']

    # extract the InferenceContainers definition from the candidate definition
    inference_containers = best_candidate['InferenceContainers']
  2. Create the model by using the following code example.

    # Create Model
    model_name = 'test-model'
    sagemaker_role = 'arn:aws:iam::444455556666:role/sagemaker-execution-role'

    create_model_response = sagemaker_client.create_model(
        ModelName = model_name,
        ExecutionRoleArn = sagemaker_role,
        Containers = inference_containers
    )
  3. Create the endpoint configuration by using the following code example.

    endpoint_config_name = 'test-endpoint-config'
    instance_type = 'ml.m5.2xlarge'
    # for all supported instance types, see
    # https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html#sagemaker-Type-ProductionVariant-InstanceType

    # Create endpoint config
    endpoint_config_response = sagemaker_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "VariantName": "variant1",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1
            }
        ]
    )

    print(f"Created EndpointConfig: {endpoint_config_response['EndpointConfigArn']}")
  4. Create the endpoint and deploy the model with the following code example.

    # create endpoint and deploy the model
    endpoint_name = 'test-endpoint'
    create_endpoint_response = sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name)
    print(create_endpoint_response)

    Check the status of creating the endpoint by using the following code example. (A waiter-based alternative that blocks until the endpoint is in service is sketched after this procedure.)

    # describe endpoint creation status
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
  5. Invoke the endpoint for real-time inferencing by using the following command structure.

    # once the endpoint status is InService, you can invoke the endpoint for inferencing
    if status == "InService":
        sm_runtime = boto3.Session().client('sagemaker-runtime')
        inference_result = sm_runtime.invoke_endpoint(EndpointName='test-endpoint',
                                                      ContentType='text/csv',
                                                      Body='1,2,3,4,class')
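
Instead of polling DescribeEndpoint yourself, you can block until the endpoint is in service by using the boto3 waiter for SageMaker endpoints. The following is a minimal sketch using the same endpoint name as the steps above.

    import boto3

    sagemaker_client = boto3.client('sagemaker', region_name='us-west-2')
    endpoint_name = 'test-endpoint'

    # block until the endpoint reaches the InService status
    # (raises a WaiterError if the endpoint cannot be created)
    waiter = sagemaker_client.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=endpoint_name)
    print(f"Endpoint {endpoint_name} is in service")
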
AWS Command Line Interface (AWS CLI)
  1. Obtain the candidate definitions by using the following code example.

    aws sagemaker describe-auto-ml-job --auto-ml-job-name 'test-automl-job' --region us-west-2
  2. Create the model by using the following code example.

    aws sagemaker create-model --model-name 'test-sagemaker-model' \
        --containers '[{
            "Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:2.5-1-cpu-py3",
            "ModelDataUrl": "s3://amzn-s3-demo-bucket/output/model.tar.gz",
            "Environment": {
                "AUTOML_SPARSE_ENCODE_RECORDIO_PROTOBUF": "1",
                "AUTOML_TRANSFORM_MODE": "feature-transform",
                "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "application/x-recordio-protobuf",
                "SAGEMAKER_PROGRAM": "sagemaker_serve",
                "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code"
            }
        }, {
            "Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3",
            "ModelDataUrl": "s3://amzn-s3-demo-bucket/output/model.tar.gz",
            "Environment": {
                "MAX_CONTENT_LENGTH": "20971520",
                "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
                "SAGEMAKER_INFERENCE_OUTPUT": "predicted_label",
                "SAGEMAKER_INFERENCE_SUPPORTED": "predicted_label,probability,probabilities"
            }
        }, {
            "Image": "348316444620.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:2.5-1-cpu-py3",
            "ModelDataUrl": "s3://amzn-s3-demo-bucket/output/model.tar.gz",
            "Environment": {
                "AUTOML_TRANSFORM_MODE": "inverse-label-transform",
                "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
                "SAGEMAKER_INFERENCE_INPUT": "predicted_label",
                "SAGEMAKER_INFERENCE_OUTPUT": "predicted_label",
                "SAGEMAKER_INFERENCE_SUPPORTED": "predicted_label,probability,labels,probabilities",
                "SAGEMAKER_PROGRAM": "sagemaker_serve",
                "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code"
            }
        }]' \
        --execution-role-arn 'arn:aws:iam::1234567890:role/sagemaker-execution-role' \
        --region 'us-west-2'

    For additional details, see creating a model.

    The create model command will return a response in the following format.

    { "ModelArn": "arn:aws:sagemaker:us-west-2:1234567890:model/test-sagemaker-model" }
  3. Create an endpoint configuration by using the following code example.

    aws sagemaker create-endpoint-config --endpoint-config-name 'test-endpoint-config' \
        --production-variants '[{"VariantName": "variant1", "ModelName": "test-sagemaker-model", "InitialInstanceCount": 1, "InstanceType": "ml.m5.2xlarge"}]' \
        --region us-west-2

    The create endpoint configuration command will return a response in the following format.

    { "EndpointConfigArn": "arn:aws:sagemaker:us-west-2:1234567890:endpoint-config/test-endpoint-config" }
  4. Create an endpoint by using the following code example.

    aws sagemaker create-endpoint --endpoint-name 'test-endpoint' \
        --endpoint-config-name 'test-endpoint-config' \
        --region us-west-2

    The create endpoint command will return a response in the following format.

    { "EndpointArn": "arn:aws:sagemaker:us-west-2:1234567890:endpoint/test-endpoint" }

    Check the progress of the endpoint deployment by using the following describe-endpoint CLI code example.

    aws sagemaker describe-endpoint --endpoint-name 'test-endpoint' --region us-west-2

    The previous progress check will return a response in the following format.

    { "EndpointName": "test-endpoint", "EndpointArn": "arn:aws:sagemaker:us-west-2:1234567890:endpoint/test-endpoint", "EndpointConfigName": "test-endpoint-config", "EndpointStatus": "Creating", "CreationTime": 1660251167.595, "LastModifiedTime": 1660251167.595 }

    After the EndpointStatus changes to InService, the endpoint is ready for use in real-time inference.

  5. Invoke the endpoint for real-time inferencing by using the following command structure.

    aws sagemaker-runtime invoke-endpoint --endpoint-name 'test-endpoint' \
        --region 'us-west-2' \
        --body '1,51,3.5,1.4,0.2' \
        --content-type 'text/csv' \
        '/tmp/inference_output'

    For more options, see invoking an endpoint.

You can deploy an Autopilot model from a different account than the account in which the model was generated. To implement cross-account model deployment, this section shows how to do the following:

  1. Grant permission to the deploying account

    To assume the role in the generating account, you must grant permission to the deploying account. This allows the deploying account to describe Autopilot jobs in the generating account.

    The following example uses a generating account with a trusted sagemaker-role entity. The example shows how to give a deploying account with the ID 111122223333 permission to assume the role of the generating account.

    "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "sagemaker.amazonaws.com" ], "AWS": [ "111122223333"] }, "Action": "sts:AssumeRole" }

    The new account with the ID 111122223333 can now assume the role for the generating account.

    Next, call the DescribeAutoMLJob API from the deploying account to obtain a description of the job created by the generating account.

    The following code example describes the model from the deploying account.

    import boto3

    sts_client = boto3.client('sts')

    role = 'arn:aws:iam::111122223333:role/sagemaker-role'
    role_session_name = "role-session-name"

    # assume the role in the generating account
    _assumed_role = sts_client.assume_role(RoleArn=role, RoleSessionName=role_session_name)
    credentials = _assumed_role["Credentials"]
    access_key = credentials["AccessKeyId"]
    secret_key = credentials["SecretAccessKey"]
    session_token = credentials["SessionToken"]

    session = boto3.session.Session()
    sm_client = session.client('sagemaker', region_name='us-west-2',
                               aws_access_key_id=access_key,
                               aws_secret_access_key=secret_key,
                               aws_session_token=session_token)

    # now you can call DescribeAutoMLJob for the job created in the generating account
    job_name = "test-job"
    response = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
  2. Grant access to the deploying account to the model artifacts in the generating account.

    The deploying account only needs access to the model artifacts in the generating account to deploy it. These are located in the S3OutputPath that was specified in the original CreateAutoMLJob API call during model generation.

    To give the deploying account access to the model artifacts, choose one of the following options:

    1. Give access to the ModelDataUrl from the generating account to the deploying account.

      Next, give the deploying account permission to assume the role, and then follow the real-time inferencing steps to deploy.

    2. Copy model artifacts from the generating account's original S3OutputPath to a bucket in the deploying account. (A copy sketch follows these steps.)

      To grant access to the model artifacts, you must define a best_candidate model and reassign model containers to the new account.

      The following example shows how to define a best_candidate model and reassign the ModelDataUrl.

      best_candidate = automl.describe_auto_ml_job()['BestCandidate']

      # reassigning ModelDataUrl for best_candidate containers below
      new_model_locations = ['new-container-1-ModelDataUrl', 'new-container-2-ModelDataUrl', 'new-container-3-ModelDataUrl']
      new_model_locations_index = 0

      for container in best_candidate['InferenceContainers']:
          container['ModelDataUrl'] = new_model_locations[new_model_locations_index]
          new_model_locations_index += 1

      After this assignment of containers, follow the steps in Deploy using SageMaker APIs to deploy.
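
The following is a minimal sketch of option 2, copying the model artifacts from the generating account's S3OutputPath into a bucket owned by the deploying account. The bucket names and keys are placeholders, and the sketch assumes that the credentials in use can read the source object and write to the destination bucket.

    import boto3

    s3_client = boto3.client('s3')

    # placeholder locations; replace with the S3OutputPath of the Autopilot job
    # and a bucket owned by the deploying account
    source_bucket = 'amzn-s3-demo-bucket'                   # generating account
    source_key = 'output/model.tar.gz'
    destination_bucket = 'amzn-s3-demo-destination-bucket'  # deploying account
    destination_key = 'autopilot-artifacts/model.tar.gz'

    # copy the model artifact into the deploying account's bucket
    s3_client.copy(
        CopySource={'Bucket': source_bucket, 'Key': source_key},
        Bucket=destination_bucket,
        Key=destination_key
    )

    # use this location as the new ModelDataUrl when reassigning containers
    new_model_data_url = f's3://{destination_bucket}/{destination_key}'
    print(new_model_data_url)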

To build a payload for real-time inferencing, see the notebook example to define a test payload. To create the payload from a CSV file and invoke an endpoint, see the Predict with your model section in Create a machine learning model automatically.
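
As a minimal sketch of that workflow, the following reads feature rows from a hypothetical CSV file and invokes an endpoint with boto3; the file name, endpoint name, and sample values are placeholders, not taken from the notebook.

    import boto3

    sm_runtime = boto3.client('sagemaker-runtime', region_name='us-west-2')

    # read one or more CSV rows of features (without the target column) as the payload
    with open('test_payload.csv') as f:  # hypothetical file of feature rows
        payload = f.read()

    response = sm_runtime.invoke_endpoint(
        EndpointName='test-endpoint',    # placeholder endpoint name
        ContentType='text/csv',
        Body=payload
    )

    # the prediction is returned in the response body
    print(response['Body'].read().decode('utf-8'))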