
Deploy a generative AI inference recommendation - Amazon SageMaker AI


When a recommendation job completes, each recommendation includes a deployment-ready configuration. You can deploy the chosen configuration to a SageMaker AI inference endpoint with a single action from SageMaker AI Studio, or programmatically through the API.

Understanding deployment configurations

Each recommendation in the job response contains a DeploymentConfiguration object with the following fields:

- ImageUri: The container image URI optimized for the recommended instance type.
- InstanceType: The recommended instance type for deployment.
- InstanceCount: The number of instances needed to meet the performance target.
- CopyCountPerInstance: The number of model copies to run on each instance. When this is greater than one, multiple copies of the model are loaded on each instance to increase throughput.
- EnvironmentVariables: Environment variables configured for optimal performance, such as tensor parallel size and maximum sequence length.
- S3: Amazon S3 channel references for model artifacts, including any optimized model outputs.
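For orientation, the fields above can be pictured as a plain dictionary. The following sketch uses hypothetical example values (instance type, environment variable names, and S3 path are illustrative, not output from a real job) and shows how the instance count and copy count combine to determine total serving capacity:

```python
# Hypothetical example of a DeploymentConfiguration, shaped like the
# fields described above. All values are illustrative only.
deployment_configuration = {
    "ImageUri": "123456789012.dkr.ecr.us-west-2.amazonaws.com/example-inference-image:latest",
    "InstanceType": "ml.g5.12xlarge",
    "InstanceCount": 2,
    "CopyCountPerInstance": 2,
    "EnvironmentVariables": {
        "OPTION_TENSOR_PARALLEL_DEGREE": "4",
        "OPTION_MAX_MODEL_LEN": "8192",
    },
    "S3": {
        "ModelDataSource": "s3://amzn-s3-demo-bucket/optimized-model/",
    },
}

# Total model copies serving traffic across the fleet:
# InstanceCount x CopyCountPerInstance.
total_copies = (
    deployment_configuration["InstanceCount"]
    * deployment_configuration["CopyCountPerInstance"]
)
print(f"Total model copies across the fleet: {total_copies}")  # prints 4
```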

Deploy using the API

To deploy a recommendation programmatically, use the model package from the recommendation to create a SageMaker AI model and endpoint. Each recommendation includes a ModelDetails object with the model package ARN and inference specification name. This is the simplest deployment path because the model package already contains the container image, environment variables, and model artifact channels.

```python
import boto3

client = boto3.client("sagemaker", region_name="us-west-2")

# Get the recommendation from a completed job
response = client.describe_ai_recommendation_job(
    AIRecommendationJobName="my-recommendation-job"
)

# Select a recommendation (e.g., the first one)
recommendation = response["Recommendations"][0]
model_details = recommendation["ModelDetails"]
deploy_config = recommendation["DeploymentConfiguration"]

# Create a model from the model package.
# The model package already contains the container image, environment
# variables, and S3 data channels (base model + optimization artifacts).
model_name = "my-recommended-model"
container_def = {
    "ModelPackageName": model_details["ModelPackageArn"],
}

# If the recommendation uses a named inference specification (e.g., for
# a specific optimization variant), specify it so SageMaker selects the
# correct container and instance configuration from the model package.
if model_details.get("InferenceSpecificationName"):
    container_def["InferenceSpecificationName"] = model_details["InferenceSpecificationName"]

client.create_model(
    ModelName=model_name,
    PrimaryContainer=container_def,
    ExecutionRoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)

# Create an endpoint configuration
endpoint_config_name = "my-recommended-endpoint-config"
production_variant = {
    "VariantName": "AllTraffic",
    "ModelName": model_name,
    "InstanceType": deploy_config["InstanceType"],
    "InitialInstanceCount": deploy_config.get("InstanceCount", 1),
}

copy_count = deploy_config.get("CopyCountPerInstance")
if copy_count and copy_count > 1:
    production_variant["InferenceAmiVersion"] = "al2-ami-sagemaker-inference-gpu-2"
    production_variant["RoutingConfig"] = {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"}

client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[production_variant],
)

# Create the endpoint
endpoint_name = "my-recommended-endpoint"
client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
print(f"Endpoint {endpoint_name} is being created.")
```

After the endpoint is created, you can monitor its status using the DescribeEndpoint API until it reaches InService status.

```python
import time

while True:
    response = client.describe_endpoint(EndpointName=endpoint_name)
    status = response["EndpointStatus"]
    print(f"Endpoint status: {status}")
    if status in ("InService", "Failed"):
        break
    time.sleep(60)
```
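As an alternative to polling in a loop, boto3 ships a built-in waiter for this exact condition. The sketch below wraps it in a small helper (the function name and the Delay/MaxAttempts values are illustrative choices, not part of the recommendation output); it blocks until the endpoint is InService and raises a WaiterError on failure or timeout:

```python
def wait_for_endpoint(endpoint_name: str, region: str = "us-west-2") -> None:
    """Block until the endpoint is InService.

    Raises botocore.exceptions.WaiterError if the endpoint fails or
    the waiter times out.
    """
    # Imported inside the function so the helper can be defined even
    # where boto3 is not installed; move to module level in real code.
    import boto3

    client = boto3.client("sagemaker", region_name=region)
    waiter = client.get_waiter("endpoint_in_service")
    waiter.wait(
        EndpointName=endpoint_name,
        # Example values: poll every 60 seconds, up to 60 attempts.
        WaiterConfig={"Delay": 60, "MaxAttempts": 60},
    )

# Usage (requires AWS credentials and an endpoint being created):
# wait_for_endpoint("my-recommended-endpoint")
```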

Deploy from SageMaker AI Studio

You can also deploy a recommended configuration directly from SageMaker AI Studio with a single action. In SageMaker AI Studio, navigate to the completed recommendation job, review the recommendations and their performance metrics, and choose the configuration you want to deploy.