Auto scaling overview

The following overview provides details on the prerequisites and components used for auto scaling.

Prerequisites

Before you can use auto scaling, you must have already created an Amazon SageMaker model endpoint. You can have multiple model versions for the same endpoint. Each model is referred to as a production (model) variant. For more information about deploying a model endpoint, see Deploy the Model to SageMaker Hosting Services.

To activate auto scaling for a model, you can use the SageMaker console, the AWS Command Line Interface (AWS CLI), or an AWS SDK through the Application Auto Scaling API.

  • If this is your first time configuring scaling for a model, we recommend that you start with Configure model auto scaling with the console.

  • When using the AWS CLI or the Application Auto Scaling API, the flow is to register the model as a scalable target, define the scaling policy, and then apply it. You must specify both the endpoint name and the variant name to activate auto scaling for a model. To find these names, open the SageMaker console, choose Endpoints under Inference in the navigation pane, and then choose your endpoint to view its variant names.
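As a sketch, the registration step might look like the following AWS CLI command. The endpoint name my-endpoint, variant name my-variant, and the capacity values are placeholders for your own values:

```shell
# Register the production variant as a scalable target with
# Application Auto Scaling (placeholder names; requires AWS credentials).
aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --min-capacity 1 \
    --max-capacity 8
```

After the variant is registered, you define and apply a scaling policy with the put-scaling-policy command.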

Scaling policy overview

To use auto scaling, you define a scaling policy that adds and removes instances for your production variant in response to actual workloads.

To automatically scale as workload changes occur, you have two options: target tracking and step scaling policies.

We recommend using target tracking scaling policies. With target tracking, you choose an Amazon CloudWatch metric and target value. Auto scaling creates and manages the CloudWatch alarms for the scaling policy and calculates the scaling adjustment based on the metric and the target value. The policy adds and removes instances as required to keep the metric at, or close to, the specified target value. For example, a scaling policy that uses the predefined InvocationsPerInstance metric with a target value of 70 can keep InvocationsPerInstance at, or close to, 70. For more information, see Target tracking scaling policies in the Application Auto Scaling User Guide.
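For example, the target tracking policy described above could be applied with the AWS CLI roughly as follows. The policy, endpoint, and variant names are placeholders, and the variant must already be registered as a scalable target:

```shell
# Apply a target tracking policy that keeps the predefined
# SageMakerVariantInvocationsPerInstance metric near 70 (placeholder names).
aws application-autoscaling put-scaling-policy \
    --policy-name my-invocations-policy \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        }
    }'
```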

You can use step scaling when you require an advanced configuration, such as specifying how many instances to deploy under what conditions. Otherwise, target tracking scaling is preferred because it is fully automated. Note that step scaling can be managed only from the AWS CLI or the Application Auto Scaling API. For an overview of step scaling policies and how they work, see Step scaling policies in the Application Auto Scaling User Guide.
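As a sketch of such an advanced configuration, a step scaling policy might be defined with the AWS CLI as follows. The names and step values are placeholders, and a separate CloudWatch alarm must be created and pointed at the policy ARN that this command returns before the policy can trigger:

```shell
# Define a step scaling policy: add 1 instance for small breaches of the
# alarm threshold and 3 instances for large ones (placeholder names).
aws application-autoscaling put-scaling-policy \
    --policy-name my-step-policy \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --policy-type StepScaling \
    --step-scaling-policy-configuration '{
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments": [
            { "MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 1 },
            { "MetricIntervalLowerBound": 20, "ScalingAdjustment": 3 }
        ]
    }'
```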

To create a target tracking scaling policy, you specify the following:

  • Metric — The CloudWatch metric to track, such as average number of invocations per instance.

  • Target value — The target value for the metric, such as 70 invocations per instance per minute.

You can create target tracking scaling policies with either predefined metrics or custom metrics. A predefined metric is defined in an enumeration, so you can specify it by name in code or select it in the SageMaker console. With the AWS CLI or the Application Auto Scaling API, you can apply a target tracking scaling policy based on either a predefined or a custom metric.
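For instance, a custom-metric variant of the policy might track CPU utilization from the /aws/sagemaker/Endpoints CloudWatch namespace. The names, target value, and dimension values below are illustrative placeholders:

```shell
# Target tracking on a customized metric instead of a predefined one
# (placeholder endpoint, variant, and target value).
aws application-autoscaling put-scaling-policy \
    --policy-name my-cpu-policy \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
        "TargetValue": 50.0,
        "CustomizedMetricSpecification": {
            "MetricName": "CPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                { "Name": "EndpointName", "Value": "my-endpoint" },
                { "Name": "VariantName", "Value": "my-variant" }
            ],
            "Statistic": "Average"
        }
    }'
```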

Note that scaling activities are performed with cooldown periods between them to prevent rapid fluctuations in capacity. You can optionally configure the cooldown periods for your scaling policy.

Scale based on a schedule

You can also create scheduled actions to perform scaling activities at specific times. You can create scheduled actions that scale one time only or that scale on a recurring schedule. After a scheduled action runs, your scaling policy can continue to make decisions about whether to scale dynamically as workload changes occur. Scheduled scaling can be managed only from the AWS CLI or the Application Auto Scaling API. For more information, see Scheduled scaling in the Application Auto Scaling User Guide.
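As an illustration, a recurring scheduled action might look like the following AWS CLI command, which widens the capacity limits every weekday morning (UTC). The action name, schedule, and capacities are placeholders:

```shell
# Recurring scheduled action: raise the scaling limits on weekday mornings
# so the dynamic policy can scale within the new range (placeholder values).
aws application-autoscaling put-scheduled-action \
    --service-namespace sagemaker \
    --scheduled-action-name my-weekday-scale-out \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --schedule "cron(0 8 ? * MON-FRI *)" \
    --scalable-target-action MinCapacity=2,MaxCapacity=10
```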

Minimum and maximum scaling limits

When configuring auto scaling, you must specify your scaling limits before creating a scaling policy. You set limits separately for the minimum and maximum values.

The minimum value must be at least 1, and equal to or less than the value specified for the maximum value.

The maximum value must be equal to or greater than the value specified for the minimum value. SageMaker auto scaling does not enforce a limit for this value.

To determine the scaling limits that you need for typical traffic, test your auto scaling configuration with the expected rate of traffic to your model.

If a variant’s traffic becomes zero, SageMaker automatically scales in to the minimum number of instances specified. In this case, SageMaker emits metrics with a value of zero.

There are three options for specifying the minimum and maximum capacity:

  1. Use the console to update the Minimum instance count and Maximum instance count settings.

  2. Use the AWS CLI and include the --min-capacity and --max-capacity options when running the register-scalable-target command.

  3. Call the RegisterScalableTarget API and specify the MinCapacity and MaxCapacity parameters.

Tip

You can manually scale out by increasing the minimum value, or manually scale in by decreasing the maximum value.
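For example, with option 2 above, re-running register-scalable-target with a higher minimum forces a scale-out. The names and counts are placeholders:

```shell
# Raise the minimum from 1 to 3; Application Auto Scaling adds instances
# until current capacity is within the new limits (placeholder names).
aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --min-capacity 3 \
    --max-capacity 8
```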

Cooldown period

A cooldown period is used to protect against over-scaling when your model is scaling in (reducing capacity) or scaling out (increasing capacity). It does this by slowing down subsequent scaling activities until the period expires. Specifically, it blocks the deletion of instances for scale-in requests, and limits the creation of instances for scale-out requests. For more information, see Define cooldown periods in the Application Auto Scaling User Guide.

You configure the cooldown period in your scaling policy.

If you don't specify a scale-in or a scale-out cooldown period, your scaling policy uses the default, which is 300 seconds for each.

If instances are being added or removed too quickly when you test your scaling configuration, consider increasing this value. You might see this behavior if the traffic to your model has a lot of spikes, or if you have multiple scaling policies defined for a variant.

If instances are not being added quickly enough to address increased traffic, consider decreasing this value.
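For a target tracking policy, the cooldown periods can be set directly in the policy configuration. A sketch, with placeholder names and illustrative values of 10 minutes for scale-in and 5 minutes for scale-out:

```shell
# Target tracking policy with explicit cooldown periods, in seconds
# (placeholder names and illustrative cooldown values).
aws application-autoscaling put-scaling-policy \
    --policy-name my-invocations-policy \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 300
    }'
```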

Permissions

Auto scaling is made possible by a combination of the Amazon SageMaker, Amazon CloudWatch, and Application Auto Scaling APIs. For information about the minimum required permissions, see Application Auto Scaling identity-based policy examples in the Application Auto Scaling User Guide.

The SagemakerFullAccessPolicy IAM policy has all the IAM permissions required to perform auto scaling. For more information about SageMaker IAM permissions, see SageMaker Roles.

If you manage your own permission policy, you must include the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeEndpoint",
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:UpdateEndpointWeightsAndCapacities"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "application-autoscaling:*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "arn:aws:iam::*:role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint",
            "Condition": {
                "StringLike": {
                    "iam:AWSServiceName": "sagemaker.application-autoscaling.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricAlarm",
                "cloudwatch:DescribeAlarms",
                "cloudwatch:DeleteAlarms"
            ],
            "Resource": "*"
        }
    ]
}

Service-linked role

Auto scaling uses the AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint service-linked role. This service-linked role grants Application Auto Scaling permission to describe the alarms for your policies, to monitor current capacity levels, and to scale the target resource. This role is created for you automatically. For automatic role creation to succeed, you must have permission for the iam:CreateServiceLinkedRole action. For more information, see Service-linked roles in the Application Auto Scaling User Guide.

For more information about configuring auto scaling, see the following resources:

Note

SageMaker recently introduced new inference capabilities built on real-time inference endpoints. You create a SageMaker endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. Then, create an inference component, which is a SageMaker hosting object that you can use to deploy a model to an endpoint. For information about scaling inference components, see SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency and Reduce model deployment costs by 50% on average using the latest features of SageMaker on the AWS Blog.