Amazon SageMaker and Application Auto Scaling
You can scale SageMaker endpoint variants, provisioned concurrency for serverless endpoints, and inference components using target tracking scaling policies, step scaling policies, and scheduled scaling.
Use the following information to help you integrate SageMaker with Application Auto Scaling.
Service-linked role created for SageMaker
The following service-linked role is automatically created in your AWS account when you register SageMaker resources as scalable targets with Application Auto Scaling. This role allows Application Auto Scaling to perform supported operations within your account. For more information, see Service-linked roles for Application Auto Scaling.
- AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint
Service principal used by the service-linked role
The service-linked role in the previous section can be assumed only by the service principal authorized by the trust relationships defined for the role. The service-linked role used by Application Auto Scaling grants access to the following service principal:
- sagemaker.application-autoscaling.amazonaws.com
Registering SageMaker endpoint variants as scalable targets with Application Auto Scaling
Application Auto Scaling requires a scalable target before you can create scaling policies or scheduled actions for a SageMaker model (variant). A scalable target is a resource that Application Auto Scaling can scale out and scale in. Scalable targets are uniquely identified by the combination of resource ID, scalable dimension, and namespace.
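As a rough illustration of how those three values identify a scalable target, the following Python sketch models them as a tuple. The names ScalableTargetKey and variant_key are hypothetical helpers introduced only for this example. They are not part of any AWS SDK. The resource ID format and the dimension name are taken from the CLI example in this topic.

```python
from typing import NamedTuple


class ScalableTargetKey(NamedTuple):
    """The three values that together identify one scalable target."""
    service_namespace: str
    resource_id: str
    scalable_dimension: str


def variant_key(endpoint_name: str, variant_name: str) -> ScalableTargetKey:
    """Build the key for a variant's instance count (hypothetical helper)."""
    return ScalableTargetKey(
        service_namespace="sagemaker",
        resource_id=f"endpoint/{endpoint_name}/variant/{variant_name}",
        scalable_dimension="sagemaker:variant:DesiredInstanceCount",
    )


key_a = variant_key("my-endpoint", "my-variant")
key_b = variant_key("my-endpoint", "other-variant")

print(key_a.resource_id)
# Two variants on the same endpoint have different resource IDs,
# so they are separate scalable targets.
print(key_a == key_b)
```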
If you configure auto scaling using the SageMaker console, then SageMaker automatically registers a scalable target for you.
If you want to configure auto scaling using the AWS CLI or one of the AWS SDKs, you can use the following options:
- AWS CLI:
  Call the register-scalable-target command for a production variant. The following example registers the desired instance count for a production variant called my-variant, running on the my-endpoint endpoint, with a minimum capacity of one instance and a maximum capacity of eight instances.

    aws application-autoscaling register-scalable-target \
      --service-namespace sagemaker \
      --scalable-dimension sagemaker:variant:DesiredInstanceCount \
      --resource-id endpoint/my-endpoint/variant/my-variant \
      --min-capacity 1 \
      --max-capacity 8

  If successful, this command returns the ARN of the scalable target.

    {
        "ScalableTargetARN": "arn:aws:application-autoscaling:region:account-id:scalable-target/1234abcd56ab78cd901ef1234567890ab123"
    }

- AWS SDK:
  Call the RegisterScalableTarget operation and provide ResourceId, ScalableDimension, ServiceNamespace, MinCapacity, and MaxCapacity as parameters.
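The AWS SDK option can be sketched in Python with boto3 along the following lines. This is a minimal sketch, not a definitive implementation. It assumes boto3 is installed and that credentials and a Region are configured. The helper names build_variant_target and register_variant_target are hypothetical, and the endpoint and variant names are the placeholders used in the CLI example.

```python
def build_variant_target(endpoint_name, variant_name, min_instances, max_instances):
    """Assemble RegisterScalableTarget parameters for a variant (hypothetical helper)."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }


def register_variant_target(params):
    """Send the request and return the ARN of the scalable target."""
    # Imported here so that building the parameters needs no AWS setup.
    import boto3

    client = boto3.client("application-autoscaling")
    response = client.register_scalable_target(**params)
    return response["ScalableTargetARN"]


# Same values as the CLI example: one to eight instances.
variant_params = build_variant_target("my-endpoint", "my-variant", 1, 8)
print(variant_params)
```

Calling register_variant_target(variant_params) sends the request to your account, so run it only against an endpoint that you intend to scale.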
Registering the provisioned concurrency of serverless endpoints as scalable targets with Application Auto Scaling
Application Auto Scaling also requires a scalable target before you can create scaling policies or scheduled actions for the provisioned concurrency of serverless endpoints.
If you configure auto scaling using the SageMaker console, then SageMaker automatically registers a scalable target for you.
Otherwise, use one of the following methods to register the scalable target:
- AWS CLI:
  Call the register-scalable-target command for a production variant. The following example registers the provisioned concurrency for a production variant called my-variant, running on the my-endpoint endpoint, with a minimum capacity of one and a maximum capacity of ten.

    aws application-autoscaling register-scalable-target \
      --service-namespace sagemaker \
      --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
      --resource-id endpoint/my-endpoint/variant/my-variant \
      --min-capacity 1 \
      --max-capacity 10

  If successful, this command returns the ARN of the scalable target.

    {
        "ScalableTargetARN": "arn:aws:application-autoscaling:region:account-id:scalable-target/1234abcd56ab78cd901ef1234567890ab123"
    }

- AWS SDK:
  Call the RegisterScalableTarget operation and provide ResourceId, ScalableDimension, ServiceNamespace, MinCapacity, and MaxCapacity as parameters.
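For provisioned concurrency, a boto3 call might look like the following sketch. Only the scalable dimension differs from registering an instance count. The sketch assumes boto3 is installed and that credentials and a Region are configured. The helper names are hypothetical, and the endpoint and variant names are the placeholders from the CLI example.

```python
def build_concurrency_target(endpoint_name, variant_name, min_concurrency, max_concurrency):
    """Assemble RegisterScalableTarget parameters for provisioned concurrency
    (hypothetical helper)."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency",
        "MinCapacity": min_concurrency,
        "MaxCapacity": max_concurrency,
    }


def register_concurrency_target(params):
    """Send the request and return the ARN of the scalable target."""
    # Imported here so that building the parameters needs no AWS setup.
    import boto3

    client = boto3.client("application-autoscaling")
    response = client.register_scalable_target(**params)
    return response["ScalableTargetARN"]


# Same values as the CLI example: provisioned concurrency between 1 and 10.
concurrency_params = build_concurrency_target("my-endpoint", "my-variant", 1, 10)
print(concurrency_params)
```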
Registering inference components as scalable targets with Application Auto Scaling
Application Auto Scaling also requires a scalable target before you can create scaling policies or scheduled actions for inference components.
- AWS CLI:
  Call the register-scalable-target command for an inference component. The following example registers the desired copy count for an inference component called my-inference-component, with a minimum capacity of zero copies and a maximum capacity of three copies.

    aws application-autoscaling register-scalable-target \
      --service-namespace sagemaker \
      --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
      --resource-id inference-component/my-inference-component \
      --min-capacity 0 \
      --max-capacity 3

  If successful, this command returns the ARN of the scalable target.

    {
        "ScalableTargetARN": "arn:aws:application-autoscaling:region:account-id:scalable-target/1234abcd56ab78cd901ef1234567890ab123"
    }

- AWS SDK:
  Call the RegisterScalableTarget operation and provide ResourceId, ScalableDimension, ServiceNamespace, MinCapacity, and MaxCapacity as parameters.
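An equivalent boto3 sketch for inference components follows. The resource ID uses the inference-component/ prefix and the dimension is the copy count, with zero as the minimum, as in the CLI example. The sketch assumes boto3 is installed and that credentials and a Region are configured. The helper names are hypothetical.

```python
def build_component_target(component_name, min_copies, max_copies):
    """Assemble RegisterScalableTarget parameters for an inference component
    (hypothetical helper)."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "MinCapacity": min_copies,
        "MaxCapacity": max_copies,
    }


def register_component_target(params):
    """Send the request and return the ARN of the scalable target."""
    # Imported here so that building the parameters needs no AWS setup.
    import boto3

    client = boto3.client("application-autoscaling")
    response = client.register_scalable_target(**params)
    return response["ScalableTargetARN"]


# Same values as the CLI example: zero to three copies.
component_params = build_component_target("my-inference-component", 0, 3)
print(component_params)
```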
Related resources
If you are just getting started with Application Auto Scaling, you can find additional useful information about scaling your SageMaker resources in the Amazon SageMaker Developer Guide.
Note
In 2023, SageMaker introduced new inference capabilities built on real-time inference endpoints. You create a SageMaker endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. Then, create an inference component, which is a SageMaker hosting object that you can use to deploy a model to an endpoint. For information about scaling inference components, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.