Service-linked role Service principal Registering SageMaker endpoint variants as scalable targets with Application Auto Scaling Registering the provisioned concurrency of serverless endpoints as scalable targets with Application Auto Scaling Registering inference components as scalable targets with Application Auto Scaling Related resources

Amazon SageMaker and Application Auto Scaling

You can scale SageMaker endpoint variants, provisioned concurrency for serverless endpoints, and inference components using target tracking scaling policies, step scaling policies, and scheduled scaling.

Use the following information to help you integrate SageMaker with Application Auto Scaling.

Service-linked role created for SageMaker

The following service-linked role is automatically created in your AWS account when registering SageMaker resources as scalable targets with Application Auto Scaling. This role allows Application Auto Scaling to perform supported operations within your account. For more information, see Service-linked roles for Application Auto Scaling.

AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint

Service principal used by the service-linked role

The service-linked role in the previous section can be assumed only by the service principal authorized by the trust relationships defined for the role. The service-linked role used by Application Auto Scaling grants access to the following service principal:

sagemaker.application-autoscaling.amazonaws.com

Registering SageMaker endpoint variants as scalable targets with Application Auto Scaling

Application Auto Scaling requires a scalable target before you can create scaling policies or scheduled actions for a SageMaker model (variant). A scalable target is a resource that Application Auto Scaling can scale out and scale in. Scalable targets are uniquely identified by the combination of resource ID, scalable dimension, and namespace.

If you configure auto scaling using the SageMaker console, then SageMaker automatically registers a scalable target for you.

If you want to configure auto scaling using the AWS CLI or one of the AWS SDKs, you can use the following options:

AWS CLI:

Call the register-scalable-target command for a product variant. The following example registers the desired instance count for a product variant called my-variant, running on the my-endpoint endpoint, with a minimum capacity of one instance and a maximum capacity of eight instances.


aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --min-capacity 1 \
  --max-capacity 8

If successful, this command returns the ARN of the scalable target.


{
    "ScalableTargetARN": "arn:aws:application-autoscaling:region:account-id:scalable-target/1234abcd56ab78cd901ef1234567890ab123"
}

AWS SDK:

Call the RegisterScalableTarget operation and provide ResourceId, ScalableDimension, ServiceNamespace, MinCapacity, and MaxCapacity as parameters.

Registering the provisioned concurrency of serverless endpoints as scalable targets with Application Auto Scaling

Application Auto Scaling also requires a scalable target before you can create scaling policies or scheduled actions for the provisioned concurrency of serverless endpoints.

If you configure auto scaling using the SageMaker console, then SageMaker automatically registers a scalable target for you.

Otherwise, use one of the following methods to register the scalable target:

AWS CLI:

Call the register-scalable-target command for a product variant. The following example registers the provisioned concurrency for a product variant called my-variant, running on the my-endpoint endpoint, with a minimum capacity of one and a maximum capacity of ten.


aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --min-capacity 1 \
  --max-capacity 10

If successful, this command returns the ARN of the scalable target.


{
    "ScalableTargetARN": "arn:aws:application-autoscaling:region:account-id:scalable-target/1234abcd56ab78cd901ef1234567890ab123"
}

AWS SDK:

Call the RegisterScalableTarget operation and provide ResourceId, ScalableDimension, ServiceNamespace, MinCapacity, and MaxCapacity as parameters.

Registering inference components as scalable targets with Application Auto Scaling

Application Auto Scaling also requires a scalable target before you can create scaling policies or scheduled actions for inference components.

AWS CLI:

Call the register-scalable-target command for an inference component. The following example registers the desired copy count for an inference component called my-inference-component, with a minimum capacity of zero copies and a maximum capacity of three copies.


aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
  --resource-id inference-component/my-inference-component \
  --min-capacity 0 \
  --max-capacity 3

If successful, this command returns the ARN of the scalable target.


{
    "ScalableTargetARN": "arn:aws:application-autoscaling:region:account-id:scalable-target/1234abcd56ab78cd901ef1234567890ab123"
}

AWS SDK:

Call the RegisterScalableTarget operation and provide ResourceId, ScalableDimension, ServiceNamespace, MinCapacity, and MaxCapacity as parameters.

If you are just getting started with Application Auto Scaling, you can find additional useful information about scaling your SageMaker resources in the Amazon SageMaker Developer Guide:

Note

In 2023, SageMaker introduced new inference capabilities built on real-time inference endpoints. You create a SageMaker endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. Then, create an inference component, which is a SageMaker hosting object that you can use to deploy a model to an endpoint. For information about scaling inference components, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency and Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker on the AWS Blog.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Amazon Neptune

Spot Fleet (Amazon EC2)