Determine the performance characteristics Calculate the target load

Load testing your auto scaling configuration

Perform load tests to choose a scaling configuration that works the way you want.

The following guidelines for load testing assume you are using a scaling policy that uses the predefined target metric SageMakerVariantInvocationsPerInstance.

Topics

Determine the performance characteristics
Calculate the target load

Determine the performance characteristics

Perform load testing to find the peak InvocationsPerInstance that your model's production variant can handle, and the latency of requests, as concurrency increases.

This value depends on the instance type chosen, payloads that clients of your model typically send, and the performance of any external dependencies your model has.

To find the peak requests-per-second (RPS) your model's production variant can handle and latency of requests

Set up an endpoint with your model using a single instance. For information about how to set up an endpoint, see Deploy the Model to SageMaker AI Hosting Services.
Use a load testing tool to generate an increasing number of parallel requests, and monitor the RPS and model latency in the out put of the load testing tool.

Note
You can also monitor requests-per-minute instead of RPS. In that case don't multiply by 60 in the equation to calculate SageMakerVariantInvocationsPerInstance shown below.

When the model latency increases or the proportion of successful transactions decreases, this is the peak RPS that your model can handle.

Calculate the target load

After you find the performance characteristics of the variant, you can determine the maximum RPS we should allow to be sent to an instance. The threshold used for scaling must be less than this maximum value. Use the following equation in combination with load testing to determine the correct value for the SageMakerVariantInvocationsPerInstance target metric in your scaling configuration.


SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60

Where MAX_RPS is the maximum RPS that you determined previously, and SAFETY_FACTOR is the safety factor that you chose to ensure that your clients don't exceed the maximum RPS. Multiply by 60 to convert from RPS to invocations-per-minute to match the per-minute CloudWatch metric that SageMaker AI uses to implement auto scaling (you don't need to do this if you measured requests-per-minute instead of requests-per-second).

Note

SageMaker AI recommends that you start testing with a SAFETY_FACTOR of 0.5. Test your scaling configuration to ensure it operates in the way you expect with your model for both increasing and decreasing customer traffic on your endpoint.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Scale an endpoint to zero instances

Use AWS CloudFormation to create a scaling policy