Asynchronous Inference

Amazon SageMaker Asynchronous Inference is a capability in SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payloads (up to 1 GB), long processing times, and near real-time latency requirements. Asynchronous Inference also helps you save on costs by autoscaling the instance count to zero when there are no requests to process, so you pay only when your endpoint is processing requests.

How It Works

Creating an asynchronous inference endpoint is similar to creating a real-time inference endpoint. You can use your existing SageMaker models; you only need to specify the AsyncInferenceConfig object when you create your endpoint configuration with the CreateEndpointConfig API. To invoke the endpoint, place the request payload in Amazon S3 and pass a pointer to that payload as part of the InvokeEndpointAsync request. Upon invocation, SageMaker queues the request for processing and returns an identifier and an output location in its response. Once processing completes, SageMaker places the result in that Amazon S3 output location. You can optionally choose to receive success or error notifications through Amazon SNS. For more information about how to set up asynchronous notifications, see Check Prediction Results.
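The endpoint-configuration step above can be sketched as the request body you would pass to CreateEndpointConfig. This is a minimal illustration, not a complete setup: the model name, S3 bucket, SNS topic ARNs, and instance type are placeholder assumptions, and the live boto3 call is shown only in a comment.

```python
# Request body for CreateEndpointConfig, built as a plain dict. With boto3 it
# would be sent as:
#   boto3.client("sagemaker").create_endpoint_config(**endpoint_config_request)
# All names below (model, bucket, topics, instance type) are illustrative
# placeholders, not values prescribed by this guide.
endpoint_config_request = {
    "EndpointConfigName": "my-async-endpoint-config",
    "ProductionVariants": [
        {
            "VariantName": "variant-1",
            "ModelName": "my-existing-model",  # reuse an existing SageMaker model
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    # The presence of this object is what makes the endpoint asynchronous.
    "AsyncInferenceConfig": {
        "OutputConfig": {
            # SageMaker writes each inference result to this S3 location.
            "S3OutputPath": "s3://my-bucket/async-results/",
            # Optional SNS notifications on success or error.
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:us-east-1:111122223333:ok-topic",
                "ErrorTopic": "arn:aws:sns:us-east-1:111122223333:err-topic",
            },
        },
    },
}
```

NotificationConfig can be omitted if you prefer to poll the S3 output location instead of receiving SNS notifications.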

Note

The presence of an asynchronous inference configuration (AsyncInferenceConfig) object in the endpoint configuration means that the endpoint can receive only asynchronous invocations.
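Invocation then points at a payload already uploaded to Amazon S3 rather than sending it inline. The sketch below shows the parameters for InvokeEndpointAsync under assumed placeholder names (endpoint name, bucket, and object key are not from this guide); the live boto3 call appears only in a comment.

```python
# Request parameters for InvokeEndpointAsync, built as a plain dict. With
# boto3 it would be sent as:
#   boto3.client("sagemaker-runtime").invoke_endpoint_async(**invoke_request)
# The endpoint name and S3 key are illustrative placeholders.
invoke_request = {
    "EndpointName": "my-async-endpoint",
    # Pointer to a payload previously uploaded to Amazon S3; the payload
    # itself is never carried in the request body.
    "InputLocation": "s3://my-bucket/async-inputs/request.json",
    "ContentType": "application/json",
}

# The call returns immediately: SageMaker queues the request and responds
# with an inference identifier and the S3 location where the result will
# eventually be written, before any processing has happened.
```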

How Do I Get Started?

If you are a first-time user of Amazon SageMaker Asynchronous Inference, we recommend that you do the following: