Monitor asynchronous endpoint

You can monitor SageMaker using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. With Amazon CloudWatch, you can access historical information and gain a better perspective on how your web application or service is performing. For more information about Amazon CloudWatch, see What is Amazon CloudWatch?

Monitoring with CloudWatch

The following is an exhaustive list of the metrics published for asynchronous endpoints; all of them are in the AWS/SageMaker namespace. Any metric not listed below is not published when the endpoint is enabled for asynchronous inference. Such unpublished metrics include (but are not limited to):

  • OverheadLatency

  • Invocations

  • InvocationsPerInstance

Common Endpoint Metrics

These metrics are the same as the metrics published for real-time endpoints today. For more information about other metrics in Amazon CloudWatch, see Monitor SageMaker with Amazon CloudWatch.


Invocation4XXErrors

The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent.

Units: None

Valid statistics: Average, Sum

Invocation5XXErrors

The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent.

Units: None

Valid statistics: Average, Sum

ModelLatency

The interval of time taken by a model to respond, as viewed from SageMaker. This interval includes the local communication time to send the request to the model container and fetch the response from it, plus the time taken to complete the inference inside the container.

Units: Microseconds

Valid statistics: Average, Sum, Min, Max, Sample Count
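
You can retrieve these metrics with the AWS SDK. The following is a minimal boto3 sketch that pulls the last hour of ModelLatency datapoints; the endpoint name my-async-endpoint and the variant name AllTraffic are placeholder values, and the example assumes the usual EndpointName and VariantName dimensions used for real-time endpoint metrics:

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder names -- replace with your own endpoint and production variant.
endpoint_name = "my-async-endpoint"
variant_name = "AllTraffic"

now = datetime.datetime.now(datetime.timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": variant_name},
    ],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Average", "Maximum"],
    Unit="Microseconds",
)

# Print the datapoints in chronological order.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```

Because ModelLatency is reported in microseconds, divide the returned values by 1,000,000 to express them in seconds.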

Asynchronous Inference Endpoint Metrics

These metrics are published for endpoints enabled for asynchronous inference. The following metrics are published with the EndpointName dimension:


ApproximateBacklogSize

The number of items in the queue for an endpoint that are currently being processed or yet to be processed.

Units: Count

Valid statistics: Average, Max, Min

ApproximateBacklogSizePerInstance

The number of items in the queue divided by the number of instances behind an endpoint. This metric is primarily used for setting up application autoscaling for an async-enabled endpoint (see the autoscaling sketch that follows these metrics).

Units: Count

Valid statistics: Average, Max, Min

ApproximateAgeOfOldestRequest

Age of the oldest request in the queue.

Units: Seconds

Valid statistics: Average, Max, Min

HasBacklogWithoutCapacity

The value of this metric is 1 when there are requests in the queue but zero instances behind the endpoint. The value is 0 at all other times. You can use this metric for autoscaling your endpoint up from zero instances upon receiving a new request in the queue.

Units: Count

Valid statistics: Average
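
As a sketch of the autoscaling setup referenced above, the following boto3 example registers an endpoint variant as a scalable target that can scale down to zero instances and attaches a target-tracking policy on ApproximateBacklogSizePerInstance. The endpoint name, variant name, capacity limits, and target of 5 queued items per instance are placeholder values, not recommendations:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder names -- substitute your own endpoint and production variant.
endpoint_name = "my-async-endpoint"
variant_name = "AllTraffic"
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

# Allow the variant to scale between 0 and 4 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=4,
)

# Target-tracking policy that keeps roughly 5 queued items per instance.
autoscaling.put_scaling_policy(
    PolicyName="backlog-per-instance-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)
```

To scale out again from zero instances, the HasBacklogWithoutCapacity metric described above is typically used as the trigger for an additional scaling policy, since there is no per-instance backlog to track while no instances are running.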

The following metrics are published with the EndpointName and VariantName dimensions:


RequestDownloadFailures

The number of inference failures caused by a problem downloading the request from Amazon S3.

Units: Count

Valid statistics: Sum

ResponseUploadFailures

The number of inference failures caused by a problem uploading the response to Amazon S3.

Units: Count

Valid statistics: Sum

NotificationFailures

The number of failures that occur when publishing notifications.

Units: Count

Valid statistics: Sum

RequestDownloadLatency

Total time to download the request payload.

Units: Microseconds

Valid statistics: Average, Sum, Min, Max, Sample Count

ResponseUploadLatency

Total time to upload the response payload.

Units: Microseconds

Valid statistics: Average, Sum, Min, Max, Sample Count

ExpiredRequests

Number of requests in the queue that fail due to reaching their specified request TTL.

Units: Count

Valid statistics: Sum

InvocationFailures

The number of invocations that fail for any reason (see the alarm example that follows these metrics).

Units: Count

Valid statistics: Sum

InvocationsProcesssed

Number of async invocations processed by the endpoint.

Units: Count

Valid statistics: Sum

TimeInBacklog

Total time the request was queued before being processed. This does not include the actual processing time (that is, download time, upload time, and model latency).

Units: Milliseconds

Valid statistics: Average, Sum, Min, Max, Sample Count

TotalProcessingTime

The time from when the inference request was received by SageMaker to when the request finished processing. This includes the time spent in the backlog and the time taken to upload the response and send notifications, if any.

Units: Milliseconds

Valid statistics: Average, Sum, Min, Max, Sample Count
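
The failure counters above (for example, InvocationFailures, RequestDownloadFailures, and NotificationFailures) are natural candidates for CloudWatch alarms. The following boto3 sketch, using placeholder endpoint and variant names and an arbitrary five-minute period, raises an alarm whenever any invocation fails:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder names -- replace with your own endpoint and production variant.
endpoint_name = "my-async-endpoint"
variant_name = "AllTraffic"

cloudwatch.put_metric_alarm(
    AlarmName=f"{endpoint_name}-invocation-failures",
    Namespace="AWS/SageMaker",
    MetricName="InvocationFailures",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": variant_name},
    ],
    Statistic="Sum",
    Period=300,               # evaluate over 5-minute windows
    EvaluationPeriods=1,
    Threshold=1.0,            # alarm on one or more failures
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    # AlarmActions=["arn:aws:sns:..."],  # optionally notify an SNS topic
)
```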

Amazon SageMaker Asynchronous Inference also includes host-level metrics. For information on host-level metrics, see SageMaker Jobs and Endpoint Metrics.

Logs

In addition to the model container logs that are published to Amazon CloudWatch in your account, you also get a platform log for tracing and debugging inference requests.

These logs are published under the endpoint log group:

/aws/sagemaker/Endpoints/[EndpointName]

The log stream name consists of:

[production-variant-name]/[instance-id]/data-log.

Log lines contain the request’s inference ID so that errors can be easily mapped to a particular request.
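
For example, a minimal boto3 sketch for pulling the platform log lines for a single request could look like the following; the endpoint name and inference ID are placeholder values:

```python
import boto3

logs = boto3.client("logs")

# Placeholder values -- replace with your own endpoint name and the
# inference ID of the request you want to trace.
endpoint_name = "my-async-endpoint"
inference_id = "00000000-1111-2222-3333-444444444444"

response = logs.filter_log_events(
    logGroupName=f"/aws/sagemaker/Endpoints/{endpoint_name}",
    filterPattern=f'"{inference_id}"',  # match log lines containing the ID
)

# Each event includes the log stream name ([variant]/[instance-id]/data-log)
# and the log message itself.
for event in response["events"]:
    print(event["logStreamName"], event["message"])
```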