
Amazon SageMaker AI detailed observability for inference endpoints

Detailed observability is a next-generation metrics experience for Amazon SageMaker AI real-time inference endpoints. Built on OpenTelemetry (OTel), it collects fine-grained operational metrics from GPU, node, and inference framework layers and publishes them to Amazon CloudWatch with rich labels including endpoint name, inference component name, instance ID, availability zone, and instance type.

Key capabilities

OpenTelemetry-native collection. Metrics are collected using an OTel Collector that scrapes Prometheus endpoints from DCGM (GPU metrics), node exporters (CPU, memory, disk), and inference framework containers (vLLM, SGLang).
Rich dimensional labels. Every metric is published with labels such as aws.sagemaker.endpoint.name, aws.sagemaker.inference_component.name, @resource.host.id, @resource.cloud.availability_zone, and @resource.host.type for precise filtering and aggregation.
Per-GPU attribution. GPU metrics (DCGM) include per-inference-component attribution, allowing you to identify which model is consuming GPU resources on multi-tenant instances.
Inference framework metrics. Native vLLM and SGLang metrics — including tokens per second, time to first token (TTFT), inter-token latency, KV cache utilization, queue depth, and batch size — are available without custom instrumentation.
PromQL query support. Query metrics using PromQL syntax in Amazon CloudWatch, CloudWatch Query Studio, or Amazon Managed Grafana.
Configurable scrape frequency. Control how often metrics are collected by setting MetricPublishFrequencyInSeconds to one of 10, 30, 60, 120, 180, 240, or 300 seconds. The default is 60 seconds. Control plane metrics such as lifecycle, autoscaling, and insufficient capacity error (ICE) diagnostics are event-driven and are not affected by this setting.
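As an illustration of the scrape frequency constraint, the following minimal Python sketch validates a candidate value before it is used in an endpoint configuration. The helper names resolve_publish_frequency and nearest_valid_frequency are hypothetical and are not part of any AWS SDK; only the parameter name, the list of valid values, and the default come from this page.

```python
from typing import Optional

# Valid values and default for MetricPublishFrequencyInSeconds, as listed above.
VALID_PUBLISH_FREQUENCIES = (10, 30, 60, 120, 180, 240, 300)
DEFAULT_PUBLISH_FREQUENCY = 60


def resolve_publish_frequency(seconds: Optional[int] = None) -> int:
    """Return a valid publish frequency, or raise ValueError.

    Hypothetical helper: passing None selects the documented default.
    """
    if seconds is None:
        return DEFAULT_PUBLISH_FREQUENCY
    if seconds not in VALID_PUBLISH_FREQUENCIES:
        raise ValueError(
            f"MetricPublishFrequencyInSeconds must be one of "
            f"{VALID_PUBLISH_FREQUENCIES}, got {seconds!r}"
        )
    return seconds


def nearest_valid_frequency(seconds: float) -> int:
    """Return the valid value closest to the request (ties pick the smaller)."""
    return min(VALID_PUBLISH_FREQUENCIES, key=lambda v: (abs(v - seconds), v))
```

A check like this fails fast on an unsupported value such as 15 instead of deferring the error to the service call.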

Note

Detailed observability publishes OpenTelemetry (OTel) metrics to Amazon CloudWatch via OTLP. These are not Prometheus metrics. The metrics are natively stored in Amazon CloudWatch as OTel metric data and are queryable using PromQL syntax. PromQL is supported as a query language only — no Prometheus server or Prometheus-compatible backend is involved.
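Because the labels contain dots, a PromQL selector may need quoted label names. The following Python sketch builds a selector string using the quoted-name syntax from Prometheus 3.0. It is an assumption that the CloudWatch PromQL implementation accepts this form, and the metric name in the usage note is purely illustrative; see the OpenTelemetry metrics reference for the actual names.

```python
import re
from typing import Mapping

# Names matching these patterns can appear unquoted in classic PromQL.
_PLAIN_METRIC = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
_PLAIN_LABEL = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def _quote(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes, quotes, and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def build_selector(metric: str, labels: Mapping[str, str]) -> str:
    """Build a PromQL instant-vector selector with equality matchers.

    Assumption: names with dots or other special characters use the
    quoted form, e.g. {"my.metric", "my.label"="value"}.
    """
    parts = []
    prefix = metric
    if not _PLAIN_METRIC.match(metric):
        # A quoted metric name moves inside the braces.
        prefix = ""
        parts.append(_quote(metric))
    for name, value in labels.items():
        key = name if _PLAIN_LABEL.match(name) else _quote(name)
        parts.append(key + "=" + _quote(value))
    return prefix + "{" + ", ".join(parts) + "}"
```

For example, build_selector("gpu_utilization", {"aws.sagemaker.endpoint.name": "my-endpoint"}) returns gpu_utilization{"aws.sagemaker.endpoint.name"="my-endpoint"}, where both the metric name and the endpoint name are placeholders.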

What's included

Detailed observability metric categories
Category	Metrics	Scope	Frequency
Inference framework (vLLM/SGLang)	TTFT, inter-token latency (ITL), KV cache utilization, queue depth, batch size, tokens per second (TPS), concurrent requests	Per inference component (IC) for inference component endpoints, per-instance/per-endpoint for SME	Configurable
GPU health (DCGM)	GPU utilization, memory copy utilization, GPU temperature	Per-instance, per-GPU	Configurable
Node health	CPU, memory, disk, filesystem	Per-instance	Configurable
Inference component placement and high availability	IC copy count, copies per AZ, AZ skew, IC per instance, instances per AZ	Per-endpoint	Periodic
Lifecycle	Model download time, GPU load time, container start, cold start	Per-IC, per-endpoint	Event-driven
Autoscaling	Scaling events, E2E latency, rebalancing	Per-endpoint	Event-driven
ICE diagnostics	ICE count, failed type, failed AZ	Per-endpoint	Event-driven

For the complete list, see OpenTelemetry metrics reference.
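To make the Frequency column easier to reason about, the following Python sketch encodes the table above and lists the categories whose collection interval follows MetricPublishFrequencyInSeconds. The MetricCategory type and the function name are hypothetical; the rows are copied from the table.

```python
from typing import List, NamedTuple


class MetricCategory(NamedTuple):
    name: str
    scope: str
    frequency: str  # "Configurable", "Periodic", or "Event-driven"


# Rows transcribed from the metric categories table.
CATEGORIES = [
    MetricCategory("Inference framework (vLLM/SGLang)", "Per-IC or per-instance/per-endpoint", "Configurable"),
    MetricCategory("GPU health (DCGM)", "Per-instance, per-GPU", "Configurable"),
    MetricCategory("Node health", "Per-instance", "Configurable"),
    MetricCategory("Inference component placement and high availability", "Per-endpoint", "Periodic"),
    MetricCategory("Lifecycle", "Per-IC, per-endpoint", "Event-driven"),
    MetricCategory("Autoscaling", "Per-endpoint", "Event-driven"),
    MetricCategory("ICE diagnostics", "Per-endpoint", "Event-driven"),
]


def categories_following_publish_frequency() -> List[str]:
    """Return the categories whose interval is set by the scrape frequency."""
    return [c.name for c in CATEGORIES if c.frequency == "Configurable"]
```

Under this reading of the table, changing the scrape frequency affects only the inference framework, GPU health, and node health categories.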

Architecture and data flow

Architecture diagram showing the metric collection pipeline for detailed observability.

Each endpoint instance exposes metrics from multiple sources. The OTel Collector scrapes these sources, enriches the data with context labels, and exports it to your Amazon CloudWatch account.

The model container, DCGM exporter, and node exporter expose Prometheus-compatible metrics on the instance (inference framework metrics, GPU metrics, and CPU/memory/disk metrics, respectively).
The OTel Collector scrapes these endpoints and enriches each metric with labels such as endpoint name, inference component name, instance ID, and availability zone.
The enriched metrics are exported via OTLP to Amazon CloudWatch in your account.
The metrics are queryable via PromQL in Amazon CloudWatch at https://monitoring.region.amazonaws.com, where region is your AWS Region.
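The scrape and enrich steps above can be sketched in Python as follows. This is a conceptual illustration only and not the OTel Collector's implementation: it parses a simplified subset of the Prometheus text exposition format and merges context labels into each sample. The sample input and all label values are made up, and the OTLP export step is omitted.

```python
import re
from typing import Dict, List, Mapping, NamedTuple


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float


# Simplified grammar: name, optional {labels}, value, optional integer timestamp.
_LINE = re.compile(
    r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$"
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text: str) -> List[Sample]:
    """Parse metric lines; skip comments, blanks, and unrecognized lines.

    Escape sequences inside label values are kept as-is (not unescaped).
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append(Sample(name, labels, float(raw_value)))
    return samples


def enrich(samples: List[Sample], context: Mapping[str, str]) -> List[Sample]:
    """Return new samples with the context labels merged into each one."""
    return [s._replace(labels={**s.labels, **context}) for s in samples]


# Illustrative input resembling DCGM exporter and vLLM output.
SAMPLE_EXPOSITION = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
vllm:num_requests_waiting{model_name="demo"} 3
"""

# Hypothetical context values; the label names are taken from this page.
CONTEXT = {
    "aws.sagemaker.endpoint.name": "my-endpoint",
    "@resource.host.id": "i-0123456789abcdef0",
    "@resource.cloud.availability_zone": "us-east-1a",
}

ENRICHED = enrich(parse_exposition(SAMPLE_EXPOSITION), CONTEXT)
```

Each entry in ENRICHED keeps its original labels (for example gpu="0") and gains the endpoint, instance, and availability zone labels, which is what makes per-endpoint filtering possible downstream.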

Pricing

Detailed observability metrics are included at no additional cost. For information about Amazon CloudWatch data ingestion costs related to OTel enrichment, see Amazon CloudWatch Pricing.
