Optimize model inference for latency
Note
The Latency Optimized Inference feature is in preview release for Amazon Bedrock and is subject to change.
Latency-optimized inference for foundation models in Amazon Bedrock delivers faster
response times and improved responsiveness for AI applications. The optimized versions of Amazon Nova Pro,
Anthropic's Claude 3.5 Haiku, and Meta's Llama 3.1 70B and 405B models offer significantly reduced latency without compromising accuracy.
Accessing the latency optimization capability requires no additional setup or model fine-tuning, so you can immediately enhance existing applications with faster response times. Set the "latency" parameter to "optimized" when calling the Amazon Bedrock runtime API. If you select "standard" as your invocation option, your request is served by standard inference. By default, all requests are routed through "standard".
```
"performanceConfig" : { "latency" : "standard | optimized" }
```
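For example, with the AWS SDK for Python (Boto3), this configuration can be passed to the Converse API. This is a minimal sketch: the inference profile ID shown is illustrative (check the console for current identifiers), and your Boto3 version must be recent enough to support the performanceConfig parameter.

```python
import boto3

# Create a Bedrock Runtime client in a Region that supports
# latency-optimized inference (us-east-2 here).
client = boto3.client("bedrock-runtime", region_name="us-east-2")

# Set "latency" to "optimized" in performanceConfig to request
# latency-optimized inference for this call.
response = client.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # example inference profile ID
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize latency-optimized inference in one sentence."}],
        }
    ],
    performanceConfig={"latency": "optimized"},
)

print(response["output"]["message"]["content"][0]["text"])
```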
Once you reach the usage quota for latency optimization for a model, we attempt to serve the request with standard latency; in that case, the request is charged at standard latency rates. The latency configuration for a served request is visible in the API response and in AWS CloudTrail logs. You can also view metrics for latency-optimized requests in Amazon CloudWatch logs under "model-id+latency-optimized".
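Continuing the sketch above, the latency configuration that actually served the request can be read back from the response (assuming the performanceConfig field is present in the response as described):

```python
# If the optimized-latency quota was exhausted, the request falls back
# to standard latency; the response reports what was actually served.
served = response.get("performanceConfig", {}).get("latency", "standard")
print(f"Request served with {served} latency")
```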
Latency-optimized inference is available for Meta's Llama 3.1 70B and 405B, as well as Anthropic's Claude 3.5 Haiku, in the US East (Ohio) and US West (Oregon) regions via cross-region inference.
Latency-optimized inference is available for Amazon Nova Pro in the US East (N. Virginia), US East (Ohio), and US West (Oregon) regions via cross-region inference.
For more information about pricing, visit the Amazon Bedrock pricing page.
Note
Latency-optimized inference for Llama 3.1 405B currently supports requests with a total input and output token count of up to 11K. For requests with larger token counts, we fall back to standard mode.
| Provider | Model | Regions supporting inference profile |
| --- | --- | --- |
| Amazon | Nova Pro | us-east-1, us-east-2 |
| Anthropic | Claude 3.5 Haiku | us-east-2, us-west-2 |
| Meta | Llama 3.1 405B Instruct | us-east-2 |
| Meta | Llama 3.1 70B Instruct | us-east-2, us-west-2 |