Improve resilience with cross-region inference
When running model inference in on-demand mode, your requests might be restricted by service quotas or during peak usage times. Cross-region inference enables you to seamlessly manage unplanned traffic bursts by utilizing compute across different AWS Regions. With cross-region inference, you can distribute traffic across multiple AWS Regions, enabling higher throughput and enhanced resilience during periods of peak demand.
To use cross-region inference, you include an inference profile when running model inference in the following ways:
- On-demand model inference – Specify the ID of the inference profile as the `modelId` when sending an InvokeModel, InvokeModelWithResponseStream, Converse, or ConverseStream request. An inference profile defines one or more Regions to which it can route inference requests originating from your source Region. Cross-region inference increases throughput and improves resiliency by dynamically routing model invocation requests across the Regions defined in the inference profile, factoring in user traffic, demand, and resource utilization. The response is returned to the source Region. For more information, see Submit prompts and generate responses with model inference.
- Batch inference – Submit requests asynchronously by specifying the ID of the inference profile as the `modelId` when sending a CreateModelInvocationJob request. Using an inference profile lets you utilize compute across multiple AWS Regions and achieve faster processing times for your batch jobs. After the job is complete, you can retrieve the output files from the Amazon S3 bucket in the source Region.
- Knowledge base response generation – You can use cross-region inference when generating a response after querying a knowledge base or when parsing non-textual information in a data source. For more information, see Test your knowledge base with queries and responses and Parsing options for your data source.
- Model evaluation – You can submit an inference profile as the model to evaluate when creating a model evaluation job. For more information, see Evaluate the performance of Amazon Bedrock resources.
- Prompt management – You can use cross-region inference when generating a response for a prompt that you created in Prompt management. For more information, see Construct and store reusable prompts with Prompt management in Amazon Bedrock.
- Prompt flows – You can use cross-region inference when generating a response for a prompt that you define inline in a prompt node in a prompt flow. For more information, see Build an end-to-end generative AI workflow with Amazon Bedrock Flows.
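For the on-demand case, the call is identical to a normal model invocation except that an inference profile ID takes the place of a model ID. The following is a minimal sketch, assuming a hypothetical US inference profile ID; look up the IDs actually available to you in the Amazon Bedrock console or with the ListInferenceProfiles API:

```python
# Sketch: invoking a model through a cross-region inference profile.
# The profile ID below is an assumption for illustration -- check the
# Amazon Bedrock console or ListInferenceProfiles for real IDs.
INFERENCE_PROFILE_ID = "us.anthropic.claude-3-sonnet-20240229-v1:0"

def build_converse_request(prompt: str, profile_id: str = INFERENCE_PROFILE_ID) -> dict:
    """Build keyword arguments for a Converse call. The inference profile ID
    goes in the modelId field, exactly where a model ID would otherwise go."""
    return {
        "modelId": profile_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 512},
    }

# Sending the request requires AWS credentials and model access in the source Region:
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   response = client.converse(**build_converse_request("Summarize this document."))
#   print(response["output"]["message"]["content"][0]["text"])
```

You are billed according to the Region you call from (us-east-1 here), regardless of which Region in the profile ultimately fulfills the request.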
To see the Regions and models with which you can use inference profiles to run cross-region inference, refer to Supported Regions and models for inference profiles.
To learn how to use an inference profile to send model invocation requests across Regions, see Use an inference profile in model invocation.
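The batch inference option described above follows the same pattern: the inference profile ID is passed as the `modelId` of a CreateModelInvocationJob request. A minimal sketch follows; the job name, role ARN, profile ID, and S3 URIs are all placeholders for illustration, not real resources:

```python
# Sketch: submitting a batch inference job through an inference profile.
# All names, ARNs, and S3 URIs below are placeholders.
def build_batch_job_request(job_name: str, role_arn: str, profile_id: str,
                            input_s3_uri: str, output_s3_uri: str) -> dict:
    """Build keyword arguments for a CreateModelInvocationJob call that
    routes batch records through a cross-region inference profile."""
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "modelId": profile_id,  # inference profile ID instead of a model ID
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3_uri}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    }

# Submitting the job requires AWS credentials and an IAM role that Bedrock can assume:
#   import boto3
#   bedrock = boto3.client("bedrock", region_name="us-east-1")
#   job = bedrock.create_model_invocation_job(**build_batch_job_request(
#       "my-batch-job",
#       "arn:aws:iam::111122223333:role/BedrockBatchRole",
#       "us.anthropic.claude-3-sonnet-20240229-v1:0",
#       "s3://amzn-s3-demo-input-bucket/records.jsonl",
#       "s3://amzn-s3-demo-output-bucket/results/",
#   ))
#   print(job["jobArn"])
```

When the job completes, the output files land in the output bucket in the source Region, even if individual records were processed elsewhere.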
You can also increase throughput for a model by purchasing Provisioned Throughput. Inference profiles currently don't support Provisioned Throughput.
Note the following information about cross-region inference:
- There's no additional routing cost for using cross-region inference. The price is calculated based on the Region from which you call an inference profile. For information about pricing, see Amazon Bedrock pricing.
- When using cross-region inference, your throughput can reach up to double the allocated quota in the Region that the inference profile is in. The increased throughput applies only to invocations performed through inference profiles; the regular quota still applies if you opt for in-Region model invocation requests. For example, if you invoke the US Anthropic Claude 3 Sonnet inference profile in us-east-1, your throughput can reach up to 1,000 requests per minute and 2,000,000 tokens per minute. To see the default quotas for on-demand throughput, refer to the Runtime quotas section in Quotas for Amazon Bedrock, or use the Service Quotas console.
- Cross-region inference requests are kept within the Regions that are part of the inference profile that was used. For example, a request made with an EU inference profile is kept within EU Regions.
To learn more about cross-region inference, see Getting started with cross-region inference in Amazon Bedrock.