How inference works in Amazon Bedrock
When you submit an input to a model, the model predicts a probable sequence of tokens that follows, and returns that sequence as the output. Amazon Bedrock lets you run inference with the foundation model of your choice. When you run inference, you provide the following inputs (a brief example follows this list):
- Prompt – An input provided to the model for it to generate a response. For information about writing prompts, see Prompt engineering concepts. For information about protecting against prompt injection attacks, see Prompt injection security.
- Model – A foundation model or inference profile to run inference with. The model or inference profile that you choose also specifies a level of throughput, which defines the number and rate of input and output tokens that you can process. For more information about the foundation models that are available in Amazon Bedrock, see Amazon Bedrock foundation model information. For more information about inference profiles, see Set up a model invocation resource using inference profiles. For more information about increasing throughput, see Increase throughput with cross-Region inference and Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock.
- Inference parameters – A set of values that you can adjust to limit or influence the model response. For information about inference parameters, see Influence response generation with inference parameters and Inference request parameters and response fields for foundation models.
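As an illustrative sketch, the following example uses the AWS SDK for Python (Boto3) and the Converse API to pass all three inputs. The Region, model ID, prompt text, and inference parameter values are placeholder assumptions; substitute a model or inference profile that you have access to.

```python
import boto3

# Amazon Bedrock Runtime client in a placeholder Region.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder model ID; you can also pass an inference profile ID here.
model_id = "anthropic.claude-3-haiku-20240307-v1:0"

response = client.converse(
    modelId=model_id,
    # Prompt: the input that the model generates a response to.
    messages=[
        {"role": "user", "content": [{"text": "Summarize what a foundation model is."}]}
    ],
    # Inference parameters: values that limit or influence the response.
    inferenceConfig={"maxTokens": 256, "temperature": 0.5, "topP": 0.9},
)

print(response["output"]["message"]["content"][0]["text"])
```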
Invoking models in different AWS Regions
When you invoke a model, you choose the AWS Region in which to invoke it. The quotas for the frequency and size of the requests that you can make depend on the Region. You can find these quotas by searching for the following quotas at Amazon Bedrock service quotas:
- On-demand model inference requests per minute for ${Model}
- On-demand InvokeModel tokens per minute for ${Model}
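If you prefer to check these values programmatically, the following sketch uses Boto3 and the Service Quotas API to list Amazon Bedrock quotas in a placeholder Region and print the on-demand ones; the name filter is an assumption based on the quota names above.

```python
import boto3

# Service Quotas client in the Region whose limits you want to inspect.
client = boto3.client("service-quotas", region_name="us-east-1")

# Page through all Amazon Bedrock quotas and print the on-demand inference quotas.
paginator = client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for quota in page["Quotas"]:
        if quota["QuotaName"].startswith("On-demand"):
            print(f'{quota["QuotaName"]}: {quota["Value"]}')
```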
You can also invoke an inference profile instead of the foundation model itself. An inference profile defines a model and one or more Regions to which the inference profile can route model invocation requests. By invoking an inference profile that includes multiple Regions, you can increase your throughput; an invocation example follows the quota list below. For more information, see Increase throughput with cross-Region inference. To see the quotas for the frequency and size of the requests that you can make with an inference profile, search for the following quotas at Amazon Bedrock service quotas:
- Cross-Region InvokeModel requests per minute for ${Model}
- Cross-Region InvokeModel tokens per minute for ${Model}
- Global Cross-Region InvokeModel requests per minute for ${Model}
- Global Cross-Region InvokeModel tokens per minute for ${Model}
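As a sketch of invoking an inference profile, you pass the inference profile ID wherever a model ID is accepted. The Region and the US cross-Region inference profile ID shown below are illustrative assumptions; verify the exact IDs available in your account (for example, with ListInferenceProfiles) before relying on them.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Example US cross-Region inference profile ID (assumed for illustration).
inference_profile_id = "us.anthropic.claude-3-haiku-20240307-v1:0"

response = client.converse(
    modelId=inference_profile_id,  # an inference profile ID is accepted in place of a model ID
    messages=[{"role": "user", "content": [{"text": "Hello!"}]}],
    inferenceConfig={"maxTokens": 128},
)
print(response["output"]["message"]["content"][0]["text"])
```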
Requests made to a Region may be served out of Local Zones that share the same parent Region. For example, requests made to US East (N. Virginia) (us-east-1) may be served out of any Local Zone associated with it, such as Atlanta, US (us-east-1-atl-2a).
The same principle applies when using cross-Region inference. For example, requests made to the US Anthropic Claude 3 Haiku inference profile may be served out of any Local Zone whose parent Region is in the US, such as Seattle, US (us-west-2-sea-1a). When new Local Zones are added to AWS, they will also be added to the corresponding cross-Region inference endpoint.
To see a list of Local Zones and the parent Regions they're associated with, see AWS Local Zones Locations.
When you invoke a cross-Region inference profile in Amazon Bedrock, your request originates from a source Region and is automatically routed to one of the destination Regions defined in that profile, optimizing for performance. The destination Regions for a Global cross-Region inference profile include all commercial Regions.
The Global cross-Region inference profile for a specific model can change over time as AWS adds more commercial Regions where your requests can be processed. However, if an inference profile is tied to a geography (such as US, EU, or APAC), its list of destination Regions will never change. AWS might create new inference profiles that incorporate new Regions. You can update your systems to use these inference profiles by changing the IDs in your configuration to the new ones.
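To find the IDs of the inference profiles currently available to you (for example, after AWS creates new ones), a minimal sketch using Boto3 and the ListInferenceProfiles API might look like the following; the Region is a placeholder assumption.

```python
import boto3

# The Amazon Bedrock control-plane client ("bedrock", not "bedrock-runtime")
# is used to list inference profiles.
client = boto3.client("bedrock", region_name="us-east-1")

# List the system-defined (cross-Region) inference profiles and print their IDs.
response = client.list_inference_profiles(typeEquals="SYSTEM_DEFINED")
for profile in response["inferenceProfileSummaries"]:
    print(profile["inferenceProfileId"], "-", profile["inferenceProfileName"])
```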
Note
The destination Regions in a cross-Region inference profile can include opt-in Regions, which are Regions that you must explicitly enable at the AWS account or organization level. To learn more, see Enable or disable AWS Regions in your account. When using a cross-Region inference profile, your inference request can be routed to any of the destination Regions in the profile, even if you have not opted in to those Regions in your account.
Service control policies (SCPs) and AWS Identity and Access Management (IAM) policies work together to control where cross-Region inference is allowed. Using SCPs, you can control which Regions Amazon Bedrock can use for inference, and using IAM policies, you can define which users or roles have permission to run inference. If any destination Region in a cross-Region inference profile is blocked by your SCPs, the request will fail even if other Regions remain allowed. To ensure efficient operation with cross-Region inference, update your SCPs and IAM policies to allow all required Amazon Bedrock inference actions (for example, bedrock:InvokeModel* or bedrock:CreateModelInvocationJob) in all destination Regions included in your chosen inference profile. To learn more, see https://aws.amazon.com/blogs/machine-learning/enable-amazon-bedrock-cross-region-inference-in-multi-account-environments/.
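As a hedged sketch only, an identity-based IAM policy that allows the inference actions in the destination Regions of an inference profile might be shaped like the following, expressed here as a Python dictionary. The Regions, the account ID, and the wildcard resource ARNs are assumptions to adapt to your own inference profile and security requirements.

```python
import json

# Example identity-based policy allowing Amazon Bedrock inference actions on
# foundation models and inference profiles in assumed destination Regions
# (us-east-1 and us-west-2). 111122223333 is a placeholder account ID.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
                "bedrock:CreateModelInvocationJob",
            ],
            "Resource": [
                "arn:aws:bedrock:us-east-1::foundation-model/*",
                "arn:aws:bedrock:us-west-2::foundation-model/*",
                "arn:aws:bedrock:us-east-1:111122223333:inference-profile/*",
                "arn:aws:bedrock:us-west-2:111122223333:inference-profile/*",
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```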