Inference parameters for foundation models
Run inference using any of the foundation models in Amazon Bedrock. Optionally, set inference parameters to influence the response that the model generates. You set inference parameters in a playground in the console, or in the body field of an InvokeModel or InvokeModelWithResponseStream request.
When you call a model, you also include a prompt for the model. For information about writing prompts, see Prompt engineering guidelines.
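As a minimal sketch, not a definitive implementation, the following Python example sets several inference parameters in the body of an InvokeModel request through boto3. The body schema and parameter names vary by model provider; this example assumes an Anthropic Claude text-completion body, and the Region, model ID, prompt, and parameter values are placeholders.

```python
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Inference parameters go in the request body, alongside the prompt.
body = json.dumps({
    "prompt": "\n\nHuman: Explain temperature sampling in one sentence.\n\nAssistant:",
    "max_tokens_to_sample": 200,        # cap on response length
    "temperature": 0.5,                 # randomness
    "top_k": 250,                       # keep only the 250 most probable tokens
    "top_p": 0.9,                       # cumulative-probability cutoff
    "stop_sequences": ["\n\nHuman:"],   # stop generating at this sequence
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-v2",
    body=body,
    contentType="application/json",
    accept="application/json",
)
print(json.loads(response["body"].read())["completion"])
```

InvokeModelWithResponseStream accepts the same body; only the way the response is returned differs.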
The following sections define the inference parameters available for each base model. For a custom model, use the same inference parameters as the base model from which it was customized.
Inference parameter definitions
Typically, foundation models support the following types of inference parameters.
Randomness and diversity
Foundation models typically support the following parameters to control randomness and diversity in the response.
- Temperature – Large language models use probability to construct the words in a sequence. For any given sequence, there is a probability distribution over the options for the next word. When you set the temperature closer to zero, the model tends to select the higher-probability words. When you set the temperature further away from zero, the model may select lower-probability words.

  In technical terms, temperature modulates the probability density function for the next tokens, implementing the temperature sampling technique. The parameter can steepen or flatten the curve of the density function. A lower value results in a steeper curve and more deterministic responses; a higher value results in a flatter curve and more random responses.
- Top K – While temperature shapes the probability distribution of potential words, Top K sets a cutoff on how many of those words the model selects from. For example, if K = 50, the model selects from the 50 most probable words that could come next in the sequence. Lowering the Top K value reduces the probability that an unusual word is selected next.

  In technical terms, Top K is the number of highest-probability vocabulary tokens to keep for Top-K filtering.
- Top P – Top P defines a cutoff based on the sum of the probabilities of the potential choices. If you set Top P below 1.0, the model considers only the most probable options and ignores less probable ones. Top P is similar to Top K, but instead of capping the number of choices, it caps the choices based on the sum of their probabilities.

For the example prompt "I hear the hoof beats of ," you might want the model to provide "horses," "zebras," or "unicorns" as the next word. If you set the temperature to its maximum, without capping Top K or Top P, you increase the probability of getting unusual results such as "unicorns." If you set the temperature to 0, you increase the probability of "horses." If you set a high temperature and reduce the value of Top K or Top P, you increase the probability of "horses" or "zebras," and decrease the probability of "unicorns." The sketch after this list illustrates how all three parameters reshape a toy next-token distribution.
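The following sketch uses an invented four-word vocabulary and made-up logits purely for illustration; real models operate over vocabularies of tens of thousands of tokens, but the mechanics are the same.

```python
import numpy as np

vocab = ["horses", "zebras", "unicorns", "rain"]
logits = np.array([4.0, 2.5, 0.5, 0.1])  # invented raw scores for illustration

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature scales the logits: lower values steepen the distribution
    # toward the most probable token, higher values flatten it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]        # token indices, most probable first
    keep = np.ones_like(probs, dtype=bool)

    if top_k is not None:
        keep[order[top_k:]] = False        # drop all but the K most probable

    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        cutoff = np.searchsorted(cumulative, top_p) + 1
        keep[order[cutoff:]] = False

    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()             # renormalize the surviving tokens

for t in (0.2, 1.0, 2.0):
    dist = next_token_distribution(logits, temperature=t)
    print(f"temperature={t}", dict(zip(vocab, dist.round(3))))
print("top_k=2  ", dict(zip(vocab, next_token_distribution(logits, top_k=2).round(3))))
print("top_p=0.9", dict(zip(vocab, next_token_distribution(logits, top_p=0.9).round(3))))
```

At temperature 0.2, nearly all of the probability mass sits on "horses"; at 2.0 the distribution flattens toward the unusual words. Top K and Top P instead zero out the tail of the distribution before renormalizing.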
Length
Foundation models typically support the following parameters to control the length of the generated response.
- Response length – Configures the minimum and maximum number of tokens to use in the generated response (see the request-body sketch after this list).
- Length penalty – Optimizes the model to be more concise in its output by penalizing longer responses.

  In technical terms, the length penalty penalizes the model exponentially for lengthy responses. A value of 0.0 means no penalty. Set a value less than 0.0 to generate longer sequences, or a value greater than 0.0 to generate shorter sequences.
- Stop sequences – A stop sequence is a sequence of characters; if the model encounters one, it stops generating further tokens. Models differ in which characters a stop sequence can contain, in the maximum sequence length, and in whether multiple stop sequences can be defined.
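As a sketch of how these length controls appear in a request, the following body caps the number of generated tokens and defines a stop sequence. Parameter names differ by provider; this example assumes an Amazon Titan-style textGenerationConfig, and the prompt and values are placeholders.

```python
import json

# Hypothetical request body illustrating length controls. Field names follow
# the Amazon Titan text schema; other providers use different names, and some
# expose additional controls such as a length penalty.
body = json.dumps({
    "inputText": "Summarize the water cycle in two sentences.",
    "textGenerationConfig": {
        "maxTokenCount": 256,        # upper bound on generated tokens
        "stopSequences": ["User:"],  # generation halts at this sequence
        "temperature": 0.7,
        "topP": 0.9,
    },
})
```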
Repetitions
Foundation models typically support the following parameters to help control repetition in the generated response.
- Repetition penalty (presence penalty) – Reduces repetition of the same words (tokens) in responses. A value of 1.0 means no penalty; values greater than 1.0 decrease repetition.
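As an illustration of how such a penalty can work at decode time, the following sketch rescales the logits of tokens that already appear in the output, in the style of the CTRL paper's repetition penalty. Actual models may implement the penalty differently; the function name and values here are invented for illustration.

```python
import numpy as np

def apply_repetition_penalty(logits, generated_token_ids, penalty=1.2):
    # Hypothetical helper: lower the scores of tokens already generated so
    # they are less likely to be sampled again. penalty=1.0 leaves logits
    # unchanged; larger values suppress repeats more strongly.
    logits = logits.copy()
    for token_id in set(generated_token_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty   # shrink a positive logit
        else:
            logits[token_id] *= penalty   # push a negative logit lower
    return logits

logits = np.array([3.0, 1.0, -0.5, 0.2])
print(apply_repetition_penalty(logits, generated_token_ids=[0, 2]))
# token 0 drops from 3.0 to 2.5; token 2 drops from -0.5 to -0.6
```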