Meta Llama models

This section provides inference parameters and a code example for using the following models from Meta.

  • Llama 2

  • Llama 2 Chat

  • Llama 3 Instruct

You make inference requests to Meta Llama models with InvokeModel or InvokeModelWithResponseStream (streaming). You need the model ID for the model that you want to use. To get the model ID, see Amazon Bedrock model IDs.
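If you prefer to look up the Meta model IDs programmatically, the following is a minimal sketch that uses the Amazon Bedrock control-plane client's ListFoundationModels operation filtered by provider. It is an illustrative addition rather than part of this section's example; verify the provider filter value "Meta" against the summaries that the call returns.

import boto3

# Use the Amazon Bedrock control-plane client (service name 'bedrock'),
# not the 'bedrock-runtime' client that is used for inference requests.
bedrock = boto3.client(service_name="bedrock")

# Filter the foundation model listing by provider and print each model ID.
response = bedrock.list_foundation_models(byProvider="Meta")
for summary in response["modelSummaries"]:
    print(summary["modelId"])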

Request and response

The request body is passed in the body field of a request to InvokeModel or InvokeModelWithResponseStream.

Request

Llama 2 Chat, Llama 2, and Llama 3 Instruct models have the following inference parameters.

{ "prompt": string, "temperature": float, "top_p": float, "max_gen_len": int }

The following are required parameters.

  • prompt – (Required) The prompt that you want to pass to the model. With Llama 2 Chat, format the conversation with the following template.

    <s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]

    The instructions between the <<SYS>> tokens provide a system prompt for the model. The following is an example prompt that includes a system prompt.

    <s>[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>> There's a llama in my garden What should I do? [/INST]

    For more information about the Llama 2 Chat prompt format, see the Meta Llama documentation. A helper that assembles this template is sketched below.
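Because the template is plain text, you can assemble it with ordinary string formatting. The following is a minimal sketch; the helper name build_llama2_chat_prompt is illustrative and not part of the Amazon Bedrock API.

def build_llama2_chat_prompt(system_prompt, user_message):
    """Format a single-turn Llama 2 Chat prompt using the template shown above."""
    return f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {user_message} [/INST]"

prompt = build_llama2_chat_prompt(
    "You are a helpful, respectful and honest assistant.",
    "There's a llama in my garden What should I do?",
)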

The following are optional parameters. A complete request body that uses them is sketched after the list.

  • temperature – Use a lower value to decrease randomness in the response.

    Default    Minimum    Maximum
    0.5        0          1

  • top_p – Use a lower value to ignore less probable options. Set to 0 or 1.0 to disable.

    Default    Minimum    Maximum
    0.9        0          1

  • max_gen_len – Specify the maximum number of tokens to use in the generated response. The model truncates the response once the generated text exceeds max_gen_len.

    Default    Minimum    Maximum
    512        1          2048
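Putting these parameters together, a complete request body might look like the following sketch. Only prompt is required; the other values shown here are the documented defaults, and the short system prompt is illustrative.

import json

# Serialize the request body; only "prompt" is required.
body = json.dumps({
    "prompt": "<s>[INST] <<SYS>> You are a helpful assistant. <</SYS>> "
              "There's a llama in my garden What should I do? [/INST]",
    "temperature": 0.5,   # default
    "top_p": 0.9,         # default
    "max_gen_len": 512    # default
})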

Response

Llama 2 Chat, Llama 2, and Llama 3 Instruct models return the following fields for a text completion inference call.

{ "generation": "\n\n<response>", "prompt_token_count": int, "generation_token_count": int, "stop_reason" : string }

More information about each field is provided below.

  • generation – The generated text.

  • prompt_token_count – The number of tokens in the prompt.

  • generation_token_count – The number of tokens in the generated text.

  • stop_reason – The reason why the response stopped generating text. Possible values are:

    • stop – The model has finished generating text for the input prompt.

    • length – The generated text reached the value of max_gen_len in the call to InvokeModel (or InvokeModelWithResponseStream, if you are streaming output). The response is truncated to max_gen_len tokens. Consider increasing the value of max_gen_len and trying again, as in the sketch after this list.
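If a truncated response is not acceptable for your use case, you can check stop_reason and retry with a larger max_gen_len, up to the documented maximum of 2048 tokens. The following is a minimal sketch of that approach; the doubling retry policy is an illustrative choice, not something the API requires.

import json
import boto3

bedrock = boto3.client(service_name="bedrock-runtime")

def invoke_with_retry_on_length(model_id, prompt, max_gen_len=512):
    """Invoke the model, doubling max_gen_len (up to 2048) while the response is truncated."""
    while True:
        body = json.dumps({"prompt": prompt, "max_gen_len": max_gen_len})
        response = bedrock.invoke_model(body=body, modelId=model_id)
        response_body = json.loads(response["body"].read())
        if response_body["stop_reason"] != "length" or max_gen_len >= 2048:
            return response_body
        max_gen_len = min(max_gen_len * 2, 2048)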

Example code

This example shows how to call the Meta Llama 2 Chat 13B model.

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
"""
Shows how to generate text with Meta Llama 2 Chat (on demand).
"""
import json
import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")


def generate_text(model_id, body):
    """
    Generate text using Meta Llama 2 Chat on demand.
    Args:
        model_id (str): The model ID to use.
        body (str): The request body to use.
    Returns:
        response (JSON): The text that the model generated, token information,
        and the reason the model stopped generating text.
    """
    logger.info("Generating text with Meta Llama 2 Chat model %s", model_id)

    bedrock = boto3.client(service_name='bedrock-runtime')

    response = bedrock.invoke_model(body=body, modelId=model_id)

    response_body = json.loads(response.get('body').read())

    return response_body


def main():
    """
    Entrypoint for Meta Llama 2 Chat example.
    """
    model_id = "meta.llama2-13b-chat-v1"

    prompt = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
There's a llama in my garden What should I do? [/INST]"""

    max_gen_len = 128
    temperature = 0.1
    top_p = 0.9

    # Create the request body.
    body = json.dumps({
        "prompt": prompt,
        "max_gen_len": max_gen_len,
        "temperature": temperature,
        "top_p": top_p
    })

    try:
        response = generate_text(model_id, body)

        print(f"Generated Text: {response['generation']}")
        print(f"Prompt Token count: {response['prompt_token_count']}")
        print(f"Generation Token count: {response['generation_token_count']}")
        print(f"Stop reason: {response['stop_reason']}")

    except ClientError as err:
        message = err.response["Error"]["Message"]
        logger.error("A client error occurred: %s", message)
        print("A client error occurred: " + format(message))

    else:
        print(f"Finished generating text with Meta Llama 2 Chat model {model_id}.")


if __name__ == "__main__":
    main()
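To stream the output instead, you can call InvokeModelWithResponseStream with the same request body and read the chunks as they arrive. The following is a minimal sketch rather than part of the original example; it assumes that each streamed chunk carries a partial generation field, so verify the chunk contents for the model that you use.

import json
import boto3

bedrock = boto3.client(service_name="bedrock-runtime")

body = json.dumps({
    "prompt": "<s>[INST] <<SYS>> You are a helpful assistant. <</SYS>> "
              "There's a llama in my garden What should I do? [/INST]",
    "max_gen_len": 128,
    "temperature": 0.1,
    "top_p": 0.9
})

response = bedrock.invoke_model_with_response_stream(
    body=body, modelId="meta.llama2-13b-chat-v1")

# Each event in the stream contains a JSON chunk; print the partial text as it arrives.
for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    print(chunk.get("generation", ""), end="")
print()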