Prepare the datasets - Amazon Bedrock

Prepare the datasets

Before you can begin a model customization job, you need to minimally prepare a training dataset. Whether a validation dataset is supported and the format of your training and validation dataset depend on the following factors.

  • The type of customization job (fine-tuning or Continued Pre-training).

  • The input and output modalities of the data.

To see dataset and file requirements for different models, see Model customization quotas.

Select the tab that is relevant to your use-case.

Fine-tuning: Text-to-text

To fine-tune a text-to-text model, prepare a training and optional validation dataset by creating a JSONL file with multiple JSON lines. Each JSON line is a sample containing both a prompt and completion field. Use 6 characters per token as an approximation for the number of tokens. The format is as follows.

{"prompt": "<prompt1>", "completion": "<expected generated text>"} {"prompt": "<prompt2>", "completion": "<expected generated text>"} {"prompt": "<prompt3>", "completion": "<expected generated text>"}

The following is an example item for a question-answer task:

{"prompt": "what is AWS", "completion": "it's Amazon Web Services"}
Fine-tuning: Text-to-image & Image-to-embeddings

To fine-tune a text-to-image or image-to-embedding model, prepare a training dataset by create a JSONL file with multiple JSON lines. Validation datasets are not supported. Each JSON line is a sample containing an image-ref, the Amazon S3 URI for an image, and a caption that could be a prompt for the image.

The images must be in JPEG or PNG format.

{"image-ref": "s3://bucket/path/to/image001.png", "caption": "<prompt text>"} {"image-ref": "s3://bucket/path/to/image002.png", "caption": "<prompt text>"} {"image-ref": "s3://bucket/path/to/image003.png", "caption": "<prompt text>"}

The following is an example item:

{"image-ref": "s3://my-bucket/my-pets/cat.png", "caption": "an orange cat with white spots"}

To allow Amazon Bedrock access to the image files, add an IAM policy similar to the one in Permissions to access training and validation files and to write output files in S3 to the Amazon Bedrock model customization service role that you set up or that was automatically set up for you in the console. The Amazon S3 paths you provide in the training dataset must be in folders that you specify in the policy.

Continued Pre-training: Text-to-text

To carry out Continued Pre-training on a text-to-text model, prepare a training and optional validation dataset by creating a JSONL file with multiple JSON lines. Because Continued Pre-training involves unlabeled data, each JSON line is a sample containing only an input field. Use 6 characters per token as an approximation for the number of tokens. The format is as follows.

{"input": "<input text>"} {"input": "<input text>"} {"input": "<input text>"}

The following is an example item that could be in the training data.

{"input": "AWS stands for Amazon Web Services"}