Prepare the datasets
Before you can begin a model customization job, you must, at minimum, prepare a training dataset. Whether a validation dataset is supported, and the required format of your training and validation datasets, depend on the following factors.
To see dataset and file requirements for different models, see Model customization quotas.
Select the tab that is relevant to your use case.
- Fine-tuning: Text-to-text

To fine-tune a text-to-text model, prepare a training and optional validation dataset by creating a JSONL file with multiple JSON lines. Each JSON line is a sample containing both a prompt field and a completion field. Use 6 characters per token as an approximation for the number of tokens. The format is as follows.
{"prompt": "<prompt1>", "completion": "<expected generated text>"}
{"prompt": "<prompt2>", "completion": "<expected generated text>"}
{"prompt": "<prompt3>", "completion": "<expected generated text>"}
The following is an example item for a question-answer task:
{"prompt": "what is AWS", "completion": "it's Amazon Web Services"}
- Fine-tuning: Text-to-image & Image-to-embeddings

To fine-tune a text-to-image or image-to-embedding model, prepare a training dataset by creating a JSONL file with multiple JSON lines. Validation datasets are not supported. Each JSON line is a sample containing an image-ref field, the Amazon S3 URI for an image, and a caption field that could serve as a prompt for the image. The images must be in JPEG or PNG format.
{"image-ref": "s3://bucket/path/to/image001.png", "caption": "<prompt text>"}
{"image-ref": "s3://bucket/path/to/image002.png", "caption": "<prompt text>"}
{"image-ref": "s3://bucket/path/to/image003.png", "caption": "<prompt text>"}
The following is an example item:
{"image-ref": "s3://my-bucket/my-pets/cat.png", "caption": "an orange cat with white spots"}
To allow Amazon Bedrock access to the image files, add an IAM policy similar to the one in Permissions to access training and validation files and to write output files in S3 to the Amazon Bedrock model customization service role that you set up or that was automatically set up for you in the console. The Amazon S3 paths you provide in the training dataset must be in folders that you specify in the policy.
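Before submitting a job, it can help to check each line of the dataset against the requirements stated above (an Amazon S3 URI in image-ref, a JPEG or PNG image, a non-empty caption). The following is a sketch of such a check; the function name and error messages are my own, not part of any Amazon Bedrock API.

```python
import json

# JPEG or PNG, per the image format requirement above.
ALLOWED_EXTENSIONS = (".jpeg", ".jpg", ".png")

def validate_line(line):
    # Return a list of problems found in one JSONL record; empty means valid.
    problems = []
    record = json.loads(line)
    ref = record.get("image-ref", "")
    if not ref.startswith("s3://"):
        problems.append("image-ref must be an Amazon S3 URI")
    if not ref.lower().endswith(ALLOWED_EXTENSIONS):
        problems.append("image must be in JPEG or PNG format")
    if not record.get("caption"):
        problems.append("caption is missing or empty")
    return problems

line = '{"image-ref": "s3://my-bucket/my-pets/cat.png", "caption": "an orange cat with white spots"}'
print(validate_line(line))  # → []
```

Note that this only checks the shape of each record; it does not confirm that the S3 objects exist or that the service role's IAM policy covers their paths.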
- Continued Pre-training: Text-to-text

To carry out Continued Pre-training on a text-to-text model, prepare a training and optional validation dataset by creating a JSONL file with multiple JSON lines. Because Continued Pre-training involves unlabeled data, each JSON line is a sample containing only an input field. Use 6 characters per token as an approximation for the number of tokens. The format is as follows.
{"input": "<input text>"}
{"input": "<input text>"}
{"input": "<input text>"}
The following is an example item that could be in the training data.
{"input": "AWS stands for Amazon Web Services"}