Fine-tuning Amazon Nova models using SageMaker training jobs
Fine-tuning Amazon Nova models with SageMaker training jobs offers a powerful way to customize foundation models for your specific use cases.
Prerequisites
Before you start a training job, note the following requirements.

- Amazon S3 buckets to store your input data and the output of training jobs. You can use one bucket for both or a separate bucket for each type of data. Make sure your buckets are in the same AWS Region where you create all the other resources for training. For more information, see Creating a general purpose bucket. (A minimal setup sketch follows this list.)
- An IAM role with permissions to run a training job. Make sure you attach an IAM policy with AmazonSageMakerFullAccess. For more information, see How to use SageMaker execution roles.
- Base Amazon Nova recipes. For more information, see Getting Amazon Nova recipes.
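If you prefer to create the S3 buckets from code rather than in the console, the following is a minimal sketch using boto3; the bucket names and Region are placeholders and must match your own setup.

```python
import boto3

region = "us-east-1"  # use the same AWS Region as your other training resources
s3 = boto3.client("s3", region_name=region)

# Placeholder bucket names; S3 bucket names must be globally unique.
for bucket in ("my-nova-ft-input-data", "my-nova-ft-output"):
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
```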
Data preparation
Preparing high-quality, properly formatted data is a critical first step in the fine-tuning process for large language models. Whether you're using supervised fine-tuning (SFT) or Direct Preference Optimization (DPO), with either full-rank or low-rank adaptation (LoRA) approaches, your data must adhere to specific format requirements to ensure successful model training. This section outlines the necessary data formats, validation methods, and best practices to help you prepare your datasets effectively for fine-tuning Amazon Nova models.
SFT
SFT data format requirements - For both full-rank SFT and LoRA SFT, data should follow the Converse format. For examples and constraints of this format, see Preparing data for fine-tuning Understanding models.

SFT data validation - To validate your dataset format before submission, we recommend using the validation script from the Amazon Bedrock samples repository. The script verifies that your JSONL files adhere to the required format specifications and identifies potential issues before you submit your fine-tuning job.
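If you want a quick structural check before running the official script, the following is a minimal sketch that assumes a Converse-style record layout with an optional `system` key and a required `messages` list; it is not a substitute for the validation script in the Amazon Bedrock samples repository.

```python
import json
import sys

EXPECTED_ROLES = {"user", "assistant"}

def validate_jsonl(path):
    """Best-effort structural check of a Converse-format JSONL file."""
    errors = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {line_no}: invalid JSON ({exc})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {line_no}: missing or empty 'messages' list")
                continue
            for i, msg in enumerate(messages):
                if msg.get("role") not in EXPECTED_ROLES:
                    errors.append(f"line {line_no}: message {i} has unexpected role {msg.get('role')!r}")
                if "content" not in msg and "candidates" not in msg:
                    errors.append(f"line {line_no}: message {i} has no 'content' or 'candidates'")
    return errors

if __name__ == "__main__":
    problems = validate_jsonl(sys.argv[1])
    for problem in problems:
        print(problem)
    print(f"{len(problems)} issue(s) found")
```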
DPO
DPO data format requirements - For both full-rank DPO and DPO with LoRA, data should follow the Converse format. The dataset must be in a similar format to SFT, except that the last turn must contain preference pairs.

DPO dataset other constraints - Other constraints on datasets are the same as for SFT. For more information, see Dataset constraints. A single JSONL file for training and a single JSONL file for validation are expected. The validation set is optional.

DPO dataset recommendations - Use a minimum of 1,000 preference pairs for effective training. High-quality preference data leads to more efficient training and better results.
Sample DPO data format
```
// The first N-1 turns use the same format as SFT; only the final assistant turn differs.
{
  "role": "assistant",
  "candidates": [
    {
      "content": [
        { "text": "..." }  // the content list can contain multiple "text" objects
      ],
      "preferenceLabel": "preferred"
    },
    {
      "content": [
        { "text": "..." }  // the content list can contain multiple "text" objects
      ],
      "preferenceLabel": "non-preferred"
    }
  ]
}
```
Sample DPO data format (multi-turn)
{ "system": [ { "text": "..." } ], "messages":[ { "role": "user", "content": [ { "text": "..." } ] }, { "role": "assistant", "content": [ { "text": "..." } ] }, { "role": "user", "content": [ { "text": "..." } ] }, { "role": "assistant", "candidates": [ { "content": [ { "text": "..." } ], "preferenceLabel": "preferred" }, { "content": [ { "text": "..." } ], "preferenceLabel": "non-preferred" } ] } ], }
Sample DPO data format (with images)
{ "system": [ { "text": "..." } ], "messages":[ { "role": "user", "content": [ { "text": "..." }, { "text": "..." }, { "image": { "format": "jpeg", "source": { "s3Location": { "uri": "s3://your-bucket/your-path/your-image.jpg", "bucketOwner": "your-aws-account-id" } } } } // "content" can have multiple "text" and "image" objects. // max image count is 10 ] }, { "role": "assistant", "content": [ { "text": "..." } ] }, { "role": "user", "content": [ { "text": "..." }, { "text": "..." }, { "image": { "format": "jpeg", "source": { "s3Location": { "uri": "s3://your-bucket/your-path/your-image.jpg", "bucketOwner": "your-aws-account-id" } } } } // "content" can have multiple "text" and "image" objects. // max image count is 10 ] }, { "role": "assistant", "candidates": [ { "content": [ { "text": "..." } ], "preferenceLabel": "preferred" }, { "content": [ { "text": "..." } ], "preferenceLabel": "non-preferred" } ] } ], }
Training jobs default to a 1-day time limit, though the estimates in the tables below assume a 5-day training duration for illustration purposes. As a best practice, we recommend increasing your training time limit to 28 days maximum to accommodate longer training workloads. To request a limit increase, see Requesting a quota increase.
SFT dataset limits
| Context length | Model | Method | Datasets | Description |
|---|---|---|---|---|
| 32k context length jobs | Amazon Nova Micro | Full rank and LoRA | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 100k records. |
| 32k context length jobs | Amazon Nova Lite | Full rank | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 100k records. |
| 32k context length jobs | Amazon Nova Lite | Full rank | Image and video | With all records at a 32k context length and 5 epochs, for example, you can have up to 50k records. |
| 32k context length jobs | Amazon Nova Lite | LoRA | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 100k records. |
| 32k context length jobs | Amazon Nova Lite | LoRA | Image and video | With all records at a 32k context length and 5 epochs, for example, you can have up to 90k records. |
| 32k context length jobs | Amazon Nova Pro | Full rank | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 40k records. |
| 32k context length jobs | Amazon Nova Pro | Full rank | Image and video | With all records at a 32k context length and 5 epochs, for example, you can have up to 30k records. |
| 32k context length jobs | Amazon Nova Pro | LoRA | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 40k records. |
| 32k context length jobs | Amazon Nova Pro | LoRA | Image and video | With all records at a 32k context length and 5 epochs, for example, you can have up to 35k records. |
| 64k context length jobs | Amazon Nova Micro | Full rank and LoRA | Text only | With all records at a 64k context length and 5 epochs, for example, you can have up to 50k records. |
| 64k context length jobs | Amazon Nova Lite | Full rank | Text only | With all records at a 64k context length and 5 epochs, for example, you can have up to 50k records. |
| 64k context length jobs | Amazon Nova Lite | Full rank | Image and video | With all records at a 64k context length and 5 epochs, for example, you can have up to 30k records. |
| 64k context length jobs | Amazon Nova Lite | LoRA | - | LoRA is not supported at 64k for Nova Lite. |
| 64k context length jobs | Amazon Nova Pro | Full rank and LoRA | Text only | With all records at a 64k context length and 5 epochs, for example, you can have up to 17k records. |
| 64k context length jobs | Amazon Nova Pro | Full rank and LoRA | Image and video | With all records at a 64k context length and 5 epochs, for example, you can have up to 15k records. |
DPO dataset limits
| Context length | Model | Method | Datasets | Description |
|---|---|---|---|---|
| 16k context length jobs | Amazon Nova Micro | Full rank | Text only | With all records at a 16k context length and 5 epochs, for example, you can have up to 120k records. |
| 16k context length jobs | Amazon Nova Micro | LoRA | Text only | With all records at a 16k context length and 5 epochs, for example, you can have up to 125k records. |
| 16k context length jobs | Amazon Nova Lite | Full rank | Text only | With all records at a 16k context length and 5 epochs, for example, you can have up to 130k records. |
| 16k context length jobs | Amazon Nova Lite | Full rank | Image | With all records at a 16k context length and 5 epochs, for example, you can complete 20k samples within 2 days. |
| 16k context length jobs | Amazon Nova Lite | LoRA | Text only | With all records at a 16k context length and 5 epochs, for example, you can have up to 140k records. |
| 16k context length jobs | Amazon Nova Lite | LoRA | Image | With all records at a 16k context length and 5 epochs, for example, you can complete 20k samples within 2 days. |
| 16k context length jobs | Amazon Nova Pro | Full rank | Text only | With all records at a 16k context length and 5 epochs, for example, you can have up to 45k records. |
| 16k context length jobs | Amazon Nova Pro | Full rank | Image | With all records at a 16k context length and 5 epochs, for example, you can complete 20k samples within 4 days. |
| 16k context length jobs | Amazon Nova Pro | LoRA | Text only | With all records at a 16k context length and 5 epochs, for example, you can have up to 55k records. |
| 16k context length jobs | Amazon Nova Pro | LoRA | Image | With all records at a 16k context length and 5 epochs, for example, you can complete 20k samples within 4 days. |
| 32k context length jobs | Amazon Nova Micro | Full rank | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 45k records. |
| 32k context length jobs | Amazon Nova Micro | LoRA | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 50k records. |
| 32k context length jobs | Amazon Nova Lite | Full rank | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 55k records. |
| 32k context length jobs | Amazon Nova Lite | Full rank | Image | With all records at a 32k context length and 5 epochs, for example, you can have up to 35k records. |
| 32k context length jobs | Amazon Nova Lite | LoRA | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 60k records. |
| 32k context length jobs | Amazon Nova Lite | LoRA | Image | With all records at a 32k context length and 5 epochs, for example, you can have up to 35k records. |
| 32k context length jobs | Amazon Nova Pro | Full rank | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 20k records. |
| 32k context length jobs | Amazon Nova Pro | Full rank | Image | With all records at a 32k context length and 5 epochs, for example, you can have up to 16k records. |
| 32k context length jobs | Amazon Nova Pro | LoRA | Text only | With all records at a 32k context length and 5 epochs, for example, you can have up to 22k records. |
| 32k context length jobs | Amazon Nova Pro | LoRA | Image | With all records at a 32k context length and 5 epochs, for example, you can have up to 18k records. |
By reducing the number of epochs or the context length of your records, you can include more records in your dataset.
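For rough planning, the limits above scale approximately inversely with context length and epoch count (for example, Amazon Nova Micro SFT allows about 100k records at 32k context and about 50k at 64k). The helper below encodes that assumption; it is only a planning heuristic, not an official limit.

```python
def approx_record_budget(base_records, base_context, base_epochs, context, epochs):
    """Scale a known record limit to a different context length and epoch count,
    assuming the budget is roughly inversely proportional to both."""
    return int(base_records * (base_context / context) * (base_epochs / epochs))

# Nova Micro SFT allows up to ~100k records at 32k context and 5 epochs.
# At 16k context and 2 epochs, the same budget suggests roughly:
print(approx_record_budget(100_000, 32_000, 5, 16_000, 2))  # 500000
```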
Full-rank SFT and LoRA PEFT configurations
This section covers guidance on recipe configurations for both full-rank supervised fine-tuning (SFT) and low-rank adaptation parameter-efficient fine-tuning (LoRA PEFT) approaches. These recipe files serve as the blueprint for your model customization jobs, allowing you to specify training parameters, hyperparameters, and other critical settings that determine how your model learns from your data. To adjust the hyperparameters, follow the guidelines in Selecting hyperparameters.
Fine-tuning configurations (Full-rank SFT and LoRA PEFT)
The only difference between full-rank SFT and LoRA PEFT in terms of the recipe is the LoRA PEFT configuration, which is set to "null" for full-rank fine-tuning and set to appropriate values for LoRA PEFT-based fine-tuning. Example recipes are available in the SageMaker HyperPod recipes repository.

About the "run" configuration.
| Key | Definition | Micro | Lite | Pro |
|---|---|---|---|---|
| model_type | Specifies the Nova model variant to use. Do not modify this field. | "amazon.nova-micro-v1:0:128k" | "amazon.nova-lite-v1:0:300k" | "amazon.nova-pro-v1:0:300k" |
| model_name_or_path | The path to the base model. | "nova-micro/prod" | "nova-lite/prod" | "nova-pro/prod" |
| replicas | The number of compute instances to use for distributed training. | 2, 4, or 8 | 4, 8, or 16 | 6, 12, or 24 |
About the "training_config" configuration.

| Configuration | Root key | Child keys | Definition | Min | Max |
|---|---|---|---|---|---|
|  | max_length |  | The maximum sequence length in tokens. This determines the context window size for training. Tunable to the nearest multiple of 1024. | 1024 | 65536 (32768 for Lite LoRA) |
|  | global_batch_size |  | Total samples per step. Allowed values are 16, 32, and 64. | 16 | 32 for Nova Pro; 64 for Nova Lite and Micro |
| Trainer configuration | trainer | max_epochs | The number of complete passes through your training dataset. For most customization tasks, 1-5 epochs are typically sufficient. We recommend keeping this at 5 or below. | 1 | - |
| Model configuration | model | hidden_dropout | Probability of dropping hidden state outputs. Increase (0.0-0.2) to reduce overfitting on smaller datasets. | 0 | 1 |
| Model configuration | model | attention_dropout | Probability of dropping attention weights. Can help with generalization. | 0 | 1 |
| Model configuration | model | ffn_dropout | Probability of dropping feed-forward network outputs. | 0 | 1 |
| Optimizer configuration | model.optim | lr | Learning rate; controls the step size during optimization. Typically set between 1e-6 and 1e-4 for good performance. | 0 | 1 |
| Optimizer configuration | model.optim | name | Optimizer algorithm. Currently, only one optimizer is supported; use the value from the example recipes. | - | - |
| Optimizer configuration | model.optim | adam_w_mode | Enable AdamW mode (true or false). | - | - |
| Optimizer configuration | model.optim | eps | Epsilon for numerical stability. | 1.00E-10 | 1.00E-06 |
| Optimizer configuration | model.optim | weight_decay | L2 regularization strength. | 0 | 1 |
| Optimizer configuration | model.optim | betas | Adam optimizer betas. | 0 | 1 |
| Optimizer configuration | model.optim | sched_warmup_steps | Number of steps over which the learning rate is gradually increased. This improves training stability. | 1 | 20 |
| Optimizer configuration | model.optim | sched_constant_steps | Steps at a constant learning rate. | - | - |
| Optimizer configuration | model.optim | sched.min_lr | Minimum learning rate at the end of decay. Must be less than lr. | 0 | 1 |
| LoRA PEFT configuration | model.peft | peft_scheme | Use "lora" or "null". "lora" uses the LoRA PEFT method for parameter-efficient fine-tuning; "null" starts a full-rank fine-tuning job. | - | - |
| LoRA PEFT configuration | model.peft | lora_tuning.loraplus_lr_ratio | LoRA+ learning rate scaling factor. | 0 | 100 |
| LoRA PEFT configuration | model.peft | lora_tuning.alpha | Scaling factor for LoRA weights. Allowed values are 32, 64, 96, 128, 160, and 192. | 32 | 192 |
| LoRA PEFT configuration | model.peft | lora_tuning.adapter_dropout | Regularization for LoRA parameters. | 0 | 1 |
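To see how these keys fit together, the following is a minimal sketch that assembles a LoRA PEFT recipe as a Python dictionary and writes it to a YAML file. The exact key layout and defaults should come from the example recipes in the SageMaker HyperPod recipes repository; the values shown here are placeholders, and the nesting is inferred from the root keys in the table above.

```python
import yaml  # requires PyYAML (pip install pyyaml)

# A minimal LoRA PEFT recipe sketch based on the keys described above.
# Set "peft_scheme" to "null" to run full-rank SFT instead.
recipe = {
    "run": {
        "model_type": "amazon.nova-micro-v1:0:128k",
        "model_name_or_path": "nova-micro/prod",
        "replicas": 2,
    },
    "training_config": {
        "max_length": 8192,          # rounded to a multiple of 1024
        "global_batch_size": 16,
        "trainer": {"max_epochs": 2},
        "model": {
            "hidden_dropout": 0.0,
            "attention_dropout": 0.0,
            "ffn_dropout": 0.0,
            "optim": {
                "lr": 1e-5,          # typical range: 1e-6 to 1e-4
                "adam_w_mode": True,
                "weight_decay": 0.0,
            },
            "peft": {
                "peft_scheme": "lora",
                "lora_tuning": {
                    "alpha": 32,
                    "adapter_dropout": 0.01,
                },
            },
        },
    },
}

with open("nova_micro_lora_sft.yaml", "w") as f:
    yaml.safe_dump(recipe, f, sort_keys=False)
```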
Fine-tuning specific configurations (DPO)
The only difference between a Direct Preference Optimization (DPO) recipe and a LoRA PEFT or full-rank SFT recipe is the dpo_cfg configuration and its allowed values. Refer to the following table for the values allowed specifically for DPO. Example recipes are available in the SageMaker HyperPod recipes repository.
| Configuration | Root key | Child keys | Definition | Min | Max |
|---|---|---|---|---|---|
|  | max_length |  | The maximum sequence length in tokens. This determines the context window size for training. Tunable to the nearest multiple of 1024. | 1024 | 32768 |
|  | global_batch_size |  | Global batch size. Allowed values are 16, 32, 64, 128, and 256. | 16 | 256 |
| Trainer configuration | trainer | max_epochs | The number of complete passes through your training dataset. For most customization tasks, 1-5 epochs are typically sufficient. | 1 | 5 |
| Model configuration | model | hidden_dropout | Probability of dropping hidden state outputs. Increase (0.0-0.2) to reduce overfitting on smaller datasets. | 0 | 1 |
| Model configuration | model | attention_dropout | Probability of dropping attention weights. Can help with generalization. | 0 | 1 |
| Model configuration | model | ffn_dropout | Probability of dropping feed-forward network outputs. | 0 | 1 |
| Optimizer configuration | model.optim | lr | Learning rate; controls the step size during optimization. Typically set between 1e-6 and 1e-4 for good performance. | 0 | 1 |
| Optimizer configuration | model.optim | name | Optimizer algorithm. Currently, only one optimizer is supported; use the value from the example recipes. | - | - |
| Optimizer configuration | model.optim | adam_w_mode | Enable AdamW mode (true or false). | - | - |
| Optimizer configuration | model.optim | eps | Epsilon for numerical stability. | 1.00E-10 | 1.00E-06 |
| Optimizer configuration | model.optim | weight_decay | L2 regularization strength. | 0 | 1 |
| Optimizer configuration | model.optim | betas | Adam optimizer betas. | 0 | 1 |
| Optimizer configuration | model.optim | sched_warmup_steps | Number of steps over which the learning rate is gradually increased. This improves training stability. | 1 | 20 |
| Optimizer configuration | model.optim | sched_constant_steps | Steps at a constant learning rate. | - | - |
| Optimizer configuration | model.optim | sched.min_lr | Minimum learning rate at the end of decay. Must be less than lr. | 0 | 1 |
| LoRA PEFT configuration | model.peft | peft_scheme | Use "lora" or "null". "lora" uses the LoRA PEFT method for parameter-efficient fine-tuning; "null" starts a full-rank fine-tuning job. | - | - |
| LoRA PEFT configuration | model.peft | lora_tuning.loraplus_lr_ratio | LoRA+ learning rate scaling factor. | 0 | 100 |
| LoRA PEFT configuration | model.peft | lora_tuning.alpha | Scaling factor for LoRA weights. Allowed values are 32, 64, 96, 128, 160, and 192. | 32 | 192 |
| LoRA PEFT configuration | model.peft | lora_tuning.adapter_dropout | Regularization for LoRA parameters. | 0 | 1 |
| DPO configuration | model.dpo_cfg | beta | Strength of preference enforcement. | 0.001 | 0.1 |
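As a rough follow-on to the earlier recipe sketch, a DPO recipe mainly needs max_length capped at the DPO limit and the dpo_cfg block added. The file names and the exact placement of dpo_cfg are assumptions; defer to the published DPO example recipes for the authoritative layout.

```python
import yaml

# Start from the hypothetical SFT recipe written in the earlier sketch.
with open("nova_micro_lora_sft.yaml") as f:
    recipe = yaml.safe_load(f)

# Cap the context window at the DPO maximum and add the DPO block.
recipe["training_config"]["max_length"] = min(
    recipe["training_config"].get("max_length", 32768), 32768
)
recipe["training_config"]["model"]["dpo_cfg"] = {"beta": 0.01}  # assumed placement

with open("nova_micro_lora_dpo.yaml", "w") as f:
    yaml.safe_dump(recipe, f, sort_keys=False)
```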
Running customized Nova model on SageMaker training jobs
This section demonstrates how to run a customized Nova model on SageMaker training jobs through a Jupyter notebook environment. You'll find a complete example that walks through the process of configuring and launching a training job, along with reference tables for selecting the appropriate container image URIs and instance configurations. This approach gives you programmatic control over your fine-tuning workflows while leveraging SageMaker's managed infrastructure for model customization. For more information, see Use a SageMaker AI estimator to run a training job.
Reference tables
Before running the sample notebook, refer to the following tables for selecting the appropriate container image URIs and instance configurations.
Selecting image URI
| Recipe | Image URI |
|---|---|
| SFT image URI | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-SFT-latest |
| DPO image URI | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-DPO-latest |
Selecting instance type and count
| Model | Job type | Instance type | Recommended instance count | Allowed instance count |
|---|---|---|---|---|
| Amazon Nova Micro | Fine-tuning (SFT, LoRA) | g5.12xlarge | 1 | 1 |
| Amazon Nova Micro | Fine-tuning (SFT, DPO) | p5.48xlarge | 2 | 2, 4, 8 |
| Amazon Nova Lite | Fine-tuning (SFT, DPO) | p5.48xlarge | 4 | 4, 8, 16 |
| Amazon Nova Pro | Fine-tuning (SFT, DPO) | p5.48xlarge | 6 | 6, 12, 24 |
Sample notebook
The following sample notebook demonstrates how to run a training job. For additional getting started notebooks on how to customize Nova models using SageMaker AI training jobs, see Use a SageMaker AI estimator to run a training job.
```python
# 1. Install dependencies
!pip install sagemaker

# 2. Import dependencies and initialize the SageMaker session
import sagemaker, boto3

sm = boto3.client('sagemaker', region_name='us-east-1')
sagemaker_session = sagemaker.session.Session(
    boto_session=boto3.session.Session(),
    sagemaker_client=sm
)

# 3. Configure your job
# Define the core configuration for launching a SageMaker training job. This includes
# input/output S3 URIs, container image, hardware setup, and other runtime parameters.
# Update the placeholders below before submitting the job.
job_name = "<Your Job Name>"
input_s3_uri = "<S3 path to input data>"
validation_s3_uri = "<S3 path to validation data>"  # optional, leave blank if no validation data
output_s3_uri = "<S3 path to output location>"
image_uri = "<Image URI from documentation>"  # choose the SFT or DPO image
instance_type = "ml.p5.48xlarge"  # do not change
instance_count = <Integer number of hosts>  # change hosts as needed; refer to the documentation for allowed values based on model type
role_arn = "<IAM Role you want to use to run the job>"
recipe_path = "<Local path to the recipe file>"

# 4. Launch the SageMaker training job
# This block sets up and runs the training job using the PyTorch estimator. It configures
# the training image, hardware, input channels, and TensorBoard integration. Validation
# data is included if provided.
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=output_s3_uri,
)

estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role_arn,
    instance_count=instance_count,
    instance_type=instance_type,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    tensorboard_output_config=tensorboard_output_config,  # enable TensorBoard output
    disable_profiler=True,
    debugger_hook_config=False
)

trainingInput = TrainingInput(
    s3_data=input_s3_uri,
    distribution='FullyReplicated',
    s3_data_type='Converse'
)

if validation_s3_uri:
    validationInput = TrainingInput(
        s3_data=validation_s3_uri,
        distribution='FullyReplicated',
        s3_data_type='Converse'
    )
    # Inputs must be called "train" and "validation"; do not change these names.
    estimator.fit(inputs={"train": trainingInput, "validation": validationInput})
else:
    estimator.fit(inputs={"train": trainingInput})

# 5. Monitor your training job
# To monitor your training job, you can either:
#   * Go to the AWS console -> Amazon SageMaker AI -> Training -> Training jobs
#   * Run the following command:
#     sm.describe_training_job(TrainingJobName="<complete training job name>")
```
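If you prefer to poll the training job status from code instead of the console, you can loop over describe_training_job with the SageMaker client created in step 2 of the notebook; the job name below is a placeholder.

```python
import time

job = "<complete training job name>"  # placeholder
while True:
    status = sm.describe_training_job(TrainingJobName=job)["TrainingJobStatus"]
    print(status)
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)  # check once a minute
```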
Hyperparameter optimization guidance
Fine-tuning your Nova LLM model effectively requires careful selection of hyperparameters. While this section explains the basic recipe structure and components, optimizing hyperparameters for your specific use case often requires additional guidance. For comprehensive recommendations on hyperparameter selection, best practices, and optimization strategies, see Selecting hyperparameters. This resource provides detailed guidance on selecting appropriate learning rates, batch sizes, training epochs, and other critical parameters based on your dataset characteristics and training objectives. We recommend consulting this guide when fine-tuning your recipe configuration to achieve optimal model performance.
For details about minimum, maximum, and default values for epochs, learning rate, and learning warmup steps, see Hyperparameters for Understanding models.
Common recipe modifications
Here are some common recipe adjustments based on specific use cases:
- For smaller datasets (fewer than 1,000 examples):

  ```yaml
  training_config:
    max_epochs: 2            # More passes through a smaller dataset
    model:
      hidden_dropout: 0.1    # Increase regularization
      optim:
        weight_decay: 0.01   # Increase regularization
  ```

- For efficiency with limited compute:

  ```yaml
  peft:
    peft_scheme: "lora"
    lora_tuning:
      # Set the LoRA parameters (for example, alpha and adapter_dropout)
      # as described in the configuration tables above.
  ```

- For complex instruction tuning:

  ```yaml
  optim:
    lr: 5e-6                 # Lower learning rate for more stable learning
    sched:
      warmup_steps: 100      # Longer warmup for stability
  ```