SageMaker training jobs pre-training tutorial (GPU)


This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with GPU instances.

  • Set up your environment

  • Launch a training job using SageMaker HyperPod recipes

Before you begin, make sure you have the following prerequisites.

Prerequisites

Before you start setting up your environment, make sure you have:

  • An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts. If you need to create an S3 bucket, see the example commands after this list.

  • Requested a Service Quota for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker AI. To request a service quota increase, do the following:

    1. In the AWS Service Quotas console, navigate to AWS services.

    2. Choose Amazon SageMaker AI.

    3. Choose one ml.p4d.24xlarge and one ml.p5.48xlarge instance.

  • Created an AWS Identity and Access Management (IAM) role with the following managed policies to give SageMaker AI permissions to run the examples.

    • AmazonSageMakerFullAccess

    • AmazonEC2FullAccess

  • Data in one of the following formats:

    • JSON

    • JSONGZ (Compressed JSON)

    • ARROW

  • (Optional) If you're using model weights from HuggingFace for pre-training or fine-tuning, you must get a HuggingFace token. For more information about getting the token, see User access tokens.
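
If you don't already have an Amazon S3 bucket for the training data and artifacts, you can create one with the AWS CLI. The following is a minimal sketch; the bucket name, Region, and local paths are placeholders that you should replace with your own values.

# Create an S3 bucket for training data and output artifacts (name and Region are examples).
aws s3 mb s3://my-training-bucket --region us-west-2

# Upload training and validation data to channel-style prefixes (local paths are examples).
aws s3 cp ./train s3://my-training-bucket/data/train --recursive
aws s3 cp ./val s3://my-training-bucket/data/val --recursive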

GPU SageMaker training jobs environment setup

Before you run a SageMaker training job, configure your AWS credentials and preferred region by running the aws configure command. As an alternative to the configure command, you can provide your credentials through environment variables such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. For more information, see SageMaker AI Python SDK.
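
For example, you can configure credentials and a default Region interactively with the AWS CLI, or export them as environment variables before launching the job. The values shown are placeholders.

# Option 1: configure credentials and a default Region interactively.
aws configure

# Option 2: export credentials as environment variables (placeholder values).
export AWS_ACCESS_KEY_ID=<your_access_key_id>
export AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
export AWS_SESSION_TOKEN=<your_session_token>
export AWS_DEFAULT_REGION=us-west-2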

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see SageMaker JupyterLab.

  • (Optional) Set up the virtual environment and dependencies. If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure you're using Python 3.9 or greater.

    # Set up a virtual environment.
    python3 -m venv ${PWD}/venv
    source venv/bin/activate

    # Clone the recipes repository and install dependencies.
    git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
    cd sagemaker-hyperpod-recipes
    pip3 install -r requirements.txt

    # Set the AWS Region.
    aws configure set region <your_region>
  • Install the SageMaker AI Python SDK.

    pip3 install --upgrade sagemaker
  • Container: The GPU container is set automatically by the SageMaker AI Python SDK. You can also provide your own container.

    Note

    If you're running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or greater.

    Append transformers==4.45.2 to requirements.txt in source_dir only when you're using the SageMaker AI Python SDK. For example, append it if you're launching the job from a notebook in SageMaker AI JupyterLab, as shown in the sketch after this note.

    If you're using HyperPod recipes to launch with cluster type sm_jobs, this is done automatically.
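
    For example, you can pin the version from a terminal before submitting the job. The path to source_dir below is a placeholder.

    # Append the required transformers version to the requirements file in your source_dir (path is an example).
    echo "transformers==4.45.2" >> <path-to-source_dir>/requirements.txt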

Launch the training job using a Jupyter Notebook

You can use the following Python code to run a SageMaker training job with your recipe. It leverages the PyTorch estimator from the SageMaker AI Python SDK to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI Training platform.

import os

import boto3
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, "tensorboard"),
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"],
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "s3 or fsx input", "val": "s3 or fsx input"}, wait=True)

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the fit() method. Use the training_recipe parameter to specify the recipe you want to use for training.
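
If you need to adjust other recipe values, you can extend the same recipe_overrides dictionary before constructing the estimator. The keys shown below (for example, trainer.max_steps and run.name) are illustrative assumptions; the keys that are actually available depend on the recipe you chose, so check the recipe's YAML file.

# Hypothetical additional overrides; valid keys depend on the recipe you selected.
recipe_overrides["trainer"] = {"max_steps": 50}           # e.g., shorten the run for a smoke test
recipe_overrides["run"]["name"] = "llama3-8b-smoke-test"  # e.g., name the run

# Pass the updated dictionary through the same recipe_overrides parameter shown above.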

Note

If you're running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or greater.

Append transformers==4.45.2 to requirements.txt in source_dir only when you're using the SageMaker AI Python SDK directly. For example, you must append the version to the text file when you're using a Jupyter notebook.

When you deploy the endpoint for a SageMaker training job, you must specify the image URI that you're using. If you don't provide the image URI, the estimator uses the training image as the image for the deployment. The training images that SageMaker HyperPod provides don't contain the dependencies required for inference and deployment. The following is an example of how an inference image can be used for deployment:

from sagemaker import image_uris

container = image_uris.retrieve(
    framework="pytorch",
    region="us-west-2",
    version="2.0",
    py_version="py310",
    image_scope="inference",
    instance_type="ml.p4d.24xlarge",
)

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    image_uri=container,
)
Note

Running the preceding code on a SageMaker notebook instance might require more than the default 5 GB of storage that SageMaker AI JupyterLab provides. If you run into out-of-space issues, create a new notebook instance with increased storage and use that instance instead.
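
When you're done experimenting with the endpoint you deployed above, you can delete it so you stop incurring charges for the hosting instance. The following is a minimal sketch that uses the predictor object returned by deploy():

# Delete the endpoint (and, by default, its endpoint configuration) created by estimator.deploy().
predictor.delete_endpoint()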

Launch the training job with the recipes launcher

Update the ./recipes_collection/cluster/sm_jobs.yaml file to look like the following:

sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard # Path to logs on the container
  wait: True # Whether to wait for training job to finish
  inputs: # Inputs to call fit with. Set either s3 or file_system, not both.
    s3: # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs: # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    enable_remote_debug: True
  recipe_overrides:
    exp_manager:
      explicit_log_dir: /opt/ml/output/tensorboard
    data:
      train_dir: /opt/ml/input/data/train
    model:
      model_config: /opt/ml/input/data/train/config.json
    compiler_cache_url: "<compiler_cache_url>"

Update ./recipes_collection/config.yaml to specify sm_jobs in the cluster and cluster_type.

defaults:
  - _self_
  - cluster: sm_jobs # set to `slurm`, `k8s` or `sm_jobs`, depending on the desired cluster
  - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain

cluster_type: sm_jobs # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.

Launch the job with the following command:

python3 main.py --config-path recipes_collection --config-name config
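
The launcher uses Hydra-style configuration, so you can also override configuration values directly on the command line instead of editing the YAML files. The overrides below are illustrative assumptions; use keys that actually exist in your config.yaml and recipe.

# Example: select a different recipe at launch time (override keys are illustrative).
python3 main.py --config-path recipes_collection --config-name config \
    recipes=training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain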

For more information about configuring SageMaker training jobs, see Run a training job on SageMaker training jobs.
