Get started with the SageMaker model parallelism library v2

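On this page, you'll learn how to use the SageMaker model parallelism library (SMP) v2 APIs and start running a PyTorch Fully Sharded Data Parallel (FSDP) training job on the SageMaker Training platform or on a SageMaker HyperPod cluster.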

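There are several scenarios for running a PyTorch training job with SMP v2.

For SageMaker training, use one of the pre-built SageMaker Framework Containers for PyTorch v2.0.1 and later, which come pre-packaged with SMP v2.
Use the SMP v2 binary file to set up a Conda environment for running a distributed training workload on a SageMaker HyperPod cluster.
Extend the pre-built SageMaker Framework Containers for PyTorch v2.0.1 and later to install any additional functional requirements for your use case. To learn how to extend a pre-built container, see Extend a Pre-built Container.
You can also bring your own Docker container, manually set up the entire SageMaker Training environment using the SageMaker Training toolkit, and install the SMP v2 binary file. This is the least recommended option because of the complexity of the dependencies. To learn how to run your own Docker container, see Adapting Your Own Training Container.

This getting started guide covers the first two scenarios.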

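Step 1: Adapt your PyTorch FSDP training script

To activate and configure the SMP v2 library, import the library and add a call to torch.sagemaker.init() at the top of your script. It takes the SMP configuration dictionary of SMP v2 core feature configuration parameters that you'll prepare in Step 2: Launch a training job. To use the various core features offered by SMP v2, you might also need to make a few more changes to your training script. For more detailed instructions on adapting your training script to use the SMP v2 core features, see Core features of the SageMaker model parallelism library v2.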
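
As a minimal sketch, the top of an adapted training script might look like the following. The alias tsm, the helper init_smp_if_available, and the guarded import are illustrative choices rather than requirements of the library. The guard assumes that torch.sagemaker is importable only where SMP v2 is installed, and it lets the same script be loaded elsewhere without failing.

```python
# Sketch: activate SMP v2 at the top of a PyTorch FSDP training script.
# Assumption: torch.sagemaker is importable only where SMP v2 is installed
# (a pre-built SageMaker container or the Conda environment on HyperPod).
try:
    import torch.sagemaker as tsm  # "tsm" is an illustrative alias
except ImportError:
    tsm = None  # SMP v2 is not installed in this environment


def init_smp_if_available():
    """Call torch.sagemaker.init() when SMP v2 is present.

    Returns True if SMP was initialized, False if the library is missing.
    This helper is hypothetical; a real script can call tsm.init() directly.
    """
    if tsm is None:
        return False
    # Called without arguments here; the SMP configuration itself is
    # prepared in Step 2.
    tsm.init()
    return True


smp_enabled = init_smp_if_available()
print(f"SMP v2 initialized: {smp_enabled}")
```

In a script that only ever runs inside an SMP v2 environment, the guard can be dropped and the import and init call placed directly at the top of the file.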

Step 2: Launch a training job

Learn how to configure SMP distribution options for launching a PyTorch FSDP training job with SMP core features.

SageMaker Training

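When you set up a training job launcher object of the PyTorch framework estimator class in the SageMaker Python SDK, configure the SMP v2 core feature configuration parameters through the distribution argument as follows.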

Note

The distribution configuration for SMP v2 is integrated in the SageMaker Python SDK starting from v2.200. Make sure that you use SageMaker Python SDK v2.200 or later.
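
One way to check this requirement before launching a job is sketched below. The helper names are hypothetical, and the check compares only the major and minor version numbers. Comparing them as integers matters because a plain string comparison would rank "2.99" above "2.200".

```python
import re
from importlib import metadata

MIN_SDK_VERSION = (2, 200)  # from the note above: v2.200 or later


def parse_major_minor(version):
    """Return (major, minor) as integers from a string such as '2.200.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def sdk_supports_smp_v2(version):
    """True if the given SageMaker Python SDK version is at least v2.200."""
    return parse_major_minor(version) >= MIN_SDK_VERSION


try:
    installed = metadata.version("sagemaker")
    print(f"sagemaker {installed}: supports SMP v2 = {sdk_supports_smp_v2(installed)}")
except metadata.PackageNotFoundError:
    print("The sagemaker package is not installed in this environment.")
```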

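Note

In SMP v2, you should configure smdistributed together with torch_distributed in the distribution argument of the SageMaker PyTorch estimator. With torch_distributed, SageMaker runs torchrun, which is the default multi-node job launcher of PyTorch Distributed.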


```
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    framework_version="2.2.0",
    py_version="py310",
    # image_uri="<smp-docker-image-uri>", # For using prior versions, specify the SMP image URI directly.
    entry_point="your-training-script.py", # Pass the training script you adapted with SMP from Step 1.
    # ... Configure other required and optional parameters
    distribution={
        "torch_distributed": { "enabled": True },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    # Replace each Integer or Boolean placeholder with a value of that type.
                    "hybrid_shard_degree": Integer,
                    "sm_activation_offloading": Boolean,
                    "activation_loading_horizon": Integer,
                    "fsdp_cache_flush_warnings": Boolean,
                    "allow_empty_shards": Boolean,
                    "tensor_parallel_degree": Integer,
                    "expert_parallel_degree": Integer,
                    "random_seed": Integer
                }
            }
        }
    }
)
```
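
To make the placeholders concrete, the following sketch builds the distribution dictionary with filled-in values. The numbers are assumptions chosen only for illustration, not recommendations, and build_smp_distribution is a hypothetical helper.

```python
import json

# Illustrative values only; tune each one for your model and cluster.
smp_parameters = {
    "hybrid_shard_degree": 8,
    "sm_activation_offloading": True,
    "activation_loading_horizon": 2,
    "fsdp_cache_flush_warnings": False,
    "allow_empty_shards": True,
    "tensor_parallel_degree": 1,
    "expert_parallel_degree": 1,
    "random_seed": 12345,
}


def build_smp_distribution(parameters):
    """Wrap SMP v2 parameters in the structure shown for the
    distribution argument of the PyTorch estimator."""
    return {
        "torch_distributed": {"enabled": True},
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": dict(parameters),  # copy, so the input stays untouched
            }
        },
    }


distribution = build_smp_distribution(smp_parameters)
print(json.dumps(distribution, indent=2))
```

The resulting dictionary can then be passed as distribution=distribution when constructing the estimator.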

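Important

To use one of the prior versions of PyTorch or SMP instead of the latest, you need to specify the SMP Docker image directly using the image_uri argument instead of the framework_version and py_version pair. The following is an example.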


```
estimator = PyTorch(
    ...,
    image_uri="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121"
)
```

To find the SMP Docker image URIs, see Supported frameworks.
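
The image tag appears to carry the same information that the framework_version and py_version pair would otherwise supply. The sketch below splits the example URI into its parts. The tag layout (framework version, device, Python version, CUDA version) is inferred from that single example and may not hold for every image, and parse_smp_image_uri is a hypothetical helper.

```python
def parse_smp_image_uri(image_uri):
    """Split an SMP image URI into its repository and tag components.

    Assumes the tag layout seen in the example URI:
    <framework_version>-<device>-<py_version>-<cuda_version>
    """
    repository, tag = image_uri.rsplit(":", 1)
    framework_version, device, py_version, cuda_version = tag.split("-")
    return {
        "repository": repository,
        "framework_version": framework_version,
        "device": device,
        "py_version": py_version,
        "cuda_version": cuda_version,
    }


example_uri = (
    "658645717510.dkr.ecr.us-west-2.amazonaws.com/"
    "smdistributed-modelparallel:2.2.0-gpu-py310-cu121"
)
parts = parse_smp_image_uri(example_uri)
for name, value in parts.items():
    print(f"{name}: {value}")
```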

SageMaker HyperPod

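Before you start, make sure that the following prerequisites are met.

An Amazon FSx shared directory mounted to your HyperPod cluster (/fsx).
Conda installed in the FSx shared directory. To learn how to install Conda, follow the instructions at Installing on Linux in the Conda User Guide.
cuda11.8 or cuda12.1 installed on the head and compute nodes of your HyperPod cluster.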
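
The prerequisites above can be checked on a node with a short script such as the following sketch. It assumes that CUDA is installed under /usr/local/cuda-<version> and that Conda has already been put on PATH, for example by activating it from the FSx install. The function name and the report keys are hypothetical.

```python
import os
import shutil


def check_hyperpod_prerequisites(fsx_mount="/fsx", cuda_versions=("11.8", "12.1")):
    """Report which prerequisites appear to be satisfied on this node.

    Assumptions: CUDA lives under /usr/local/cuda-<version>, and Conda is
    discoverable on PATH.
    """
    cuda_found = [
        version
        for version in cuda_versions
        if os.path.isdir(f"/usr/local/cuda-{version}")
    ]
    return {
        "fsx_mounted": os.path.ismount(fsx_mount),
        "conda_on_path": shutil.which("conda") is not None,
        "cuda_versions_found": cuda_found,
    }


report = check_hyperpod_prerequisites()
for name, value in report.items():
    print(f"{name}: {value}")
```

Run it on the head node and on a compute node, since CUDA needs to be present on both.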

If all the prerequisites are met, proceed with the following instructions for launching a workload with SMP v2 on a HyperPod cluster.

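Prepare an smp_config.json file that contains a dictionary of SMP v2 core feature configuration parameters. Make sure that you upload this JSON file to the location where you store your training script, or to the path you passed to torch.sagemaker.init() in Step 1. If you already passed the configuration dictionary to torch.sagemaker.init() in the training script in Step 1, you can skip this step.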
```
// smp_config.json
{
    "hybrid_shard_degree": Integer,
    "sm_activation_offloading": Boolean,
    "activation_loading_horizon": Integer,
    "fsdp_cache_flush_warnings": Boolean,
    "allow_empty_shards": Boolean,
    "tensor_parallel_degree": Integer,
    "expert_parallel_degree": Integer,
    "random_seed": Integer
}
```
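Upload the smp_config.json file to a directory in your file system. The directory path must match the path you specified in Step 1. If you already passed the configuration dictionary to torch.sagemaker.init() in the training script, you can skip this step.
On the compute nodes of your cluster, start a terminal session with the following command.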
```
sudo su -l ubuntu
```
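
Before uploading smp_config.json, it can help to confirm that each value has the type the template calls for. The sketch below validates a configuration against the Integer and Boolean placeholders listed in the template, then writes it to a file. The sample values are illustrative assumptions, the helper names are hypothetical, and the file is written to a temporary directory here only so that the example is self-contained.

```python
import json
import tempfile
from pathlib import Path

# Expected type of each parameter, taken from the smp_config.json template.
EXPECTED_TYPES = {
    "hybrid_shard_degree": int,
    "sm_activation_offloading": bool,
    "activation_loading_horizon": int,
    "fsdp_cache_flush_warnings": bool,
    "allow_empty_shards": bool,
    "tensor_parallel_degree": int,
    "expert_parallel_degree": int,
    "random_seed": int,
}


def validate_smp_config(config):
    """Return a list of problems; an empty list means every type matches."""
    problems = []
    for key, value in config.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            problems.append(f"{key} is not listed in the template")
        elif expected is int and (isinstance(value, bool) or not isinstance(value, int)):
            # bool is a subclass of int in Python, so it is excluded explicitly.
            problems.append(f"{key} should be an integer")
        elif expected is bool and not isinstance(value, bool):
            problems.append(f"{key} should be a boolean")
    return problems


def write_smp_config(config, directory):
    """Write the configuration as smp_config.json and return its path."""
    path = Path(directory) / "smp_config.json"
    path.write_text(json.dumps(config, indent=4) + "\n")
    return path


# Illustrative values only.
smp_config = {
    "hybrid_shard_degree": 8,
    "sm_activation_offloading": True,
    "activation_loading_horizon": 2,
    "fsdp_cache_flush_warnings": False,
    "allow_empty_shards": True,
    "tensor_parallel_degree": 1,
    "expert_parallel_degree": 1,
    "random_seed": 12345,
}

issues = validate_smp_config(smp_config)
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = write_smp_config(smp_config, tmp_dir)
    reloaded = json.loads(config_path.read_text())
print(f"problems found: {issues}")
```

On a cluster, replace the temporary directory with the directory under /fsx that your training script reads from.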

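Set up a Conda environment on the compute nodes. The following code is an example script that creates a Conda environment and installs SMP, SMDDP, CUDA, and other dependencies.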


```
# Run on compute nodes
SMP_CUDA_VER=<11.8 or 12.1>

source /fsx/<path_to_miniconda>/miniconda3/bin/activate

export ENV_PATH=/fsx/<path_to_miniconda>/miniconda3/envs/<ENV_NAME>
conda create -p ${ENV_PATH} python=3.10

conda activate ${ENV_PATH}

# Verify aws-cli is installed: Expect something like "aws-cli/2.15.0*"
aws --version
# Install aws-cli if not already installed
# https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#cliv2-linux-install

# Install the SMP library
conda install pytorch="2.0.1=sm_py3.10_cuda${SMP_CUDA_VER}*" packaging --override-channels \
  -c https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-2.0.0-pt-2.0.1/2023-12-11/smp-v2/ \
  -c pytorch -c numba/label/dev \
  -c nvidia -c conda-forge

# Install dependencies of the script as below
python -m pip install packaging transformers==4.31.0 accelerate ninja tensorboard h5py datasets \
    && python -m pip install expecttest hypothesis \
    && python -m pip install "flash-attn>=2.0.4" --no-build-isolation

# Install the SMDDP wheel
SMDDP_WHL="smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl" \
  && wget -q https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/${SMDDP_WHL} \
  && pip install --force ${SMDDP_WHL} \
  && rm ${SMDDP_WHL}

# Run only one of the two cuDNN blocks below: the one that matches SMP_CUDA_VER.

# cuDNN installation for Transformer Engine installation for CUDA 11.8
# Please download from below link, you need to agree to terms
# https://developer.nvidia.com/downloads/compute/cudnn/secure/8.9.5/local_installers/11.x/cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz

tar xf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
    && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
    && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
    && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
    && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
    && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive/

# cuDNN installation for Transformer Engine installation for CUDA 12.1
# Please download from below link, you need to agree to terms
# https://developer.download.nvidia.com/compute/cudnn/secure/8.9.7/local_installers/12.x/cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz

tar xf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
    && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
    && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
    && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
    && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
    && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive/

# TransformerEngine installation
export CUDA_HOME=/usr/local/cuda-$SMP_CUDA_VER
export CUDNN_PATH=/usr/local/cuda-$SMP_CUDA_VER/lib
export CUDNN_LIBRARY=/usr/local/cuda-$SMP_CUDA_VER/lib
export CUDNN_INCLUDE_DIR=/usr/local/cuda-$SMP_CUDA_VER/include
export PATH=/usr/local/cuda-$SMP_CUDA_VER/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-$SMP_CUDA_VER/lib

python -m pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v1.0
```

Run a test training job.
1. In the shared file system (/fsx), clone the Awsome Distributed Training GitHub repository, and go to the 3.test_cases/11.modelparallel folder.
```
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/11.modelparallel
```
2. Submit a job using sbatch as follows.
```
conda activate <ENV_PATH>
sbatch -N 16 conda_launch.sh
```
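  If the job submission is successful, the output message of this sbatch command should be similar to Submitted batch job ABCDEF.
3. Check the log file in the current directory under logs/.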
```
tail -f ./logs/fsdp_smp_ABCDEF.out
```
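
The job identifier printed by sbatch is the same one that appears in the log file name. As a small sketch of that relationship, the following hypothetical helper extracts the identifier from the sbatch output and builds the log path. The fsdp_smp prefix and the logs directory are taken from the example above and may differ for other launch scripts.

```python
import re


def log_path_from_sbatch(output, log_dir="./logs", prefix="fsdp_smp"):
    """Build the log file path from an sbatch submission message.

    Expects a message of the form 'Submitted batch job <ID>'.
    """
    match = re.search(r"Submitted batch job (\S+)", output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {output!r}")
    return f"{log_dir}/{prefix}_{match.group(1)}.out"


print(log_path_from_sbatch("Submitted batch job ABCDEF"))
```

The returned path can be passed to tail -f, as in the last step above.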
