Le traduzioni sono generate tramite traduzione automatica. In caso di conflitto tra il contenuto di una traduzione e la versione originale in Inglese, quest'ultima prevarrà.
Per creare il tuo contenitore Docker per l'addestramento e utilizzare la libreria parallela di dati SageMaker AI, devi includere le dipendenze corrette e i file binari delle librerie parallele distribuite SageMaker AI nel tuo Dockerfile. Questa sezione fornisce istruzioni su come creare un Dockerfile completo con il set minimo di dipendenze per l'addestramento distribuito nell' SageMaker intelligenza artificiale utilizzando la libreria parallela di dati.
Nota
Questa opzione Docker personalizzata con la libreria parallela di dati SageMaker AI come binario è disponibile solo per PyTorch.
Per creare un Dockerfile con il toolkit di SageMaker formazione e la libreria data parallel
-
Inizia con un'immagine Docker di NVIDIA CUDA
. Usa le versioni per sviluppatori cuDNN che contengono runtime CUDA e strumenti di sviluppo (intestazioni e librerie) per creare dal codice sorgente. PyTorch FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
Suggerimento
Le immagini ufficiali del AWS Deep Learning Container (DLC) sono create a partire dalle immagini di base di NVIDIA CUDA.
Se desideri utilizzare le immagini DLC predefinite come riferimenti mentre segui il resto delle istruzioni, consulta AWS Deep Learning Containers for PyTorch Dockerfiles. -
Aggiungi i seguenti argomenti per specificare le versioni e gli altri pacchetti. PyTorch Inoltre, indica i percorsi dei bucket Amazon S3 verso la libreria parallela di SageMaker dati AI e altri software per utilizzare AWS le risorse, come il plug-in Amazon S3.
Per utilizzare versioni delle librerie di terze parti diverse da quelle fornite nel seguente esempio di codice, ti consigliamo di consultare i Dockerfile ufficiali di AWS Deep Learning Container PyTorch per
trovare versioni testate, compatibili e adatte alla tua applicazione. URLs Per trovare l'
SMDATAPARALLEL_BINARY
argomento, consulta le tabelle di ricerca all'indirizzo. Framework supportatiARG PYTORCH_VERSION=
1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl ARG CONDA_PREFIX="/opt/conda" ARG BRANCH_OFI=1.1.3-aws
-
Imposta le seguenti variabili di ambiente per creare correttamente i componenti di SageMaker addestramento ed eseguire la libreria parallela di dati. Queste variabili vengono utilizzate per i componenti nelle fasi successive.
# Set ENV variables required to build PyTorch ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0" ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all" ENV NCCL_VERSION=2.10.3 # Add OpenMPI to the path. ENV PATH /opt/amazon/openmpi/bin:$PATH # Add Conda to path ENV PATH $CONDA_PREFIX/bin:$PATH # Set this enviroment variable for SageMaker AI to launch SMDDP correctly. ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main # Add enviroment variable for processes to be able to call fork() ENV RDMAV_FORK_SAFE=1 # Indicate the container type ENV DLC_CONTAINER_TYPE=training # Add EFA and SMDDP to LD library path ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH" ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
-
Installa o aggiorna
curl
,wget
egit
per scaricare e creare pacchetti nelle fasi successive.RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ curl \ wget \ git \ && rm -rf /var/lib/apt/lists/*
-
Installa il software Elastic Fabric Adapter (EFA) per la comunicazione di rete Amazon. EC2
RUN DEBIAN_FRONTEND=noninteractive apt-get update RUN mkdir /tmp/efa \ && cd /tmp/efa \ && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \ && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ && cd aws-efa-installer \ && ./efa_installer.sh -y --skip-kmod -g \ && rm -rf /tmp/efa
-
Installa Conda
per eseguire la gestione dei pacchetti. RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p $CONDA_PREFIX && \ rm ~/miniconda.sh && \ $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \ $CONDA_PREFIX/bin/conda clean -ya
-
Scarica, crea e installa PyTorch e le relative dipendenze. Creiamo PyTorch a partire dal codice sorgente
perché dobbiamo avere il controllo della versione NCCL per garantire la compatibilità con il plug-in AWS OFI NCCL. -
Seguendo i passaggi del dockerfile PyTorch ufficiale
, installa build dependencies e configura ccache per velocizzare la ricompilazione. RUN DEBIAN_FRONTEND=noninteractive \ apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ cmake \ git \ libjpeg-dev \ libpng-dev \ && rm -rf /var/lib/apt/lists/* # Setup ccache RUN /usr/sbin/update-ccache-symlinks RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
-
Le dipendenze comuni di Install PyTorch e Linux.
# Common dependencies for PyTorch RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses # Linux specific dependency for PyTorch RUN conda install -c pytorch magma-cuda113
-
Clona il PyTorch GitHub repository.
RUN --mount=type=cache,target=/opt/ccache \ cd / \ && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
-
Installa e crea una versione NCCL
specifica. Per fare ciò, sostituisci il contenuto nella cartella NCCL predefinita ( /pytorch/third_party/nccl
) con la PyTorch versione NCCL specifica del repository NVIDIA. La versione NCCL è stata impostata nella fase 3 di questa guida.RUN cd /pytorch/third_party/nccl \ && rm -rf nccl \ && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \ && cd nccl \ && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \ && make pkg.txz.build \ && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
-
PyTorchCompila e installa. Il completamento di questo processo richiede in genere poco più di 1 ora. È costruito utilizzando la versione NCCL scaricata nella fase precedente.
RUN cd /pytorch \ && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \ python setup.py install \ && rm -rf /pytorch
-
-
Crea e installa il plugin AWS OFI NCCL
. Ciò consente il supporto libfabric per la libreria parallela di dati SageMaker AI. RUN DEBIAN_FRONTEND=noninteractive apt-get update \ && apt-get install -y --no-install-recommends \ autoconf \ automake \ libtool RUN mkdir /tmp/efa-ofi-nccl \ && cd /tmp/efa-ofi-nccl \ && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \ && cd aws-ofi-nccl \ && ./autogen.sh \ && ./configure --with-libfabric=/opt/amazon/efa \ --with-mpi=/opt/amazon/openmpi \ --with-cuda=/usr/local/cuda \ --with-nccl=$CONDA_PREFIX \ && make \ && make install \ && rm -rf /tmp/efa-ofi-nccl
-
Compila e installa TorchVision
. RUN pip install --no-cache-dir -U \ packaging \ mpi4py==3.0.3 RUN cd /tmp \ && git clone https://github.com/pytorch/vision.git -b v0.9.1 \ && cd vision \ && BUILD_VERSION="0.9.1+cu111" python setup.py install \ && cd /tmp \ && rm -rf vision
-
Installa e configura OpenSSH. OpenSSH è necessario per consentire a MPI di comunicare tra container. Consenti a OpenSSH di comunicare con i container senza chiedere conferma.
RUN apt-get update \ && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \ && apt-get install -y --no-install-recommends openssh-client openssh-server \ && mkdir -p /var/run/sshd \ && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \ && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \ && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \ && rm -rf /var/lib/apt/lists/* # Configure OpenSSH so that nodes can communicate with each other RUN mkdir -p /var/run/sshd && \ sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd RUN rm -rf /root/.ssh/ && \ mkdir -p /root/.ssh/ && \ ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \ cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \ && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
-
Installa il plug-in PT S3 per accedere in modo efficiente ai set di dati in Amazon S3.
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU} RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
-
Installa la libreria libboost
. Questo pacchetto è necessario per collegare in rete la funzionalità IO asincrona della libreria parallela di SageMaker dati AI. WORKDIR / RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \ && tar -xzf boost_1_73_0.tar.gz \ && cd boost_1_73_0 \ && ./bootstrap.sh \ && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \ && cd .. \ && rm -rf boost_1_73_0.tar.gz \ && rm -rf boost_1_73_0 \ && cd ${CONDA_PREFIX}/include/boost
-
Installa i seguenti strumenti di SageMaker intelligenza artificiale per la formazione. PyTorch
WORKDIR /root RUN pip install --no-cache-dir -U \ smclarify \ "sagemaker>=2,<3" \ sagemaker-experiments==0.* \ sagemaker-pytorch-training
-
Infine, installa il binario parallelo dei dati SageMaker AI e le dipendenze rimanenti.
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ jq \ libhwloc-dev \ libnuma1 \ libnuma-dev \ libssl1.1 \ libtool \ hwloc \ && rm -rf /var/lib/apt/lists/* RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
-
Dopo aver finito di creare il Dockerfile, consulta Adapting Your Own Training Container per scoprire come creare il contenitore Docker, ospitarlo in Amazon ECR ed eseguire un processo di formazione utilizzando l'SDK Python. SageMaker
Il codice di esempio seguente mostra un Dockerfile completo dopo aver combinato tutti i blocchi di codice precedenti.
# This file creates a docker image with minimum dependencies to run SageMaker AI data parallel training
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
# Set appropiate versions and location for components
ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws
# Set ENV variables required to build PyTorch
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3
# Add OpenMPI to the path.
ENV PATH /opt/amazon/openmpi/bin:$PATH
# Add Conda to path
ENV PATH $CONDA_PREFIX/bin:$PATH
# Set this enviroment variable for SageMaker AI to launch SMDDP correctly.
ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
# Add enviroment variable for processes to be able to call fork()
ENV RDMAV_FORK_SAFE=1
# Indicate the container type
ENV DLC_CONTAINER_TYPE=training
# Add EFA and SMDDP to LD library path
ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
# Install basic dependencies to download and build other dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
curl \
wget \
git \
&& rm -rf /var/lib/apt/lists/*
# Install EFA.
# This is required for SMDDP backend communication
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
&& cd /tmp/efa \
&& curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
&& tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y --skip-kmod -g \
&& rm -rf /tmp/efa
# Install Conda
RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p $CONDA_PREFIX && \
rm ~/miniconda.sh && \
$CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
$CONDA_PREFIX/bin/conda clean -ya
# Install PyTorch.
# Start with dependencies listed in official PyTorch dockerfile
# https://github.com/pytorch/pytorch/blob/master/Dockerfile
RUN DEBIAN_FRONTEND=noninteractive \
apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
ccache \
cmake \
git \
libjpeg-dev \
libpng-dev && \
rm -rf /var/lib/apt/lists/*
# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
# Common dependencies for PyTorch
RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
# Linux specific dependency for PyTorch
RUN conda install -c pytorch magma-cuda113
# Clone PyTorch
RUN --mount=type=cache,target=/opt/ccache \
cd / \
&& git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
# Note that we need to use the same NCCL version for PyTorch and OFI plugin.
# To enforce that, install NCCL from source before building PT and OFI plugin.
# Install NCCL.
# Required for building OFI plugin (OFI requires NCCL's header files and library)
RUN cd /pytorch/third_party/nccl \
&& rm -rf nccl \
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
&& cd nccl \
&& make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
&& make pkg.txz.build \
&& tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
# Build and install PyTorch.
RUN cd /pytorch \
&& CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
python setup.py install \
&& rm -rf /pytorch
RUN ccache -C
# Build and install OFI plugin. \
# It is required to use libfabric.
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
&& apt-get install -y --no-install-recommends \
autoconf \
automake \
libtool
RUN mkdir /tmp/efa-ofi-nccl \
&& cd /tmp/efa-ofi-nccl \
&& git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
&& cd aws-ofi-nccl \
&& ./autogen.sh \
&& ./configure --with-libfabric=/opt/amazon/efa \
--with-mpi=/opt/amazon/openmpi \
--with-cuda=/usr/local/cuda \
--with-nccl=$CONDA_PREFIX \
&& make \
&& make install \
&& rm -rf /tmp/efa-ofi-nccl
# Build and install Torchvision
RUN pip install --no-cache-dir -U \
packaging \
mpi4py==3.0.3
RUN cd /tmp \
&& git clone https://github.com/pytorch/vision.git -b v0.9.1 \
&& cd vision \
&& BUILD_VERSION="0.9.1+cu111" python setup.py install \
&& cd /tmp \
&& rm -rf vision
# Install OpenSSH.
# Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation
RUN apt-get update \
&& apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
&& apt-get install -y --no-install-recommends openssh-client openssh-server \
&& mkdir -p /var/run/sshd \
&& cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
&& echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
&& mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
&& rm -rf /var/lib/apt/lists/*
# Configure OpenSSH so that nodes can communicate with each other
RUN mkdir -p /var/run/sshd && \
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
mkdir -p /root/.ssh/ && \
ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
&& printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
# Install PT S3 plugin.
# Required to efficiently access datasets in Amazon S3
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
# Install libboost from source.
# This package is needed for smdataparallel functionality (for networking asynchronous IO).
WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
&& tar -xzf boost_1_73_0.tar.gz \
&& cd boost_1_73_0 \
&& ./bootstrap.sh \
&& ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
&& cd .. \
&& rm -rf boost_1_73_0.tar.gz \
&& rm -rf boost_1_73_0 \
&& cd ${CONDA_PREFIX}/include/boost
# Install SageMaker AI PyTorch training.
WORKDIR /root
RUN pip install --no-cache-dir -U \
smclarify \
"sagemaker>=2,<3" \
sagemaker-experiments==0.* \
sagemaker-pytorch-training
# Install SageMaker AI data parallel binary (SMDDP)
# Start with dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
jq \
libhwloc-dev \
libnuma1 \
libnuma-dev \
libssl1.1 \
libtool \
hwloc \
&& rm -rf /var/lib/apt/lists/*
# Install SMDDP
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
Suggerimento
Se desideri estendere il Dockerfile personalizzato per incorporare la libreria parallela del SageMaker modello AI, consulta. Crea il tuo contenitore Docker con la Distributed Model Parallel Library SageMaker