Using EFA on the DLAMI

The following section describes how to use EFA to run multi-node applications on the AWS Deep Learning AMI.

Running multi-node applications with EFA

To run an application across a cluster of nodes, the following configuration is required.

Topics

Enable passwordless SSH
Create a hosts file
NCCL tests

Enable passwordless SSH

Select one node in your cluster as the leader node. The remaining nodes are referred to as member nodes.

On the leader node, generate the RSA key pair.
```
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
```
Change the permissions of the private key on the leader node.
```
chmod 600 ~/.ssh/id_rsa
```
Copy the public key ~/.ssh/id_rsa.pub and append it to ~/.ssh/authorized_keys on each member node in the cluster.
You should now be able to log in directly to the member nodes from the leader node using their private IP addresses.
```
ssh <member private ip>
```
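The key-copy step above can be done by hand. As a minimal sketch, the helper below appends a public key to an authorized_keys file only if that exact line is not already present, so it is safe to re-run. The function name add_key_once is hypothetical and not part of the DLAMI; the demo uses temporary files and a made-up key, and the commented ssh line is an assumption about how you might reach a member node before passwordless SSH is in place.

```shell
#!/bin/sh
# Hypothetical helper: append a public key to an authorized_keys file
# only when that exact line is not already there.
add_key_once() {
    pubkey_file=$1
    auth_file=$2
    key=$(cat "$pubkey_file")
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -x: whole-line match, -F: fixed string, -q: quiet
    grep -qxF -e "$key" "$auth_file" || printf '%s\n' "$key" >> "$auth_file"
}

# Demo against temporary files (no cluster needed).
demo_dir=$(mktemp -d)
printf 'ssh-rsa AAAAexamplekey user@leader\n' > "$demo_dir/id_rsa.pub"
add_key_once "$demo_dir/id_rsa.pub" "$demo_dir/authorized_keys"
add_key_once "$demo_dir/id_rsa.pub" "$demo_dir/authorized_keys"   # second call adds nothing
wc -l < "$demo_dir/authorized_keys"
rm -rf "$demo_dir"

# On a real cluster (assumption: you can still reach the member with your EC2 key pair):
#   cat ~/.ssh/id_rsa.pub | ssh -i <your-key.pem> <member private ip> 'cat >> ~/.ssh/authorized_keys'
```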
Disable strict host key checking and enable agent forwarding on the leader node by adding the following to the ~/.ssh/config file on the leader node:
```
Host *
    ForwardAgent yes
    StrictHostKeyChecking no
```
On Amazon Linux 2 instances, run the following command on the leader node to give the config file the correct permissions:
```
chmod 600 ~/.ssh/config
```

Create a hosts file

On the leader node, create a hosts file to identify the nodes in the cluster. The hosts file must have one entry for each node in the cluster. Create a file ~/hosts and add each node using its private IP address as follows:


localhost slots=8
<private ip of node 1> slots=8
<private ip of node 2> slots=8
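
For larger clusters, the same file can be generated rather than typed. The sketch below is one way to do that; make_hostfile is a hypothetical helper name, the IP addresses are made up, and it assumes every node has the same number of slots (one per GPU, as in the entries above).

```shell
#!/bin/sh
# Hypothetical helper: print one "<host> slots=<n>" line per node.
# Usage: make_hostfile <slots-per-node> <host> [<host> ...]
make_hostfile() {
    slots=$1
    shift
    for host in "$@"; do
        printf '%s slots=%s\n' "$host" "$slots"
    done
}

# Example with made-up private IPs; replace them with your own.
make_hostfile 8 localhost 172.31.0.11 172.31.0.12
```

Redirecting the output, for example `make_hostfile 8 localhost 172.31.0.11 172.31.0.12 > ~/hosts`, would write the file at the location used in this section.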

NCCL tests

Note

These tests were run using EFA version 1.30.0 and OFI NCCL plugin 1.7.4.

The following is a subset of the NCCL tests provided by Nvidia to test both functionality and performance across multiple compute nodes.

Supported instances: P3dn, P4, P5

Multi-node NCCL message transfer test

nccl_message_transfer is a simple test to ensure that the NCCL OFI plugin works as expected. The test validates the functionality of NCCL's connection establishment and data transfer APIs. Make sure you use the full path to mpirun, as shown in the example, when running NCCL applications with EFA. Change the np and N parameters based on the number of instances and GPUs in your cluster. For more information, see the AWS OFI NCCL documentation.

The following nccl_message_transfer test is for a generic CUDA xx.x version. You can run the commands for any CUDA version available on your Amazon EC2 instance by replacing the CUDA version in the script.


$ /opt/amazon/openmpi/bin/mpirun -n 2 -N 1 --hostfile hosts \
-x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:$LD_LIBRARY_PATH \
--mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
/opt/aws-ofi-nccl/tests/nccl_message_transfer

The output should look similar to the following. You can check the output to confirm that EFA is being used as the OFI provider.


INFO: Function: nccl_net_ofi_init Line: 1069: NET/OFI Selected Provider is efa (found 4 nics)
INFO: Function: nccl_net_ofi_init Line: 1160: NET/OFI Using transport protocol SENDRECV
INFO: Function: configure_ep_inorder Line: 261: NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
INFO: Function: configure_nccl_proto Line: 227: NET/OFI Setting NCCL_PROTO to "simple"
INFO: Function: main Line: 86: NET/OFI Process rank 1 started. NCCLNet device used on ip-172-31-13-179 is AWS Libfabric.
INFO: Function: main Line: 91: NET/OFI Received 4 network devices
INFO: Function: main Line: 111: NET/OFI Network supports communication using CUDA buffers. Dev: 3
INFO: Function: main Line: 118: NET/OFI Server: Listening on dev 3
INFO: Function: main Line: 131: NET/OFI Send connection request to rank 1
INFO: Function: main Line: 173: NET/OFI Send connection request to rank 0
INFO: Function: main Line: 137: NET/OFI Server: Start accepting requests
INFO: Function: main Line: 141: NET/OFI Successfully accepted connection from rank 1
INFO: Function: main Line: 145: NET/OFI Send 8 requests to rank 1
INFO: Function: main Line: 179: NET/OFI Server: Start accepting requests
INFO: Function: main Line: 183: NET/OFI Successfully accepted connection from rank 0
INFO: Function: main Line: 187: NET/OFI Rank 1 posting 8 receive buffers
INFO: Function: main Line: 161: NET/OFI Successfully sent 8 requests to rank 1
INFO: Function: main Line: 251: NET/OFI Got completions for 8 requests for rank 0
INFO: Function: main Line: 251: NET/OFI Got completions for 8 requests for rank 1
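
The provider check described above can be scripted once the test output has been saved to a file. The sketch below assumes you captured the output yourself (for example by piping the mpirun command through tee); uses_efa is a hypothetical helper name, and the demo runs on a single sample line taken from the output above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the log reports efa as the OFI provider.
uses_efa() {
    grep -q "NET/OFI Selected Provider is efa" "$1"
}

# Demo on a sample log line copied from the test output.
log=$(mktemp)
printf '%s\n' 'INFO: Function: nccl_net_ofi_init Line: 1069: NET/OFI Selected Provider is efa (found 4 nics)' > "$log"
if uses_efa "$log"; then
    echo "EFA provider selected"
else
    echo "EFA provider NOT selected"
fi
rm -f "$log"
```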

Multi-node NCCL performance test on P4d.24xlarge

To check NCCL performance with EFA, run the standard NCCL performance test available in the official NCCL tests repository. The DLAMI ships with this test already built for CUDA XX.X. You can also run your own script with EFA.

When building your own script, follow these guidelines:

Use the full path to mpirun, as shown in the example, when running NCCL applications with EFA.
Change the np and N parameters based on the number of instances and GPUs in your cluster.
Add the NCCL_DEBUG=INFO flag and make sure the logs report EFA usage as "Selected Provider is efa".
Set the location of the training log to parse for validation:
```
TRAINING_LOG="testEFA_$(date +"%N").log"
```
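
The guideline about matching the process counts to your cluster can be sketched as follows. This is an illustration only: total_slots and node_count are hypothetical helpers, the hostfile uses made-up addresses, and the per-node value assumes every node has the same number of GPUs. In Open MPI, -n is the total number of processes and -N is the number of processes per node.

```shell
#!/bin/sh
# Hypothetical helpers: read a hostfile whose lines look like "<host> slots=<n>".
total_slots() {
    awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^slots=/) { sub(/^slots=/, "", $i); sum += $i } }
         END { print sum + 0 }' "$1"
}
node_count() {
    awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Demo with a two-node hostfile (made-up IP address).
hostfile=$(mktemp)
printf 'localhost slots=8\n172.31.0.11 slots=8\n' > "$hostfile"
np=$(total_slots "$hostfile")
nodes=$(node_count "$hostfile")
echo "-n $np -N $(( np / nodes ))"    # prints: -n 16 -N 8
rm -f "$hostfile"
```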

Use the command watch nvidia-smi on any member node to monitor GPU usage. The following mpirun commands are for a generic CUDA xx.x version and depend on the operating system of your instance. You can run the commands for any CUDA version available on your Amazon EC2 instance by replacing the CUDA version in the script.

Amazon Linux 2 :


$ /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 \
-x NCCL_DEBUG=INFO -x NCCL_PROTO=simple --mca pml ^cm \
-x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH \
--hostfile hosts --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
/usr/local/cuda-xx.x/efa/test-cuda-xx.x/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100 | tee ${TRAINING_LOG}

Ubuntu 20.04:


$ /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 \
-x NCCL_DEBUG=INFO -x NCCL_PROTO=simple --mca pml ^cm \
-x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:$LD_LIBRARY_PATH \
--hostfile hosts --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
/usr/local/cuda-xx.x/efa/test-cuda-xx.x/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100 | tee ${TRAINING_LOG}

The output should look similar to the following:


# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   9591 on ip-172-31-4-37 device  0 [0x10] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid   9592 on ip-172-31-4-37 device  1 [0x10] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid   9593 on ip-172-31-4-37 device  2 [0x20] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid   9594 on ip-172-31-4-37 device  3 [0x20] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid   9595 on ip-172-31-4-37 device  4 [0x90] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid   9596 on ip-172-31-4-37 device  5 [0x90] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid   9597 on ip-172-31-4-37 device  6 [0xa0] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid   9598 on ip-172-31-4-37 device  7 [0xa0] NVIDIA A100-SXM4-40GB
#  Rank  8 Group  0 Pid  10216 on ip-172-31-13-179 device  0 [0x10] NVIDIA A100-SXM4-40GB
#  Rank  9 Group  0 Pid  10217 on ip-172-31-13-179 device  1 [0x10] NVIDIA A100-SXM4-40GB
#  Rank 10 Group  0 Pid  10218 on ip-172-31-13-179 device  2 [0x20] NVIDIA A100-SXM4-40GB
#  Rank 11 Group  0 Pid  10219 on ip-172-31-13-179 device  3 [0x20] NVIDIA A100-SXM4-40GB
#  Rank 12 Group  0 Pid  10220 on ip-172-31-13-179 device  4 [0x90] NVIDIA A100-SXM4-40GB
#  Rank 13 Group  0 Pid  10221 on ip-172-31-13-179 device  5 [0x90] NVIDIA A100-SXM4-40GB
#  Rank 14 Group  0 Pid  10222 on ip-172-31-13-179 device  6 [0xa0] NVIDIA A100-SXM4-40GB
#  Rank 15 Group  0 Pid  10223 on ip-172-31-13-179 device  7 [0xa0] NVIDIA A100-SXM4-40GB
ip-172-31-4-37:9591:9591 [0] NCCL INFO Bootstrap : Using ens32:172.31.4.37
ip-172-31-4-37:9591:9591 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-172-31-4-37:9591:9591 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-172-31-4-37:9591:9591 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.5+cuda12.2
...
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.7.4-aws
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Using CUDA runtime version 11070
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Configuring AWS-specific options
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Using CUDA runtime version 11070
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Configuring AWS-specific options
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Setting provider_filter to efa
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Disabling NVLS support due to NCCL version 21602
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Setting provider_filter to efa
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Disabling NVLS support due to NCCL version 21602
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
...
-----------------------------some output truncated-----------------------------------
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           0             0     float     sum      -1    11.02    0.00    0.00      0    11.04    0.00    0.00      0
           0             0     float     sum      -1    11.01    0.00    0.00      0    11.00    0.00    0.00      0
           0             0     float     sum      -1    11.02    0.00    0.00      0    11.02    0.00    0.00      0
           0             0     float     sum      -1    11.01    0.00    0.00      0    11.00    0.00    0.00      0
           0             0     float     sum      -1    11.02    0.00    0.00      0    11.02    0.00    0.00      0
         256             4     float     sum      -1    632.7    0.00    0.00      0    628.2    0.00    0.00      0
         512             8     float     sum      -1    627.4    0.00    0.00      0    629.6    0.00    0.00      0
        1024            16     float     sum      -1    632.2    0.00    0.00      0    631.7    0.00    0.00      0
        2048            32     float     sum      -1    631.0    0.00    0.00      0    634.2    0.00    0.00      0
        4096            64     float     sum      -1    623.3    0.01    0.01      0    633.6    0.01    0.01      0
        8192           128     float     sum      -1    635.1    0.01    0.01      0    633.5    0.01    0.01      0
       16384           256     float     sum      -1    634.8    0.03    0.02      0    637.0    0.03    0.02      0
       32768           512     float     sum      -1    647.9    0.05    0.05      0    636.8    0.05    0.05      0
       65536          1024     float     sum      -1    658.9    0.10    0.09      0    667.0    0.10    0.09      0
      131072          2048     float     sum      -1    671.9    0.20    0.18      0    662.9    0.20    0.19      0
      262144          4096     float     sum      -1    692.1    0.38    0.36      0    685.1    0.38    0.36      0
      524288          8192     float     sum      -1    715.3    0.73    0.69      0    696.6    0.75    0.71      0
     1048576         16384     float     sum      -1    734.6    1.43    1.34      0    729.2    1.44    1.35      0
     2097152         32768     float     sum      -1    785.9    2.67    2.50      0    794.5    2.64    2.47      0
     4194304         65536     float     sum      -1    837.2    5.01    4.70      0    837.6    5.01    4.69      0
     8388608        131072     float     sum      -1    929.2    9.03    8.46      0    931.4    9.01    8.44      0
    16777216        262144     float     sum      -1   1773.6    9.46    8.87      0   1772.8    9.46    8.87      0
    33554432        524288     float     sum      -1   2110.2   15.90   14.91      0   2116.1   15.86   14.87      0
    67108864       1048576     float     sum      -1   2650.9   25.32   23.73      0   2658.1   25.25   23.67      0
   134217728       2097152     float     sum      -1   3943.1   34.04   31.91      0   3945.9   34.01   31.89      0
   268435456       4194304     float     sum      -1   7216.5   37.20   34.87      0   7178.6   37.39   35.06      0
   536870912       8388608     float     sum      -1    13680   39.24   36.79      0    13676   39.26   36.80      0
[  1073741824      16777216    float     sum      -1    25645   41.87   39.25      0    25497   42.11   39.48      0 ] <- Used For Benchmark
...
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.46044
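
The summary line at the end of the output is easy to pull out of a saved log. As a small sketch, assuming the log contains a line of the form shown above, the hypothetical helper avg_busbw prints just the number; the demo feeds it the two summary lines from the sample output.

```shell
#!/bin/sh
# Hypothetical helper: print the value from the "# Avg bus bandwidth" summary line.
avg_busbw() {
    awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Demo on the summary lines copied from the sample output.
log=$(mktemp)
printf '%s\n' '# Out of bounds values : 0 OK' '# Avg bus bandwidth    : 7.46044' > "$log"
avg_busbw "$log"    # prints 7.46044
rm -f "$log"
```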

To verify that the EFA tests returned a valid result, use the following checks:

Get the instance type using the EC2 instance metadata:


TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_TYPE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type)

Run the performance tests

Set the following parameters:


CUDA_VERSION
CUDA_RUNTIME_VERSION
NCCL_VERSION
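
The page does not give example values for these parameters. The sketch below is one guess at how they could be filled in: the numbers are taken from the sample log lines quoted in the validation script on this page, cuda_runtime_code is a hypothetical helper, and treating CUDA_VERSION as a "major.minor" string is an assumption. The conversion follows the CUDA version encoding (major * 1000 + minor * 10), which is how 11.6 appears as 11060 in the logs.

```shell
#!/bin/sh
# Hypothetical helper: convert "major.minor" to the integer form that appears
# in the logs (major * 1000 + minor * 10), e.g. 11.6 -> 11060.
cuda_runtime_code() {
    major=${1%%.*}
    minor=${1#*.}
    echo $(( major * 1000 + minor * 10 ))
}

# Example values only; use the versions installed on your instance.
CUDA_VERSION=11.6
CUDA_RUNTIME_VERSION=$(cuda_runtime_code "$CUDA_VERSION")
NCCL_VERSION=2.16.2
echo "CUDA_VERSION=$CUDA_VERSION CUDA_RUNTIME_VERSION=$CUDA_RUNTIME_VERSION NCCL_VERSION=$NCCL_VERSION"
```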

Validate the results as shown:


RETURN_VAL=$?
if [ ${RETURN_VAL} -eq 0 ]; then

    # How the version numbers appear in the logs:
    # 
    # ip-172-31-27-205:6427:6427 [0] NCCL INFO cudaDriverVersion 12020
    # NCCL version 2.16.2+cuda11.8
    # ip-172-31-27-205:6427:6820 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.7.1-aws
    # ip-172-31-27-205:6427:6820 [0] NCCL INFO NET/OFI Using CUDA runtime version 11060

    # cudaDriverVersion 12020  --> maximum CUDA version supported by the NVIDIA driver
    # NCCL version 2.16.2+cuda11.8 --> NCCL version and the CUDA version it was compiled with
    # Using CUDA runtime version 11060 --> CUDA runtime version selected for this run

    # Validation of logs
    grep "NET/OFI Using CUDA runtime version ${CUDA_RUNTIME_VERSION}" ${TRAINING_LOG} || { echo "Runtime cuda text not found"; exit 1; }
    grep "NET/OFI Initializing aws-ofi-nccl" ${TRAINING_LOG} || { echo "aws-ofi-nccl is not working, please check if it is installed correctly"; exit 1; }
    grep "NET/OFI Configuring AWS-specific options" ${TRAINING_LOG} || { echo "AWS-specific options text not found"; exit 1; } 
    grep "Using network AWS Libfabric" ${TRAINING_LOG} || { echo "AWS Libfabric text not found"; exit 1; } 
    grep "busbw" ${TRAINING_LOG} || { echo "busbw text not found"; exit 1; } 
    grep "Avg bus bandwidth " ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; } 
    grep "NCCL version $NCCL_VERSION" ${TRAINING_LOG} || { echo "Text not found: NCCL version $NCCL_VERSION"; exit 1; }

    if [[ ${INSTANCE_TYPE} == "p4d.24xlarge" ]]; then
        grep "NET/AWS Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/AWS Libfabric/0/GDRDMA"; exit 1; }  
        grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }  
        grep "aws-ofi-nccl/xml/p4d-24xl-topo.xml" ${TRAINING_LOG} || { echo "Topology file not found"; exit 1; }      
    elif [[ ${INSTANCE_TYPE} == "p4de.24xlarge" ]]; then
        grep "NET/AWS Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/AWS Libfabric/0/GDRDMA"; exit 1; }
        grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
        grep "aws-ofi-nccl/xml/p4de-24xl-topo.xml" ${TRAINING_LOG} || { echo "Topology file not found"; exit 1; }  
    elif [[ ${INSTANCE_TYPE} == "p5.48xlarge" ]]; then
        grep "NET/AWS Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/AWS Libfabric/0/GDRDMA"; exit 1; }
        grep "NET/OFI Selected Provider is efa (found 32 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
        grep "aws-ofi-nccl/xml/p5.48xl-topo.xml" ${TRAINING_LOG} || { echo "Topology file not found"; exit 1; }  
    elif [[ ${INSTANCE_TYPE} == "p3dn.24xlarge" ]]; then
        grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }  
    fi
    echo "***************************** check_efa_nccl_all_reduce passed for cuda version ${CUDA_VERSION} *****************************"
else
    echo "***************************** check_efa_nccl_all_reduce failed for cuda version ${CUDA_VERSION} *****************************"
fi

To access the benchmark data, we can parse the last row of the table output from the multi-node all_reduce test:


benchmark=$(grep '1073741824' "${TRAINING_LOG}" | tail -n1 | awk '{print $12}')
if [[ -z "${benchmark}" ]]; then
  echo "benchmark variable is empty"
  exit 1
fi

echo "Benchmark throughput: ${benchmark}"
