Using EFA on the DLAMI - Deep Learning AMI

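Using EFA on the DLAMI

The following section describes how to use EFA to run multi-node applications on the AWS Deep Learning AMI.

Running multi-node applications with EFA

To run an application across a cluster of nodes, the following configuration is required.

Enable passwordless SSH

Select one node in your cluster as the leader node. The remaining nodes are referred to as member nodes.

  1. On the leader node, generate the RSA key pair.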

    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
  2. Change the permissions of the private key on the leader node.

    chmod 600 ~/.ssh/id_rsa
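  3. Copy the public key ~/.ssh/id_rsa.pub to each member node in the cluster and append it to ~/.ssh/authorized_keys on that node.

  4. You should now be able to log in to the member nodes directly from the leader node using their private IP addresses.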

    ssh <member private ip>
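  5. Disable strict host key checking (StrictHostKeyChecking) and enable agent forwarding on the leader node by adding the following to the ~/.ssh/config file on the leader node:

    Host *
        ForwardAgent yes
    Host *
        StrictHostKeyChecking no
  6. On Amazon Linux 2 instances, run the following command on the leader node to give the config file the correct permissions: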

    chmod 600 ~/.ssh/config
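
Step 3 above (appending the leader's public key on each member node) can be sketched as a small helper that is safe to run more than once. This is a minimal sketch, not part of the DLAMI: the function name append_key_once is hypothetical, and the demonstration uses throwaway files and a made-up key string in place of ~/.ssh/id_rsa.pub and ~/.ssh/authorized_keys.

```shell
# Append a public key to an authorized_keys file only if it is not already there.
# Usage: append_key_once <public key file> <authorized_keys file>
# (append_key_once is a hypothetical helper name, not an AWS or OpenSSH tool.)
append_key_once() {
    pubkey_file=$1
    auth_file=$2
    key=$(cat "$pubkey_file")
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -q quiet, -x whole-line match, -F fixed string (no regex interpretation)
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}

# Demonstration on temporary stand-in files; the key text is a placeholder.
demo_dir=$(mktemp -d)
echo "ssh-rsa AAAAB3placeholder leader@example" > "$demo_dir/id_rsa.pub"
append_key_once "$demo_dir/id_rsa.pub" "$demo_dir/authorized_keys"
append_key_once "$demo_dir/id_rsa.pub" "$demo_dir/authorized_keys"   # second call adds nothing
wc -l < "$demo_dir/authorized_keys"
```

On a real member node, the two arguments would be the copied id_rsa.pub and ~/.ssh/authorized_keys. Checking before appending keeps the file from accumulating duplicate entries if the setup is repeated.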

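Create a hosts file

On the leader node, create a hosts file to identify the nodes in the cluster. The hosts file must have an entry for each node in the cluster. Create a file ~/hosts and add each node using its private IP address, as follows:

localhost slots=8
<private ip of node 1> slots=8
<private ip of node 2> slots=8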
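
For larger clusters, the hosts file can be generated instead of typed by hand. The sketch below assumes every node offers the same number of slots (8 in the example above, one per GPU); the function name make_hostfile and the 10.0.0.x addresses are placeholders for illustration only.

```shell
# Print a hosts file: localhost first, then one line per member node.
# Usage: make_hostfile <slots per node> <member private ip>...
# (make_hostfile is a hypothetical helper name.)
make_hostfile() {
    slots=$1
    shift
    printf 'localhost slots=%s\n' "$slots"
    for ip in "$@"; do
        printf '%s slots=%s\n' "$ip" "$slots"
    done
}

# Example with two placeholder member addresses; substitute the private IPs
# of your own member nodes and redirect the output to ~/hosts.
make_hostfile 8 10.0.0.11 10.0.0.12
```

Running `make_hostfile 8 <ip1> <ip2> > ~/hosts` on the leader node would produce a file in the same layout as the example above.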

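NCCL tests

Note

These tests were run with EFA version 1.30.0 and OFI NCCL plugin 1.7.4.

Listed below is a subset of the NCCL tests provided by NVIDIA for testing functionality and performance across multiple compute nodes.

Supported instances: P3dn, P4, P5

Multi-node NCCL message transfer test

nccl_message_transfer is a simple test to ensure that the NCCL OFI plugin works as expected. The test validates the functionality of NCCL's connection establishment and data transfer APIs. When running NCCL applications with EFA, make sure you use the full path to mpirun, as shown in the example. Change the parameters np and N based on the number of instances and GPUs in your cluster. For more information, see the AWS OFI NCCL documentation.

The following nccl_message_transfer test is for a generic CUDA xx.x version. You can run the command for any CUDA version available on your Amazon EC2 instance by replacing the CUDA version in the script.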

$ /opt/amazon/openmpi/bin/mpirun -n 2 -N 1 --hostfile hosts \
    -x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:$LD_LIBRARY_PATH \
    --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
    /opt/aws-ofi-nccl/tests/nccl_message_transfer

Your output should look like the following. You can check the output to confirm that EFA is being used as the OFI provider.

INFO: Function: nccl_net_ofi_init Line: 1069: NET/OFI Selected Provider is efa (found 4 nics)
INFO: Function: nccl_net_ofi_init Line: 1160: NET/OFI Using transport protocol SENDRECV
INFO: Function: configure_ep_inorder Line: 261: NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
INFO: Function: configure_nccl_proto Line: 227: NET/OFI Setting NCCL_PROTO to "simple"
INFO: Function: main Line: 86: NET/OFI Process rank 1 started. NCCLNet device used on ip-172-31-13-179 is AWS Libfabric.
INFO: Function: main Line: 91: NET/OFI Received 4 network devices
INFO: Function: main Line: 111: NET/OFI Network supports communication using CUDA buffers. Dev: 3
INFO: Function: main Line: 118: NET/OFI Server: Listening on dev 3
INFO: Function: main Line: 131: NET/OFI Send connection request to rank 1
INFO: Function: main Line: 173: NET/OFI Send connection request to rank 0
INFO: Function: main Line: 137: NET/OFI Server: Start accepting requests
INFO: Function: main Line: 141: NET/OFI Successfully accepted connection from rank 1
INFO: Function: main Line: 145: NET/OFI Send 8 requests to rank 1
INFO: Function: main Line: 179: NET/OFI Server: Start accepting requests
INFO: Function: main Line: 183: NET/OFI Successfully accepted connection from rank 0
INFO: Function: main Line: 187: NET/OFI Rank 1 posting 8 receive buffers
INFO: Function: main Line: 161: NET/OFI Successfully sent 8 requests to rank 1
INFO: Function: main Line: 251: NET/OFI Got completions for 8 requests for rank 0
INFO: Function: main Line: 251: NET/OFI Got completions for 8 requests for rank 1
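
The provider check described above can be automated rather than read by eye. The following is a minimal sketch that assumes the log line has exactly the "Selected Provider is ... (found N nics)" wording shown in the sample output; selected_provider is a hypothetical helper name.

```shell
# Read log text on standard input and print "<provider> <nic count>" taken from
# the first "Selected Provider" line. Returns non-zero if no such line exists.
# (selected_provider is a hypothetical helper name.)
selected_provider() {
    sed -n 's|.*NET/OFI Selected Provider is \([^ ]*\) (found \([0-9]*\) nics).*|\1 \2|p' \
        | head -n 1 \
        | grep .
}

# Sample line copied from the output above.
sample='INFO: Function: nccl_net_ofi_init Line: 1069: NET/OFI Selected Provider is efa (found 4 nics)'
printf '%s\n' "$sample" | selected_provider
```

If the first word printed is anything other than efa, or the helper returns a non-zero status, the test did not run over EFA.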
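Multi-node NCCL performance test

To check NCCL performance with EFA, run the standard NCCL performance test available in the official NCCL-Tests repository. The DLAMI comes with this test already built for CUDA XX.X. You can run your own script with EFA in a similar way.

When constructing your own script, refer to the following guidance:

  • Use the full path to mpirun, as shown in the example, when running NCCL applications with EFA.

  • Change the parameters np and N based on the number of instances and GPUs in your cluster.

  • Add the NCCL_DEBUG=INFO flag and make sure that the logs indicate EFA usage with the line "Selected Provider is efa".

  • Set the training log location to parse for validation: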

    TRAINING_LOG="testEFA_$(date +"%N").log"
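
The guidance on np and N can be made concrete by deriving both values from the hosts file. In Open MPI, -n is the total number of processes and -N is the number of processes per node. The sketch below assumes each hosts file line has the form `<host> slots=<count>` and that every node has the same slot count; mpirun_sizes is a hypothetical helper name and 10.0.0.11 is a placeholder address.

```shell
# Print "<total ranks> <ranks per node>" for a hosts file whose lines look like
# "<host> slots=<count>". Assumes the same slot count on every node.
# (mpirun_sizes is a hypothetical helper name.)
mpirun_sizes() {
    awk '
        NF && $1 !~ /^#/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^slots=/) {
                    split($i, kv, "=")
                    total += kv[2]
                    per_node = kv[2]
                }
            }
        }
        END { print total + 0, per_node + 0 }
    ' "$1"
}

# Two-node example: the leader plus one placeholder member, 8 GPUs each.
hostfile=$(mktemp)
printf 'localhost slots=8\n10.0.0.11 slots=8\n' > "$hostfile"
mpirun_sizes "$hostfile"
```

For this two-node file the helper prints `16 8`, which corresponds to the `-n 16 -N 8` arguments used in the commands in this section.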

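Use the command watch nvidia-smi on any of the member nodes to monitor GPU usage. The following mpirun commands are for a generic CUDA xx.x version and depend on the operating system of your instance. You can run the commands for any CUDA version available on your Amazon EC2 instance by replacing the CUDA version in the script.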

  • Amazon Linux 2:

    $ /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 \
        -x NCCL_DEBUG=INFO --mca pml ^cm \
        -x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH \
        --hostfile hosts --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
        /usr/local/cuda-xx.x/efa/test-cuda-xx.x/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100 | tee ${TRAINING_LOG}

  • Ubuntu:

    $ /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 \
        -x NCCL_DEBUG=INFO --mca pml ^cm \
        -x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:$LD_LIBRARY_PATH \
        --hostfile hosts --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
        /usr/local/cuda-xx.x/efa/test-cuda-xx.x/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100 | tee ${TRAINING_LOG}

Your output should look like the following:

# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 9591 on ip-172-31-4-37 device 0 [0x10] NVIDIA A100-SXM4-40GB
# Rank 1 Group 0 Pid 9592 on ip-172-31-4-37 device 1 [0x10] NVIDIA A100-SXM4-40GB
# Rank 2 Group 0 Pid 9593 on ip-172-31-4-37 device 2 [0x20] NVIDIA A100-SXM4-40GB
# Rank 3 Group 0 Pid 9594 on ip-172-31-4-37 device 3 [0x20] NVIDIA A100-SXM4-40GB
# Rank 4 Group 0 Pid 9595 on ip-172-31-4-37 device 4 [0x90] NVIDIA A100-SXM4-40GB
# Rank 5 Group 0 Pid 9596 on ip-172-31-4-37 device 5 [0x90] NVIDIA A100-SXM4-40GB
# Rank 6 Group 0 Pid 9597 on ip-172-31-4-37 device 6 [0xa0] NVIDIA A100-SXM4-40GB
# Rank 7 Group 0 Pid 9598 on ip-172-31-4-37 device 7 [0xa0] NVIDIA A100-SXM4-40GB
# Rank 8 Group 0 Pid 10216 on ip-172-31-13-179 device 0 [0x10] NVIDIA A100-SXM4-40GB
# Rank 9 Group 0 Pid 10217 on ip-172-31-13-179 device 1 [0x10] NVIDIA A100-SXM4-40GB
# Rank 10 Group 0 Pid 10218 on ip-172-31-13-179 device 2 [0x20] NVIDIA A100-SXM4-40GB
# Rank 11 Group 0 Pid 10219 on ip-172-31-13-179 device 3 [0x20] NVIDIA A100-SXM4-40GB
# Rank 12 Group 0 Pid 10220 on ip-172-31-13-179 device 4 [0x90] NVIDIA A100-SXM4-40GB
# Rank 13 Group 0 Pid 10221 on ip-172-31-13-179 device 5 [0x90] NVIDIA A100-SXM4-40GB
# Rank 14 Group 0 Pid 10222 on ip-172-31-13-179 device 6 [0xa0] NVIDIA A100-SXM4-40GB
# Rank 15 Group 0 Pid 10223 on ip-172-31-13-179 device 7 [0xa0] NVIDIA A100-SXM4-40GB
ip-172-31-4-37:9591:9591 [0] NCCL INFO Bootstrap : Using ens32:172.31.4.37
ip-172-31-4-37:9591:9591 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-172-31-4-37:9591:9591 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-172-31-4-37:9591:9591 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.5+cuda12.2
...
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.7.4-aws
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Using CUDA runtime version 11070
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Configuring AWS-specific options
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Using CUDA runtime version 11070
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Configuring AWS-specific options
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Setting provider_filter to efa
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-172-31-4-37:9024:9062 [6] NCCL INFO NET/OFI Disabling NVLS support due to NCCL version 21602
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Setting provider_filter to efa
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Disabling NVLS support due to NCCL version 21602
ip-172-31-4-37:9020:9063 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
...
-----------------------------some output truncated-----------------------------------
#                                                             out-of-place                      in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float     sum      -1    11.02    0.00    0.00      0    11.04    0.00    0.00      0
           0             0     float     sum      -1    11.01    0.00    0.00      0    11.00    0.00    0.00      0
           0             0     float     sum      -1    11.02    0.00    0.00      0    11.02    0.00    0.00      0
           0             0     float     sum      -1    11.01    0.00    0.00      0    11.00    0.00    0.00      0
           0             0     float     sum      -1    11.02    0.00    0.00      0    11.02    0.00    0.00      0
         256             4     float     sum      -1    632.7    0.00    0.00      0    628.2    0.00    0.00      0
         512             8     float     sum      -1    627.4    0.00    0.00      0    629.6    0.00    0.00      0
        1024            16     float     sum      -1    632.2    0.00    0.00      0    631.7    0.00    0.00      0
        2048            32     float     sum      -1    631.0    0.00    0.00      0    634.2    0.00    0.00      0
        4096            64     float     sum      -1    623.3    0.01    0.01      0    633.6    0.01    0.01      0
        8192           128     float     sum      -1    635.1    0.01    0.01      0    633.5    0.01    0.01      0
       16384           256     float     sum      -1    634.8    0.03    0.02      0    637.0    0.03    0.02      0
       32768           512     float     sum      -1    647.9    0.05    0.05      0    636.8    0.05    0.05      0
       65536          1024     float     sum      -1    658.9    0.10    0.09      0    667.0    0.10    0.09      0
      131072          2048     float     sum      -1    671.9    0.20    0.18      0    662.9    0.20    0.19      0
      262144          4096     float     sum      -1    692.1    0.38    0.36      0    685.1    0.38    0.36      0
      524288          8192     float     sum      -1    715.3    0.73    0.69      0    696.6    0.75    0.71      0
     1048576         16384     float     sum      -1    734.6    1.43    1.34      0    729.2    1.44    1.35      0
     2097152         32768     float     sum      -1    785.9    2.67    2.50      0    794.5    2.64    2.47      0
     4194304         65536     float     sum      -1    837.2    5.01    4.70      0    837.6    5.01    4.69      0
     8388608        131072     float     sum      -1    929.2    9.03    8.46      0    931.4    9.01    8.44      0
    16777216        262144     float     sum      -1   1773.6    9.46    8.87      0   1772.8    9.46    8.87      0
    33554432        524288     float     sum      -1   2110.2   15.90   14.91      0   2116.1   15.86   14.87      0
    67108864       1048576     float     sum      -1   2650.9   25.32   23.73      0   2658.1   25.25   23.67      0
   134217728       2097152     float     sum      -1   3943.1   34.04   31.91      0   3945.9   34.01   31.89      0
   268435456       4194304     float     sum      -1   7216.5   37.20   34.87      0   7178.6   37.39   35.06      0
   536870912       8388608     float     sum      -1    13680   39.24   36.79      0    13676   39.26   36.80      0
[ 1073741824      16777216     float     sum      -1    25645   41.87   39.25      0    25497   42.11   39.48      0 ] <- Used For Benchmark
...
# Out of bounds values : 0 OK
# Avg bus bandwidth : 7.46044
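
The summary line at the end of the output can be pulled out of the log with a short helper. This is a sketch that assumes the summary line keeps the `# Avg bus bandwidth : <value>` layout shown above; avg_busbw is a hypothetical helper name, and the demonstration uses a temporary file in place of ${TRAINING_LOG}.

```shell
# Print the value from the last "Avg bus bandwidth" line of a log file.
# (avg_busbw is a hypothetical helper name.)
avg_busbw() {
    awk -F':' '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1" | tail -n 1
}

# Demonstration on a temporary file holding the two summary lines shown above.
log=$(mktemp)
printf '# Out of bounds values : 0 OK\n# Avg bus bandwidth : 7.46044\n' > "$log"
avg_busbw "$log"
```

Note that this figure is an average over every message size in the table, small messages included, which is why it is much lower than the busbw reported for the 1 GB row.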

To validate that the EFA tests returned a valid result, use the following checks:

  • Get the instance type using EC2 instance metadata:

    TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    INSTANCE_TYPE=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/instance-type)

  • Run the performance test.

  • Set the following parameters:

    CUDA_VERSION
    CUDA_RUNTIME_VERSION
    NCCL_VERSION

  • Validate the results as shown:

    RETURN_VAL=$?
    if [ ${RETURN_VAL} -eq 0 ]; then

        # Information on how the versions come from the logs
        #
        # ip-172-31-27-205:6427:6427 [0] NCCL INFO cudaDriverVersion 12020
        # NCCL version 2.16.2+cuda11.8
        # ip-172-31-27-205:6427:6820 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.7.1-aws
        # ip-172-31-27-205:6427:6820 [0] NCCL INFO NET/OFI Using CUDA runtime version 11060

        # cudaDriverVersion 12020 --> This is max supported cuda version by nvidia driver
        # NCCL version 2.16.2+cuda11.8 --> This is NCCL version compiled with cuda version
        # Using CUDA runtime version 11060 --> This is selected cuda version

        # Validation of logs
        grep "NET/OFI Using CUDA runtime version ${CUDA_RUNTIME_VERSION}" ${TRAINING_LOG} || { echo "Runtime cuda text not found"; exit 1; }
        grep "NET/OFI Initializing aws-ofi-nccl" ${TRAINING_LOG} || { echo "aws-ofi-nccl is not working, please check if it is installed correctly"; exit 1; }
        grep "NET/OFI Configuring AWS-specific options" ${TRAINING_LOG} || { echo "AWS-specific options text not found"; exit 1; }
        grep "Using network AWS Libfabric" ${TRAINING_LOG} || { echo "AWS Libfabric text not found"; exit 1; }
        grep "busbw" ${TRAINING_LOG} || { echo "busbw text not found"; exit 1; }
        grep "Avg bus bandwidth " ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; }
        grep "NCCL version $NCCL_VERSION" ${TRAINING_LOG} || { echo "Text not found: NCCL version $NCCL_VERSION"; exit 1; }

        # Instance-specific checks: GPUDirect RDMA, NIC count, and topology file
        if [[ ${INSTANCE_TYPE} == "p4d.24xlarge" ]]; then
            grep "NET/AWS Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/AWS Libfabric/0/GDRDMA"; exit 1; }
            grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
            grep "aws-ofi-nccl/xml/p4d-24xl-topo.xml" ${TRAINING_LOG} || { echo "Topology file not found"; exit 1; }
        elif [[ ${INSTANCE_TYPE} == "p4de.24xlarge" ]]; then
            grep "NET/AWS Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/AWS Libfabric/0/GDRDMA"; exit 1; }
            grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
            grep "aws-ofi-nccl/xml/p4de-24xl-topo.xml" ${TRAINING_LOG} || { echo "Topology file not found"; exit 1; }
        elif [[ ${INSTANCE_TYPE} == "p5.48xlarge" ]]; then
            grep "NET/AWS Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/AWS Libfabric/0/GDRDMA"; exit 1; }
            grep "NET/OFI Selected Provider is efa (found 32 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
            grep "aws-ofi-nccl/xml/p5.48xl-topo.xml" ${TRAINING_LOG} || { echo "Topology file not found"; exit 1; }
        elif [[ ${INSTANCE_TYPE} == "p3dn.24xlarge" ]]; then
            grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
        fi
        echo "***************************** check_efa_nccl_all_reduce passed for cuda version ${CUDA_VERSION} *****************************"
    else
        echo "***************************** check_efa_nccl_all_reduce failed for cuda version ${CUDA_VERSION} *****************************"
    fi

  • To access the benchmark data, parse the last row of the table output from the multi-node all_reduce test:

    benchmark=$(sudo cat ${TRAINING_LOG} | grep '1073741824' | tail -n1 | awk -F " " '{{print $12}}' | sed 's/ //' | sed 's/ 5e-07//')
    if [[ -z "${benchmark}" ]]; then
        echo "benchmark variable is empty"
        exit 1
    fi

    echo "Benchmark throughput: ${benchmark}"
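
The comments in the validation script describe where the version values appear in the log. As a rough sketch of how CUDA_RUNTIME_VERSION and NCCL_VERSION could be read from the log instead of set by hand, the helpers below extract them with sed. The function names are hypothetical, the sample log lines are copied from the script comments above, and the conversion assumes the usual CUDA encoding of major * 1000 + minor * 10 (so 11060 is CUDA 11.6).

```shell
# Print the CUDA runtime version code reported by aws-ofi-nccl, for example 11060.
# (All three function names here are hypothetical helpers.)
cuda_runtime_from_log() {
    sed -n 's/.*NET\/OFI Using CUDA runtime version \([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Print the NCCL version string, for example 2.16.2+cuda11.8.
# Requiring "+cuda" skips lines such as "due to NCCL version 21602".
nccl_version_from_log() {
    sed -n 's/.*NCCL version \([0-9.]*+cuda[0-9.]*\).*/\1/p' "$1" | head -n 1
}

# Convert a CUDA version code to major.minor (code = major * 1000 + minor * 10).
cuda_code_to_version() {
    echo "$(( $1 / 1000 )).$(( ($1 % 1000) / 10 ))"
}

# Demonstration on a temporary file standing in for ${TRAINING_LOG}.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
ip-172-31-27-205:6427:6427 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.16.2+cuda11.8
ip-172-31-27-205:6427:6820 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.7.1-aws
ip-172-31-27-205:6427:6820 [0] NCCL INFO NET/OFI Using CUDA runtime version 11060
EOF

CUDA_RUNTIME_VERSION=$(cuda_runtime_from_log "$sample_log")
NCCL_VERSION=$(nccl_version_from_log "$sample_log")
echo "CUDA_RUNTIME_VERSION=${CUDA_RUNTIME_VERSION} ($(cuda_code_to_version "$CUDA_RUNTIME_VERSION"))"
echo "NCCL_VERSION=${NCCL_VERSION}"
```

Values derived this way will always match the log they came from, so they are useful for reporting but not as an independent check. To confirm that a specific CUDA or NCCL version was used, set the expected values explicitly as described in the list above.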