Get started with EFA and NCCL for ML workloads on Amazon EC2
The NVIDIA Collective Communications Library (NCCL) is a library of standard collective
communication routines for multiple GPUs across a single node or multiple nodes. NCCL
can be used together with EFA, Libfabric, and MPI to support various machine learning
workloads. For more information, see the NCCL website.
The following steps help you to get started with EFA and NCCL using a base AMI for one of the supported operating systems.
Note
- Only the p3dn.24xlarge, p4d.24xlarge, and p5.48xlarge instance types are supported.
- Only Amazon Linux 2 and Ubuntu 20.04/22.04 base AMIs are supported.
- Only NCCL 2.4.2 and later is supported with EFA.
For more information about running machine learning workloads with EFA and NCCL using an AWS Deep Learning AMI, see Using EFA on the DLAMI in the AWS Deep Learning AMIs Developer Guide.
Steps
- Step 1: Prepare an EFA-enabled security group
- Step 2: Launch a temporary instance
- Step 3: Install Nvidia GPU drivers, Nvidia CUDA toolkit, and cuDNN
- Step 4: Install GDRCopy
- Step 5: Install the EFA software
- Step 6: Install NCCL
- Step 7: Install the aws-ofi-nccl plugin
- Step 8: Install the NCCL tests
- Step 9: Test your EFA and NCCL configuration
- Step 10: Install your machine learning applications
- Step 11: Create an EFA and NCCL-enabled AMI
- Step 12: Terminate the temporary instance
- Step 13: Launch EFA and NCCL-enabled instances into a cluster placement group
- Step 14: Enable passwordless SSH
Step 1: Prepare an EFA-enabled security group
An EFA requires a security group that allows all inbound and outbound traffic to and from the security group itself. The following procedure creates a security group that allows all inbound and outbound traffic to and from itself, and that allows inbound SSH traffic from any IPv4 address for SSH connectivity.
Important
This security group is intended for testing purposes only. For your production environments, we recommend that you create an inbound SSH rule that allows traffic only from the IP address from which you are connecting, such as the IP address of your computer, or a range of IP addresses in your local network.
For other scenarios, see Security group rules for different use cases.
To create an EFA-enabled security group
- Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
- In the navigation pane, choose Security Groups and then choose Create security group.
- In the Create security group window, do the following:
  - For Security group name, enter a descriptive name for the security group, such as EFA-enabled security group.
  - (Optional) For Description, enter a brief description of the security group.
  - For VPC, select the VPC into which you intend to launch your EFA-enabled instances.
  - Choose Create security group.
- Select the security group that you created, and on the Details tab, copy the Security group ID.
- With the security group still selected, choose Actions, Edit inbound rules, and then do the following:
  - Choose Add rule.
  - For Type, choose All traffic.
  - For Source type, choose Custom and paste the security group ID that you copied into the field.
  - Choose Add rule.
  - For Type, choose SSH.
  - For Source type, choose Anywhere-IPv4.
  - Choose Save rules.
- With the security group still selected, choose Actions, Edit outbound rules, and then do the following:
  - Choose Add rule.
  - For Type, choose All traffic.
  - For Destination type, choose Custom and paste the security group ID that you copied into the field.
  - Choose Save rules.
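The console procedure above can also be scripted. The following is a minimal sketch using the AWS CLI, assuming the CLI is installed and configured with permission to manage security groups. The function name create_efa_security_group, the group name efa-enabled-sg, and the description are hypothetical and not part of this guide.

```shell
# Sketch only: create a security group that allows all traffic to and from
# itself, plus inbound SSH from any IPv4 address (for testing only).
# Usage: create_efa_security_group <vpc-id>   (prints the new group ID)
create_efa_security_group() {
  vpc_id=$1

  # Hypothetical group name and description.
  sg_id=$(aws ec2 create-security-group \
    --group-name efa-enabled-sg \
    --description "EFA-enabled security group" \
    --vpc-id "$vpc_id" \
    --query GroupId --output text) || return 1

  # Inbound: all traffic from the security group itself.
  aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$sg_id}]" \
    >/dev/null || return 1

  # Inbound: SSH from any IPv4 address (narrow this CIDR outside of testing).
  aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
    --protocol tcp --port 22 --cidr 0.0.0.0/0 >/dev/null || return 1

  # Outbound: all traffic to the security group itself.
  aws ec2 authorize-security-group-egress --group-id "$sg_id" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$sg_id}]" \
    >/dev/null || return 1

  echo "$sg_id"
}
```

For example, `SG_ID=$(create_efa_security_group vpc-0123456789abcdef0)` stores the new group ID for later steps. The VPC ID shown is a placeholder.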
Step 2: Launch a temporary instance
Launch a temporary instance that you can use to install and configure the EFA software components. You use this instance to create an EFA-enabled AMI from which you can launch your EFA-enabled instances.
To launch a temporary instance
- Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
- In the navigation pane, choose Instances, and then choose Launch Instances to open the new launch instance wizard.
- (Optional) In the Name and tags section, provide a name for the instance, such as EFA-instance. The name is assigned to the instance as a resource tag (Name=EFA-instance).
- In the Application and OS Images section, select an AMI for one of the supported operating systems. Only Amazon Linux 2, Ubuntu 20.04, and Ubuntu 22.04 are supported.
- In the Instance type section, select either p3dn.24xlarge, p4d.24xlarge, or p5.48xlarge.
- In the Key pair section, select the key pair to use for the instance.
- In the Network settings section, choose Edit, and then do the following:
  - For Subnet, choose the subnet in which to launch the instance. If you do not select a subnet, you can't enable the instance for EFA.
  - For Firewall (security groups), choose Select existing security group, and then select the security group that you created in the previous step.
  - Expand the Advanced network configuration section, and for Elastic Fabric Adapter, select Enable.
- In the Storage section, configure the volumes as needed.
  Note
  You must provision an additional 10 to 20 GiB of storage for the Nvidia CUDA Toolkit. If you do not provision enough storage, you will receive an insufficient disk space error when attempting to install the Nvidia drivers and CUDA toolkit.
- In the Summary panel on the right, choose Launch instance.
Step 3: Install Nvidia GPU drivers, Nvidia CUDA toolkit, and cuDNN
Step 4: Install GDRCopy
Install GDRCopy to improve the performance of Libfabric. For more information about GDRCopy, see the GDRCopy repository.
Step 5: Install the EFA software
Install the EFA-enabled kernel, EFA drivers, Libfabric, and Open MPI stack that is required to support EFA on your temporary instance.
To install the EFA software
- Connect to the instance you launched. For more information, see Connect to your Linux instance using SSH.
- Download the EFA software installation files. The software installation files are packaged into a compressed tarball (.tar.gz) file. To download the latest stable version, use the following command.
  $ curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.34.0.tar.gz
  You can also get the latest version by replacing the version number with latest in the preceding command.
- (Optional) Verify the authenticity and integrity of the EFA tarball (.tar.gz) file.
  We recommend that you do this to verify the identity of the software publisher and to check that the file has not been altered or corrupted since it was published. If you do not want to verify the tarball file, skip this step.
  Note
  Alternatively, if you prefer to verify the tarball file by using an MD5 or SHA256 checksum instead, see Verify the EFA installer using a checksum.
  - Download the public GPG key and import it into your keyring.
    $ wget https://efa-installer.amazonaws.com/aws-efa-installer.key && gpg --import aws-efa-installer.key
    The command should return a key value. Make a note of the key value, because you need it in the next step.
  - Verify the GPG key's fingerprint. Run the following command and specify the key value from the previous step.
    $ gpg --fingerprint key_value
    The command should return a fingerprint that is identical to 4E90 91BC BB97 A96B 26B1 5E59 A054 80B1 DD2D 3CCC. If the fingerprint does not match, don't run the EFA installation script, and contact AWS Support.
  - Download the signature file and verify the signature of the EFA tarball file.
    $ wget https://efa-installer.amazonaws.com/aws-efa-installer-1.34.0.tar.gz.sig && gpg --verify ./aws-efa-installer-1.34.0.tar.gz.sig
    The following shows example output.
    gpg: Signature made Wed 29 Jul 2020 12:50:13 AM UTC using RSA key ID DD2D3CCC
    gpg: Good signature from "Amazon EC2 EFA <ec2-efa-maintainers@amazon.com>"
    gpg: WARNING: This key is not certified with a trusted signature!
    gpg: There is no indication that the signature belongs to the owner.
    Primary key fingerprint: 4E90 91BC BB97 A96B 26B1 5E59 A054 80B1 DD2D 3CCC
    If the result includes Good signature, and the fingerprint matches the fingerprint returned in the previous step, proceed to the next step. If not, don't run the EFA installation script, and contact AWS Support.
- Extract the files from the compressed .tar.gz file and navigate into the extracted directory.
  $ tar -xf aws-efa-installer-1.34.0.tar.gz && cd aws-efa-installer
- Run the EFA software installation script.
  Note
  From EFA 1.30.0, both Open MPI 4 and Open MPI 5 are installed by default. Unless you need Open MPI 5, we recommend that you install only Open MPI 4. The following command installs Open MPI 4 only. If you want to install Open MPI 4 and Open MPI 5, remove --mpi=openmpi4.
  $ sudo ./efa_installer.sh -y --mpi=openmpi4
  Libfabric is installed in the /opt/amazon/efa directory, while Open MPI is installed in the /opt/amazon/openmpi directory.
- If the EFA installer prompts you to reboot the instance, do so and then reconnect to the instance. Otherwise, log out of the instance and then log back in to complete the installation.
- Confirm that the EFA software components were successfully installed.
  $ fi_info -p efa -t FI_EP_RDM
  The command should return information about the Libfabric EFA interfaces. The following example shows the command output.
  - p3dn.24xlarge with single network interface
    provider: efa
        fabric: EFA-fe80::94:3dff:fe89:1b70
        domain: efa_0-rdm
        version: 2.0
        type: FI_EP_RDM
        protocol: FI_PROTO_EFA
  - p4d.24xlarge and p5.48xlarge with multiple network interfaces
    provider: efa
        fabric: EFA-fe80::c6e:8fff:fef6:e7ff
        domain: efa_0-rdm
        version: 111.0
        type: FI_EP_RDM
        protocol: FI_PROTO_EFA
    provider: efa
        fabric: EFA-fe80::c34:3eff:feb2:3c35
        domain: efa_1-rdm
        version: 111.0
        type: FI_EP_RDM
        protocol: FI_PROTO_EFA
    provider: efa
        fabric: EFA-fe80::c0f:7bff:fe68:a775
        domain: efa_2-rdm
        version: 111.0
        type: FI_EP_RDM
        protocol: FI_PROTO_EFA
    provider: efa
        fabric: EFA-fe80::ca7:b0ff:fea6:5e99
        domain: efa_3-rdm
        version: 111.0
        type: FI_EP_RDM
        protocol: FI_PROTO_EFA
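When scripting this check, it can help to reduce the fi_info output to a single number. The following is a minimal sketch, assuming the output has one "domain: efa_N-rdm" line per EFA interface, as in the examples above. The function name count_efa_domains is hypothetical.

```shell
# Sketch only: count the EFA domains reported by fi_info.
# Reads fi_info output on standard input and prints the number of lines
# that contain "domain: efa_".
count_efa_domains() {
  # grep exits non-zero when nothing matches, but it still prints 0.
  grep -c 'domain: efa_' || true
}

# Usage on the instance:
#   fi_info -p efa -t FI_EP_RDM | count_efa_domains
```

A result of 0 suggests that the EFA provider is not available and that the installation should be rechecked. For the example outputs shown above, the count would be 1 for the single-interface case and 4 for the multiple-interface case.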
Step 6: Install NCCL
Install NCCL. For more information about NCCL, see the NCCL repository.
To install NCCL
- Navigate to the /opt directory.
  $ cd /opt
- Clone the official NCCL repository to the instance and navigate into the local cloned repository.
  $ sudo git clone https://github.com/NVIDIA/nccl.git && cd nccl
- Build and install NCCL and specify the CUDA installation directory.
  $ sudo make -j src.build CUDA_HOME=/usr/local/cuda
Step 7: Install the aws-ofi-nccl plugin
The aws-ofi-nccl plugin maps NCCL's connection-oriented transport APIs to Libfabric's connection-less reliable interface. This enables you to use Libfabric as a network provider while running NCCL-based applications. For more information about the aws-ofi-nccl plugin, see the aws-ofi-nccl repository.
To install the aws-ofi-nccl plugin
- Navigate to your home directory.
  $ cd $HOME
- Install the required utilities.
  - Amazon Linux 2
    $ sudo yum install hwloc-devel
  - Ubuntu
    $ sudo apt-get install libhwloc-dev
- Download the aws-ofi-nccl plugin files. The files are packaged into a compressed tarball (.tar.gz).
  $ wget https://github.com/aws/aws-ofi-nccl/releases/download/v1.11.0-aws/aws-ofi-nccl-1.11.0-aws.tar.gz
- Extract the files from the compressed .tar.gz file and navigate into the extracted directory.
  $ tar -xf aws-ofi-nccl-1.11.0-aws.tar.gz && cd aws-ofi-nccl-1.11.0-aws
- To generate the make files, run the configure script and specify the MPI, Libfabric, NCCL, and CUDA installation directories.
  $ ./configure --prefix=/opt/aws-ofi-nccl --with-mpi=/opt/amazon/openmpi \
      --with-libfabric=/opt/amazon/efa \
      --with-cuda=/usr/local/cuda \
      --enable-platform-aws
- Add the Open MPI directory to the PATH variable.
  $ export PATH=/opt/amazon/openmpi/bin/:$PATH
- Install the aws-ofi-nccl plugin.
  $ make && sudo make install
Step 8: Install the NCCL tests
Install the NCCL tests. The NCCL tests enable you to confirm that NCCL is properly installed and that it is operating as expected. For more information about the NCCL tests, see the nccl-tests repository.
To install the NCCL tests
- Navigate to your home directory.
  $ cd $HOME
- Clone the official nccl-tests repository to the instance and navigate into the local cloned repository.
  $ git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
- Add the Libfabric directory to the LD_LIBRARY_PATH variable.
  - Amazon Linux 2
    $ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH
  - Ubuntu
    $ export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
- Install the NCCL tests and specify the MPI, NCCL, and CUDA installation directories.
  $ make MPI=1 MPI_HOME=/opt/amazon/openmpi NCCL_HOME=/opt/nccl/build CUDA_HOME=/usr/local/cuda
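Because the Libfabric library directory differs between Amazon Linux 2 (lib64) and Ubuntu (lib), a setup script shared across both can pick the directory at run time. The following is a minimal sketch, assuming that the presence of a lib64 directory under the EFA installation root identifies the Amazon Linux 2 layout. The function name efa_lib_dir is hypothetical.

```shell
# Sketch only: print the Libfabric library directory under the EFA
# installation root (default: /opt/amazon/efa).
# Assumption: lib64 exists on Amazon Linux 2; otherwise fall back to lib.
efa_lib_dir() {
  root=${1:-/opt/amazon/efa}
  if [ -d "$root/lib64" ]; then
    echo "$root/lib64"
  else
    echo "$root/lib"
  fi
}

# Usage (before building the NCCL tests):
#   export LD_LIBRARY_PATH="$(efa_lib_dir):$LD_LIBRARY_PATH"
```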
Step 9: Test your EFA and NCCL configuration
Run a test to ensure that your temporary instance is properly configured for EFA and NCCL.
To test your EFA and NCCL configuration
- Create a host file that specifies the hosts on which to run the tests. The following command creates a host file named my-hosts that includes a reference to the instance itself.
- Run the test and specify the host file (--hostfile) and the number of GPUs to use (-n). The following command runs the all_reduce_perf test on 8 GPUs on the instance itself, and specifies the following environment variables.
  - FI_EFA_USE_DEVICE_RDMA=1: (p4d.24xlarge only) uses the device's RDMA functionality for one-sided and two-sided transfer.
  - NCCL_DEBUG=INFO: enables detailed debugging output. You can also specify VERSION to print only the NCCL version at the start of the test, or WARN to receive only error messages.
  For more information about the NCCL test arguments, see the NCCL Tests README in the official nccl-tests repository.
  - p3dn.24xlarge
    $ /opt/amazon/openmpi/bin/mpirun \
        -x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
        -x NCCL_DEBUG=INFO \
        --hostfile my-hosts -n 8 -N 8 \
        --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
        $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
  - p4d.24xlarge and p5.48xlarge
    $ /opt/amazon/openmpi/bin/mpirun \
        -x FI_EFA_USE_DEVICE_RDMA=1 \
        -x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
        -x NCCL_DEBUG=INFO \
        --hostfile my-hosts -n 8 -N 8 \
        --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
        $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
- You can confirm that EFA is active as the underlying provider for NCCL when the NCCL_DEBUG log is printed.
  ip-192-168-2-54:14:14 [0] NCCL INFO NET/OFI Selected Provider is efa*
  The following additional information is displayed when using a p4d.24xlarge instance.
  ip-192-168-2-54:14:14 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/ec2-user/install/plugin/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
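Two parts of this test are easy to script: building the host file and checking the log for the provider line. The following is a minimal sketch, assuming an Open MPI host file with one host per line and a log captured with NCCL_DEBUG=INFO. The function names add_host and uses_efa_provider and the log file name nccl.log are hypothetical.

```shell
# Sketch only: helpers for the test run described above.

# Append a host (IP address or hostname) to an Open MPI host file,
# one host per line, skipping hosts that are already listed.
add_host() {
  hostfile=$1
  host=$2
  touch "$hostfile"
  grep -qxF -- "$host" "$hostfile" || printf '%s\n' "$host" >> "$hostfile"
}

# Succeed if an NCCL_DEBUG=INFO log read from standard input shows that
# the EFA provider was selected.
uses_efa_provider() {
  grep -q 'NET/OFI Selected Provider is efa'
}

# Usage:
#   add_host my-hosts <private IP address of this instance>
#   <mpirun command from the step above> 2>&1 | tee nccl.log
#   uses_efa_provider < nccl.log && echo "EFA is the NCCL network provider"
```

The same add_host call can be repeated for each member node when the test is later run across a cluster.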
Step 10: Install your machine learning applications
Install the machine learning applications on the temporary instance. The installation procedure varies depending on the specific machine learning application. For more information about installing software on your Linux instance, see Manage software on your Amazon Linux 2 instance.
Note
Refer to your machine learning application’s documentation for installation instructions.
Step 11: Create an EFA and NCCL-enabled AMI
After you have installed the required software components, you create an AMI that you can reuse to launch your EFA-enabled instances.
To create an AMI from your temporary instance
- Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
- In the navigation pane, choose Instances.
- Select the temporary instance that you created and choose Actions, Image, Create image.
- For Create image, do the following:
  - For Image name, enter a descriptive name for the AMI.
  - (Optional) For Image description, enter a brief description of the purpose of the AMI.
  - Choose Create image.
- In the navigation pane, choose AMIs.
- Locate the AMI that you created in the list. Wait for the status to change from pending to available before continuing to the next step.
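The same AMI creation can be done from the command line. The following is a minimal sketch using the AWS CLI, assuming the CLI is installed and configured with permission to create images. The function name create_efa_ami is hypothetical.

```shell
# Sketch only: create an AMI from the temporary instance and wait until
# the AMI is available.
# Usage: create_efa_ami <instance-id> <image-name>   (prints the AMI ID)
create_efa_ami() {
  instance_id=$1
  image_name=$2

  image_id=$(aws ec2 create-image \
    --instance-id "$instance_id" \
    --name "$image_name" \
    --query ImageId --output text) || return 1

  # Blocks until the AMI status changes from pending to available.
  aws ec2 wait image-available --image-ids "$image_id" || return 1

  echo "$image_id"
}
```

Waiting on the image state in the script mirrors the console instruction to continue only after the status is available.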
Step 12: Terminate the temporary instance
At this point, you no longer need the temporary instance that you launched. You can terminate the instance to stop incurring charges for it.
To terminate the temporary instance
- Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
- In the navigation pane, choose Instances.
- Select the temporary instance that you created and then choose Actions, Instance state, Terminate instance.
- When prompted for confirmation, choose Terminate.
Step 13: Launch EFA and NCCL-enabled instances into a cluster placement group
Launch your EFA and NCCL-enabled instances into a cluster placement group using the EFA-enabled AMI and the EFA-enabled security group that you created earlier.
Note
- It is not an absolute requirement to launch your EFA-enabled instances into a cluster placement group. However, we do recommend running your EFA-enabled instances in a cluster placement group as it launches the instances into a low-latency group in a single Availability Zone.
- To ensure that capacity is available as you scale your cluster’s instances, you can create a Capacity Reservation for your cluster placement group. For more information, see Create Capacity Reservations in cluster placement groups.
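As one way to carry out this launch, the following is a minimal sketch using the AWS CLI, assuming the CLI is installed and configured, and that the AMI, security group, subnet, and key pair already exist. The function name launch_efa_cluster and the placement group name efa-cluster-pg are hypothetical. The sketch attaches a single EFA interface per instance; instance types with multiple network cards may need additional interfaces that are not shown here.

```shell
# Sketch only: create a cluster placement group and launch EFA-enabled
# instances into it. Prints the IDs of the launched instances.
# Usage: launch_efa_cluster <ami-id> <instance-type> <count> <key-name> <sg-id> <subnet-id>
launch_efa_cluster() {
  ami_id=$1
  instance_type=$2
  count=$3
  key_name=$4
  sg_id=$5
  subnet_id=$6
  pg_name=efa-cluster-pg   # hypothetical placement group name

  aws ec2 create-placement-group \
    --group-name "$pg_name" --strategy cluster >/dev/null || return 1

  aws ec2 run-instances \
    --image-id "$ami_id" \
    --instance-type "$instance_type" \
    --count "$count" \
    --key-name "$key_name" \
    --placement "GroupName=$pg_name" \
    --network-interfaces "DeviceIndex=0,InterfaceType=efa,Groups=$sg_id,SubnetId=$subnet_id" \
    --query 'Instances[].InstanceId' --output text
}
```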
Step 14: Enable passwordless SSH
To enable your applications to run across all of the instances in your cluster, you must enable passwordless SSH access from the leader node to the member nodes. The leader node is the instance from which you run your applications. The remaining instances in the cluster are the member nodes.
To enable passwordless SSH between the instances in the cluster
- Select one instance in the cluster as the leader node, and connect to it.
- Disable StrictHostKeyChecking and enable ForwardAgent on the leader node. Open ~/.ssh/config using your preferred text editor and add the following.
  Host *
      ForwardAgent yes
  Host *
      StrictHostKeyChecking no
- Generate an RSA key pair.
  $ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
  The key pair is created in the $HOME/.ssh/ directory.
- Change the permissions of the private key and the SSH configuration file on the leader node.
  $ chmod 600 ~/.ssh/id_rsa
  $ chmod 600 ~/.ssh/config
- Open ~/.ssh/id_rsa.pub using your preferred text editor and copy the key.
- For each member node in the cluster, do the following:
  - Connect to the instance.
  - Open ~/.ssh/authorized_keys using your preferred text editor and add the public key that you copied earlier.
- To test that the passwordless SSH is functioning as expected, connect to your leader node and run the following command.
  $ ssh member_node_private_ip
  You should connect to the member node without being prompted for a key or password.
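The file edits in this procedure can be scripted instead of made in a text editor. The following is a minimal sketch, assuming a POSIX shell on each node. The function names write_leader_ssh_config and authorize_key are hypothetical, and the .ssh directory is passed as an argument so that the functions can be tried against a scratch directory first.

```shell
# Sketch only: the file-editing parts of the procedure above.

# Append the leader node's SSH client options to <ssh-dir>/config.
write_leader_ssh_config() {
  ssh_dir=$1
  mkdir -p "$ssh_dir"
  cat >> "$ssh_dir/config" <<'EOF'
Host *
    ForwardAgent yes
Host *
    StrictHostKeyChecking no
EOF
  chmod 600 "$ssh_dir/config"
}

# Append a public key to <ssh-dir>/authorized_keys on a member node,
# skipping the key if it is already present.
authorize_key() {
  ssh_dir=$1
  public_key=$2
  mkdir -p "$ssh_dir"
  touch "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"
  grep -qxF -- "$public_key" "$ssh_dir/authorized_keys" ||
    printf '%s\n' "$public_key" >> "$ssh_dir/authorized_keys"
}

# Usage:
#   On the leader node:  write_leader_ssh_config "$HOME/.ssh"
#   On each member node: authorize_key "$HOME/.ssh" "<contents of the leader's id_rsa.pub>"
```

Making authorize_key skip keys that are already present means the function can be rerun safely when nodes are added to the cluster.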