Configuring EFA clients
Use the following procedures to set up your Lustre client to access an EFA-enabled FSx for Lustre file system.
Installing EFA modules and configuring interfaces
To access an FSx for Lustre file system using an EFA interface, you must install the Lustre EFA modules and configure EFA interfaces. EFA is currently supported on Lustre clients running Ubuntu 22 with kernel version 6.8 or higher. For steps to install the EFA driver, see Step 3: Install the EFA software in the Amazon EC2 User Guide.
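The kernel requirement above can be checked before you install anything. The following is a minimal sketch, not part of the official procedure: version_ge is a hypothetical helper name, and it reuses the same sort -V comparison that the configuration script applies to the EFA driver version.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), as the configuration script does.
version_ge() {
    [[ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" == "$2" ]]
}

# Keep only the leading numeric part of the kernel release,
# for example "6.8.0-1015-aws" becomes "6.8.0".
kernel_version="$(uname -r | sed 's/[^0-9.].*//')"
min_kernel_version="6.8"

if version_ge "$kernel_version" "$min_kernel_version"; then
    echo "Kernel $kernel_version meets the minimum $min_kernel_version"
else
    echo "Kernel $kernel_version is older than the minimum $min_kernel_version"
fi
```

The check only reports the result; whether to stop or continue on an older kernel is left to the caller.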
To configure your client instance on an EFA-enabled file system
Connect to your Amazon EC2 instance.
Copy the following script and save it as a file named configure-efa-fsx-lustre-client.sh.

    #!/bin/bash
    PATH=/sbin:/bin:/usr/sbin:/usr/bin

    echo "Started ${0} at $(date)"

    eth_intf="$(ip -br -4 a sh | grep $(hostname -i)/ | awk '{print $1}')"
    efa_version=$(modinfo efa | awk '/^version:/ {print $2}' | sed 's/[^0-9.]//g')
    min_efa_version="2.12.1"

    # Check the EFA driver version. Minimum v2.12.1 supported
    if [[ -z "$efa_version" ]]; then
        echo "Error: EFA driver not found"
        exit 1
    fi

    if [[ "$(printf '%s\n' "$min_efa_version" "$efa_version" | sort -V | head -n1)" != "$min_efa_version" ]]; then
        echo "Error: EFA driver version $efa_version does not meet the minimum requirement $min_efa_version"
        exit 1
    else
        echo "Using EFA driver version $efa_version"
    fi

    echo "Loading Lustre/EFA modules..."
    sudo /sbin/modprobe lnet
    sudo /sbin/modprobe kefalnd ipif_name="$eth_intf"
    sudo /sbin/modprobe ksocklnd
    sudo lnetctl lnet configure

    echo "Configuring TCP interface..."
    sudo lnetctl net del --net tcp 2> /dev/null
    sudo lnetctl net add --net tcp --if $eth_intf

    # For P5 instance type which supports 32 network cards,
    # by default add 8 EFA interfaces selecting every 4th device (1 per PCI bus)
    echo "Configuring EFA interface(s)..."
    instance_type="$(ec2-metadata --instance-type | awk '{ print $2 }')"
    num_efa_devices="$(ls -1 /sys/class/infiniband | wc -l)"
    echo "Found $num_efa_devices available EFA device(s)"

    if [[ "$instance_type" == "p5.48xlarge" || "$instance_type" == "p5e.48xlarge" ]]; then
        for intf in $(ls -1 /sys/class/infiniband | awk 'NR % 4 == 1'); do
            sudo lnetctl net add --net efa --if $intf --peer-credits 32
        done
    else
        # Other instances: Configure 2 EFA interfaces by default if the instance supports multiple network cards,
        # or 1 interface for single network card instances
        # Can be modified to add more interfaces if instance type supports it
        sudo lnetctl net add --net efa --if $(ls -1 /sys/class/infiniband | head -n1) --peer-credits 32
        if [[ $num_efa_devices -gt 1 ]]; then
            sudo lnetctl net add --net efa --if $(ls -1 /sys/class/infiniband | tail -n1) --peer-credits 32
        fi
    fi

    echo "Setting discovery and UDSP rule"
    sudo lnetctl set discovery 1
    sudo lnetctl udsp add --src efa --priority 0
    sudo /sbin/modprobe lustre
    sudo lnetctl net show
    echo "Added $(sudo lnetctl net show | grep -c '@efa') EFA interface(s)"
Run the EFA configuration script.
    sudo apt-get install amazon-ec2-utils cron
    sudo chmod +x configure-efa-fsx-lustre-client.sh
    ./configure-efa-fsx-lustre-client.sh
Use the following example commands to set up a cron job that automatically reconfigures EFA on client instances after they are rebooted:
    (sudo crontab -l 2>/dev/null; echo "@reboot /path/to/configure-efa-fsx-lustre-client.sh > /var/log/configure-efa-fsx-lustre-client-output.log") | sudo crontab -
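The cron command above appends a new @reboot line each time it is run, so running it twice leaves two entries. The following sketch shows one way to keep the entry unique. The function name with_reboot_entry is hypothetical, and the added 2>&1 (which also sends error output to the log) is an assumption that goes beyond the original command.

```shell
#!/bin/bash
# Hypothetical helper: reads an existing crontab on stdin and writes it back
# with exactly one @reboot entry for the script path ($1), logging to $2.
with_reboot_entry() {
    local script_path="$1" log_path="$2"
    # Drop any earlier @reboot entry for the same script.
    # grep exits non-zero when it prints nothing, which is fine here.
    grep -vF -- "@reboot $script_path" || true
    # Append a fresh entry; 2>&1 is an addition so errors also reach the log.
    echo "@reboot $script_path > $log_path 2>&1"
}

# Usage (shown as a comment because it modifies root's crontab):
#   sudo crontab -l 2>/dev/null \
#     | with_reboot_entry /path/to/configure-efa-fsx-lustre-client.sh \
#         /var/log/configure-efa-fsx-lustre-client-output.log \
#     | sudo crontab -
```

Because the old entry is filtered out before the new one is appended, the pipeline can be rerun safely, for example after moving the script to a different log location.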
Adding or removing EFA interfaces
Each FSx for Lustre file system has a maximum limit of 1024 EFA connections across all client instances.
The configure-efa-fsx-lustre-client.sh script automatically configures the number of Elastic Fabric Adapter (EFA) interfaces on an EC2 instance based on the instance type. For P5 instances (p5.48xlarge or p5e.48xlarge), it configures 8 EFA interfaces by default. For other instances with multiple network cards, it configures 2 EFA interfaces. For instances with a single network card, it configures 1 EFA interface. When a client instance connects to an FSx for Lustre file system, each EFA interface configured on the client instance counts against the 1024 EFA connection limit.
Client instances with more EFA interfaces typically support higher levels of throughput per client instance compared to client instances with fewer EFA interfaces. As long as you do not exceed the EFA connection limit, you can modify the script to increase or decrease the number of EFA interfaces per instance to optimize per-client throughput performance for your workloads.
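To make the trade-off concrete, the following sketch computes how many client instances fit within the 1024-connection limit for a given number of EFA interfaces per client. The function name max_clients is hypothetical, and the calculation assumes every client configures the same number of interfaces.

```shell
#!/bin/bash
# Hypothetical helper: prints how many client instances fit within the
# file system's EFA connection limit when each client configures $1 interfaces.
max_clients() {
    local interfaces_per_client="$1"
    local connection_limit=1024   # per-file-system limit stated in this section
    echo $(( connection_limit / interfaces_per_client ))
}

# Defaults used by the configuration script: 1, 2, or 8 interfaces per client.
for n in 1 2 8; do
    echo "$n interface(s) per client: up to $(max_clients "$n") clients"
done
```

With the script defaults, this works out to 1024 single-interface clients, 512 two-interface clients, or 128 P5 clients with 8 interfaces each.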
To add an EFA interface:
    sudo lnetctl net add --net efa --if device_name --peer-credits 32

Where device_name is a device listed in the output of ls -1 /sys/class/infiniband.
To delete an EFA interface:
    sudo lnetctl net del --net efa --if device_name
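When adding several interfaces, it can help to preview the commands before running them. The sketch below prints, without executing, one add command for every n-th device, using the same NR % n == 1 selection that the configuration script applies with n = 4 on P5 instances. The function names and the device names dev0 through dev7 are hypothetical; on an instance you would feed in the output of ls -1 /sys/class/infiniband instead.

```shell
#!/bin/bash
# Hypothetical helper: reads device names on stdin and prints every $1-th one,
# starting with the first. n = 1 is special-cased because NR % 1 is always 0.
every_nth_device() {
    awk -v n="$1" 'n == 1 || NR % n == 1'
}

# Hypothetical helper: prints (does not run) one "net add" command per device.
print_add_commands() {
    while read -r dev; do
        echo "sudo lnetctl net add --net efa --if $dev --peer-credits 32"
    done
}

# Preview with placeholder device names; selects the 1st and 5th entries.
printf 'dev%d\n' 0 1 2 3 4 5 6 7 | every_nth_device 4 | print_add_commands
```

Printing rather than running the commands lets you count the resulting interfaces against the connection limit first. To apply them, review the output and run the lines you want.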
Installing the GDS driver
To use GPUDirect Storage (GDS) on FSx for Lustre, you must use an Amazon EC2 P5 or G6 client instance and NVIDIA GDS driver release 2.24.2 or higher.
To install the NVIDIA GPUDirect Storage driver on your client instance
Clone the NVIDIA/gds-nvidia-fs repository, which is available on GitHub.

    git clone https://github.com/NVIDIA/gds-nvidia-fs.git
After cloning the repository, use the following commands to build the driver:

    cd gds-nvidia-fs/src/
    export NVFS_MAX_PEER_DEVS=128
    export NVFS_MAX_PCI_DEPTH=16
    sudo -E make
    sudo insmod nvidia-fs.ko
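After insmod returns, you may want to confirm that the module is actually loaded. The following is a minimal sketch, assuming the module appears in lsmod output as nvidia_fs (the kernel reports module names with underscores in place of hyphens). The function name module_listed is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when the named module appears in the first
# column of lsmod-style output read on stdin.
module_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Assumption: nvidia-fs.ko is listed as nvidia_fs once loaded.
if lsmod 2>/dev/null | module_listed nvidia_fs; then
    echo "nvidia_fs module is loaded"
else
    echo "nvidia_fs module is not loaded"
fi
```

Matching on the exact first column avoids false positives from other modules whose names merely contain the same string.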