GPUDirect Storage Support

NVIDIA GPUDirect Storage (GDS) is a part of the NVIDIA Magnum IO SDK that enables direct memory access between a GPU and an RDMA NIC.

The benefits of GDS are similar to those of RDMA: higher throughput, lower latency and reduced CPU utilization. NVIDIA provides an overview of GDS in their blog.

BeeGFS 7.3.0 implements support for GDS in the BeeGFS client and beegfs-storage services.

Clients

The specific client requirements for GDS are dependent upon the NVIDIA requirements. Currently, these are:

  • Ubuntu 18, Ubuntu 20 or RHEL 8. x86_64 architecture only.

  • CUDA 11.5.1 - the first version of CUDA to have GDS support for BeeGFS

  • Mellanox OFED 5.4

  • nvidia-fs 2.9.5-1

  • NVIDIA datacenter class GPU, Tesla or newer architecture

  • Mellanox ConnectX-5 or newer RDMA NIC

MOFED, CUDA and nvidia-fs should be installed before installing BeeGFS on the client node. See https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html

The BeeGFS client is set up as usual with the /opt/beegfs/sbin/beegfs-setup-client utility.

GDS is enabled in the client by adding the NVFS_H_PATH parameter to buildArgs in the file /etc/beegfs/beegfs-client-autobuild.conf. NVFS_H_PATH points to the directory that contains nvfs.h, which is under the MOFED source directory. OFED_INCLUDE_PATH should be configured to point at the MOFED headers.

Example:

buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include NVFS_H_PATH=/usr/src/mlnx-ofed-kernel-5.4/drivers/nvme/host

It doesn’t matter that nvfs.h is under the nvme driver directory. BeeGFS client and the nvme driver use the same version of that header file.
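To find a candidate value for NVFS_H_PATH, the MOFED source tree can be searched for nvfs.h. The following is a minimal sketch: the helper name find_nvfs_h_dirs is hypothetical, and /usr/src is only assumed to be where the MOFED sources live, as in the example above.

```shell
# Hypothetical helper: print every directory under the given root that
# contains nvfs.h. Each result is a candidate value for NVFS_H_PATH.
find_nvfs_h_dirs() {
    # $1: directory to search (assumed here to be the MOFED source root)
    find "$1" -type f -name nvfs.h -exec dirname {} \; 2>/dev/null | sort -u
}

# Usage (the search root is an assumption; adjust it to the local install):
#   find_nvfs_h_dirs /usr/src
```

If more than one directory is printed, any of them should do as long as the header matches the installed MOFED version.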

Make sure to rebuild the kernel module by running

# /etc/init.d/beegfs-client rebuild

There are no additional beegfs-client.conf configurations for GDS. GDS does integrate with client multi-rail support.

Servers

There are no CUDA, MOFED or OS requirements for GDS support on the server nodes other than what is required for RDMA Support.

All BeeGFS nodes must be configured for RDMA.

GDS support is enabled for BeeGFS servers through addition of the BEEGFS_NVFS=1 parameter to the build. For example:

$ make package-rpm PACKAGE_DIR=packages BEEGFS_NVFS=1 RPMBUILD_OPTS="-D 'MAKE_CONCURRENCY <n>'"

Official builds of BeeGFS server packages ship with GDS support enabled starting with BeeGFS version 7.3.0.

BeeGFS server packages with GDS are installed and configured as normal. There are no additional configurations related to GDS support.

Verifying BeeGFS support in GDS

Start beegfs-client, verify that the file system is mounted, and use beegfs-net to confirm that the connections to the storage and metadata services use RDMA.

The client, beegfs-storage and beegfs-meta logs should include the message “Built with NVFS RDMA support”. If that message does not appear, rebuild with NVFS support and install the appropriate BeeGFS packages.

BeeGFS GDS support is initially verified through the gdscheck utility.
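The log check can be scripted. This is a minimal sketch: the helper name has_nvfs_support is hypothetical, and the log locations in the usage comments are assumptions that depend on the logging settings of each service.

```shell
# Hypothetical helper: succeed if the given log file contains the message
# that BeeGFS prints when it was built with NVFS support.
has_nvfs_support() {
    # -F: match the message as a fixed string; -q: report via exit status only
    grep -qF 'Built with NVFS RDMA support' "$1"
}

# Usage (log locations are assumptions; check the log file configured for
# each service, and the kernel log for the client module):
#   has_nvfs_support /var/log/beegfs-storage.log && echo "storage: NVFS enabled"
#   dmesg | grep -F 'Built with NVFS RDMA support'
```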

$ /usr/local/cuda/gds/tools/gdscheck -p

Look for “BeeGFS: Supported” in the output, along with a list of GPUs, each followed by the phrase “supports GDS”. If “BeeGFS: Unsupported” appears, make sure that at least one supported GPU is present, that NVFS_H_PATH is set in buildArgs in /etc/beegfs/beegfs-client-autobuild.conf, and that the client module has been rebuilt.

The next verification step involves reading a small file from BeeGFS through GDS:
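For scripted checks, the gdscheck -p output can be filtered for the BeeGFS line. This is a minimal sketch: the helper name beegfs_gds_supported is hypothetical, and the pattern assumes only that the driver name, a colon and the status appear on one line, possibly padded with whitespace.

```shell
# Hypothetical helper: read `gdscheck -p` output on stdin and succeed only
# if the BeeGFS driver line reports "Supported".
beegfs_gds_supported() {
    # Allow arbitrary whitespace around the colon, since the report may pad
    # driver names into columns. "Unsupported" does not match because the
    # pattern requires "Supported" directly after the separator.
    grep -Eq 'BeeGFS[[:space:]]*:[[:space:]]*Supported'
}

# Usage:
#   /usr/local/cuda/gds/tools/gdscheck -p | beegfs_gds_supported \
#       && echo "BeeGFS is supported by GDS"
```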

$ dd if=/dev/urandom of=/mnt/beegfs/test-4k bs=4K count=1
$ /usr/local/cuda/gds/tools/gdscheck -f /mnt/beegfs/test-4k

Look for each GPU to be listed in the output and the message “read_verification: pass”. If that message does not appear then ensure that beegfs-net indicates RDMA connections to storage and meta and that gdscheck -p indicates that BeeGFS is supported.

gdsio may now be used to test BeeGFS/GDS performance. The parameters will depend upon desired workload and available GPUs.

$ mkdir /mnt/beegfs/gdsio
$ /usr/local/cuda/gds/tools/gdsio -D /mnt/beegfs/gdsio -w 8 -d 1 -I 1 -x 0 -s 1G -i 1M

RDMA NIC priority

BeeGFS client uses the NVFS device priority function to determine which RDMA NIC is best to use for I/O with a particular GPU. This feature is only enabled when multi-rail support is configured via connRDMAInterfacesFile.

Multi-rail support is required for device priority because BeeGFS client needs a list of RDMA NICs that are available for use. If multi-rail is not configured, BeeGFS client will select the RDMA device that can communicate with a given storage node via its IPoIB address.

NVFS device priority may return the same priority value for two or more RDMA NICs. When this is the case, the RDMA NIC will be selected from that set according to number of connections.

It is also possible that the RDMA NIC with the highest priority doesn’t have any more available connections for a given storage node. When this occurs, the next highest priority device is selected.

Each RDMA NIC is allowed (connMaxInternodeNum / rdmaNicCount) connections per node. A workload that runs on a particular set of GPUs that prioritize a certain set of RDMA NICs may result both in starvation of some NICs and in selection of NICs that don’t have the best priority. It may be necessary to move GPUs closer to other RDMA NICs, rebalance the workload and/or increase connMaxInternodeNum to achieve optimal performance.
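The per-NIC connection budget is simple arithmetic. This is a minimal sketch: the helper name per_nic_connections is hypothetical, the values in the usage comments are illustrative only, and rounding down (integer division) is an assumption about how the client applies the formula.

```shell
# Hypothetical helper: connections each RDMA NIC may open to one node,
# following the (connMaxInternodeNum / rdmaNicCount) rule.
# Integer division is an assumption about how the client rounds.
per_nic_connections() {
    # $1: connMaxInternodeNum, $2: number of RDMA NICs in the multi-rail set
    if [ "$2" -le 0 ]; then
        echo "rdmaNicCount must be positive" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

# Usage (illustrative values, not recommendations):
#   per_nic_connections 12 4    # prints 3
#   per_nic_connections 32 4    # prints 8
```

With few connections per NIC, a GPU's preferred NIC is more likely to run out of connections, which is when a lower-priority NIC gets selected.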

Supported Features

All BeeGFS file system features are supported for GDS I/O. This includes striping, ACLs, quotas, Buddy Mirroring and BeeOND.

GDS I/O requests are always O_DIRECT, per the GDS architecture specification. Thus, caching is disabled for GDS I/O requests.

BeeGFS GDS I/O requests must be 4KB block-aligned. Unaligned I/O requests (offset or length) will fail.
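Before launching a test or job with specific offsets and sizes, the alignment rule can be checked up front. This is a minimal sketch: the helper name is_gds_aligned is hypothetical, and it assumes that 4KB means 4096 bytes.

```shell
# Hypothetical helper: succeed if both the offset and the length of an I/O
# request are multiples of 4096 bytes.
is_gds_aligned() {
    # $1: file offset in bytes, $2: request length in bytes
    [ $(( $1 % 4096 )) -eq 0 ] && [ $(( $2 % 4096 )) -eq 0 ]
}

# Usage:
#   is_gds_aligned 8192 1048576 && echo "aligned"
#   is_gds_aligned 4096 1000    || echo "unaligned length"
```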

Tuning

General BeeGFS and RDMA tuning principles apply to the BeeGFS/GDS environment.

Client multi-rail should be enabled for the appropriate RDMA NICs so BeeGFS client can use NVFS priority to identify the best NIC to use with a particular GPU.

/etc/cufile.json has many options to configure the behavior of libcufile, the user space component of GDS. Under the beegfs section, there are parameters rdma_dev_addr_list and mount_table that have no effect on BeeGFS. In the properties section, the gds_rdma_write_support parameter may be set to false to disable GDS writes. Writes through libcufile will still work but are transparently turned into POSIX I/O writes, which may improve write performance on systems that have SYS, NODE or PHB topology between GPUs and RDMA NICs.
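To see which write mode is currently configured, the gds_rdma_write_support value can be read directly from the file. This is a minimal sketch: the helper name gds_write_setting is hypothetical. It uses text matching rather than a JSON parser on the assumption that the file may contain // comments, which strict JSON parsers reject.

```shell
# Hypothetical helper: print the value (true or false) assigned to
# gds_rdma_write_support in a cufile.json-style file.
gds_write_setting() {
    # $1: path to the configuration file
    # Drop lines that begin with a // comment, then extract the boolean that
    # follows the key. Only the first match is printed.
    grep -v '^[[:space:]]*//' "$1" \
        | sed -nE 's/.*"gds_rdma_write_support"[[:space:]]*:[[:space:]]*(true|false).*/\1/p' \
        | head -n 1
}

# Usage:
#   gds_write_setting /etc/cufile.json
```

An empty result means the key was not found in the file.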

GPU to RDMA NIC topology may be inspected through nvidia-smi topo -m. This command shows how each GPU communicates with each RDMA NIC as well as each GPU’s NUMA and CPU affinity. It is a best practice to execute a GDS application on the same NUMA zone as the GPUs it will use.

See also