GPUDirect Storage Support
NVIDIA GPUDirect Storage (GDS) is part of the NVIDIA Magnum IO SDK. It enables direct memory access between a GPU and an RDMA NIC.
The benefits of GDS are similar to those of RDMA: higher throughput, lower latency and reduced CPU utilization. NVIDIA provides an overview of GDS in their blog.
BeeGFS implements support for GDS in the BeeGFS client and the beegfs-storage service.
Clients
The client requirements for GDS follow NVIDIA's requirements. Currently, these are:
Ubuntu 18, Ubuntu 20 or RHEL 8. x86_64 architecture only.
CUDA 11.5.1 - the first version of CUDA to have GDS support for BeeGFS
Mellanox OFED 5.4
nvidia-fs 2.9.5-1
NVIDIA datacenter class GPU, Tesla or newer architecture
Mellanox ConnectX-5 or newer RDMA NIC
MOFED, CUDA and nvidia-fs should be installed before installing BeeGFS on the client node. See https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html
The BeeGFS client is set up as normal via the /opt/beegfs/sbin/beegfs-setup-client utility.
GDS is enabled in the client by adding the NVFS_H_PATH parameter to the file /etc/beegfs/beegfs-client-autobuild.conf.
NVFS_H_PATH points to the directory that contains nvfs.h, which is under the MOFED source directory. OFED_INCLUDE_PATH should be configured to point at the MOFED headers.
Example:
buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include NVFS_H_PATH=/usr/src/mlnx-ofed-kernel-5.4/drivers/nvme/host
It doesn’t matter that nvfs.h is under the nvme driver directory. The BeeGFS client and the nvme driver use the same version of that header file.
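Before rebuilding, it can help to confirm that the directory chosen for NVFS_H_PATH really contains nvfs.h. The following is a minimal sketch, not part of BeeGFS: the function names has_nvfs_header and make_build_args are hypothetical, and the -j8 value is simply copied from the example above.

```shell
# Succeed if the given directory contains nvfs.h.
has_nvfs_header() {
    [ -f "$1/nvfs.h" ]
}

# Print a buildArgs line for beegfs-client-autobuild.conf.
# $1: OFED_INCLUDE_PATH, $2: NVFS_H_PATH
# Fails with a message on stderr if nvfs.h is missing from $2.
make_build_args() {
    if ! has_nvfs_header "$2"; then
        echo "nvfs.h not found in $2" >&2
        return 1
    fi
    echo "buildArgs=-j8 OFED_INCLUDE_PATH=$1 NVFS_H_PATH=$2"
}

# Usage, with the paths from the example above:
#   make_build_args /usr/src/ofa_kernel/default/include \
#       /usr/src/mlnx-ofed-kernel-5.4/drivers/nvme/host
```

The printed line still has to be placed in /etc/beegfs/beegfs-client-autobuild.conf by hand; the sketch only validates the path and formats the setting.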
Make sure to rebuild the kernel module by running:
# /etc/init.d/beegfs-client rebuild
No additional beegfs-client.conf settings are needed for GDS. GDS integrates with client multi-rail support.
Servers
There are no CUDA, MOFED or OS requirements for GDS support on the server nodes beyond what is required for RDMA support.
All BeeGFS nodes must be configured for RDMA.
GDS support is enabled for BeeGFS servers by adding the BEEGFS_NVFS=1 parameter to the build. For example:
$ make package-rpm PACKAGE_DIR=packages BEEGFS_NVFS=1 RPMBUILD_OPTS="-D 'MAKE_CONCURRENCY <n>'"
Official builds of BeeGFS server packages ship with GDS support enabled starting with BeeGFS version 7.3.0.
BeeGFS server packages with GDS are installed and configured as normal. There are no additional configurations related to GDS support.
Verifying BeeGFS support in GDS
Start beegfs-client and verify that the file system is mounted. Then use beegfs-net to confirm that the connections to storage and meta are RDMA.
The client, beegfs-storage and beegfs-meta logs should include the message “Built with NVFS RDMA support”. If that message does not appear, rebuild with NVFS support and install the appropriate BeeGFS packages.
BeeGFS GDS support is initially verified through the gdscheck utility:
$ /usr/local/cuda/gds/tools/gdscheck -p
Look for “BeeGFS: Supported” in the output. Also look for a list of GPUs and the phrase “supports GDS” after each one.
If “BeeGFS: Unsupported” appears, make sure that there is at least one supported GPU, that NVFS_H_PATH is set in the buildArgs line of /etc/beegfs/beegfs-client-autobuild.conf and that the client module has been rebuilt.
The next verification step involves reading a small file from BeeGFS through GDS:
$ dd if=/dev/urandom of=/mnt/beegfs/test-4k bs=4K count=1
$ /usr/local/cuda/gds/tools/gdscheck -f /mnt/beegfs/test-4k
Look for each GPU to be listed in the output and the message “read_verification: pass”. If that message does not appear, ensure that beegfs-net indicates RDMA connections to storage and meta and that gdscheck -p indicates that BeeGFS is supported.
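The two gdscheck checks above can be scripted. This is a hedged sketch: the function names are hypothetical, the patterns match only the phrases quoted in this section, and they tolerate extra whitespace around the colon on the assumption that gdscheck may pad its columns.

```shell
# Read `gdscheck -p` output on stdin.
# Succeed only if BeeGFS is reported as supported.
beegfs_gds_supported() {
    grep -Eq 'BeeGFS[[:space:]]*:[[:space:]]*Supported'
}

# Read `gdscheck -f <file>` output on stdin.
# Succeed only if the read verification passed.
read_verification_passed() {
    grep -Eq 'read_verification[[:space:]]*:[[:space:]]*pass'
}

# Usage on a client node:
#   /usr/local/cuda/gds/tools/gdscheck -p | beegfs_gds_supported \
#       || echo "BeeGFS not supported by GDS" >&2
#   /usr/local/cuda/gds/tools/gdscheck -f /mnt/beegfs/test-4k | read_verification_passed \
#       || echo "GDS read verification failed" >&2
```

Neither function inspects the per-GPU lines, so the list of GPUs still needs a manual look.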
gdsio may now be used to test BeeGFS/GDS performance. The parameters will depend upon the desired workload and the available GPUs.
$ mkdir /mnt/beegfs/gdsio
$ /usr/local/cuda/gds/tools/gdsio -D /mnt/beegfs/gdsio -w 8 -d 1 -I 1 -x 0 -s 1G -i 1M
RDMA NIC priority
The BeeGFS client uses the NVFS device priority function to determine which RDMA NIC is best to use for I/O with a particular GPU. This feature is only enabled when multi-rail support is configured via connRDMAInterfacesFile.
Multi-rail support is required for device priority because BeeGFS client needs a list of RDMA NICs that are available for use. If multi-rail is not configured, BeeGFS client will select the RDMA device that can communicate with a given storage node via its IPoIB address.
NVFS device priority may return the same priority value for two or more RDMA NICs. When this is the case, the RDMA NIC will be selected from that set according to number of connections.
It is also possible that the RDMA NIC with the highest priority doesn’t have any more available connections for a given storage node. When this occurs, the next highest priority device is selected.
Each RDMA NIC is allowed (connMaxInternodeNum / rdmaNicCount) connections per node. A workload that runs on a set of GPUs that all prioritize the same RDMA NICs may result both in starvation of some NICs and in selection of NICs that don’t have the best priority. It may be necessary to move GPUs closer to other RDMA NICs, rebalance the workload and/or increase connMaxInternodeNum to achieve optimal performance.
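As a quick way to reason about the per-NIC connection budget, the division above can be written out. This is an illustrative sketch only: per_nic_connections is a hypothetical helper, and rounding down (integer division) is an assumption, since the text does not say how a remainder is handled.

```shell
# Connections each RDMA NIC may open to a single node:
# connMaxInternodeNum / rdmaNicCount, rounded down (assumption).
# $1: connMaxInternodeNum, $2: number of RDMA NICs
per_nic_connections() {
    if [ "$2" -le 0 ]; then
        echo "rdmaNicCount must be positive" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

# Usage:
#   per_nic_connections 12 4    # 12 connections shared by 4 NICs
#   per_nic_connections 32 4    # after raising connMaxInternodeNum
```

With 12 connections and 4 NICs, each NIC gets 3 connections per node, so a GPU whose preferred NIC is already at 3 falls back to the next-priority device.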
Supported Features
All BeeGFS file system features are supported for GDS I/O. This includes striping, ACLs, quotas, Buddy Mirroring and BeeOND.
GDS I/O requests are always O_DIRECT, per the GDS architecture specification. Thus, caching is disabled for GDS I/O requests.
BeeGFS GDS I/O requests must be 4KB block-aligned. Unaligned I/O requests (offset or length) will fail.
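The alignment rule can be checked before issuing a request. The sketch below assumes that 4KB means 4096 bytes and that both the offset and the length must be multiples of it; is_gds_aligned is a hypothetical helper name.

```shell
# Succeed if both the offset and the length (in bytes)
# are multiples of 4096.
# $1: offset, $2: length
is_gds_aligned() {
    [ $(( $1 % 4096 )) -eq 0 ] && [ $(( $2 % 4096 )) -eq 0 ]
}

# Usage:
#   is_gds_aligned 8192 1048576 && echo "aligned"
#   is_gds_aligned 512 4096     || echo "unaligned offset"
```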
Tuning
General BeeGFS and RDMA tuning principles apply to the BeeGFS/GDS environment.
Client multi-rail should be enabled for the appropriate RDMA NICs so BeeGFS client can use NVFS priority to identify the best NIC to use with a particular GPU.
/etc/cufile.json has many options to configure the behavior of libcufile, the user space component of GDS. Under the beegfs section, the parameters rdma_dev_addr_list and mount_table have no effect on BeeGFS. In the properties section, the gds_rdma_write_support parameter may be set to false to disable GDS writes. Writes through libcufile will still work but are transparently turned into POSIX I/O writes, which may improve write performance on systems that have SYS, NODE or PHB topology between GPUs and RDMA NICs.
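To see how gds_rdma_write_support is currently set, the file can be inspected from a script. This is a rough sketch under stated assumptions: gds_write_setting is a hypothetical helper, the key and its value are assumed to sit on one line, and lines starting with // are treated as comments and skipped. It is a text match, not a JSON parser.

```shell
# Print the value (true or false) of gds_rdma_write_support
# found in a cufile.json-style file.
# $1: path to the file (defaults to /etc/cufile.json)
gds_write_setting() {
    file=${1:-/etc/cufile.json}
    # Drop lines commented out with //, then extract the first value.
    grep -v '^[[:space:]]*//' "$file" |
        sed -nE 's/.*"gds_rdma_write_support"[[:space:]]*:[[:space:]]*(true|false).*/\1/p' |
        head -n 1
}

# Usage:
#   gds_write_setting                  # reads /etc/cufile.json
#   gds_write_setting ./cufile.json    # reads a local copy
```

An empty result means the key was not found on an uncommented line.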
GPU to RDMA NIC topology may be inspected through nvidia-smi topo -m. This command shows how each GPU communicates with each RDMA NIC as well as each GPU’s NUMA and CPU affinity. It is a best practice to execute a GDS application on the same NUMA zone as the GPUs it will use.
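One common way to keep an application on a given NUMA zone is numactl, which is not mentioned above and is used here only as an illustration. The sketch builds the command line rather than running it; numa_bound_cmd and ./my_gds_app are hypothetical names, and the node number is expected to come from the NUMA affinity column of nvidia-smi topo -m.

```shell
# Print a numactl command line that binds both CPU and memory
# of the given program to one NUMA node.
# $1: NUMA node number, remaining arguments: program and its arguments
numa_bound_cmd() {
    node=$1
    shift
    echo "numactl --cpunodebind=$node --membind=$node $*"
}

# Usage:
#   numa_bound_cmd 1 ./my_gds_app
#   eval "$(numa_bound_cmd 1 ./my_gds_app)"   # actually run it
```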