RDMA support for InfiniBand, RoCE (RDMA over Converged Ethernet), and Omni-Path in BeeGFS is based on the Open Fabrics Enterprise Distribution (OFED).
Modern Linux distributions include mature OFED drivers that are suitable for use with BeeGFS. Vendor-specific OFED distributions are also supported, but usually not required.
Unlike in earlier versions, RDMA support is available by default. If you want to use a third-party driver, however, you have to specify it in the file
/etc/beegfs/beegfs-client-autobuild.conf like this:
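A sketch of such an entry, assuming a vendor OFED with kernel headers installed under /usr/src/openib (the include path is an example and depends on your driver installation):

```
# /etc/beegfs/beegfs-client-autobuild.conf
# Point the client module build at the third-party OFED headers
# (path below is an example; adjust to your installation).
buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/openib/include
```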
Make sure to rebuild the kernel module by running
# /etc/init.d/beegfs-client rebuild
IP addresses are required for connection initiation on RDMA-enabled interfaces. Interfaces without a configured IP address will not be picked up by the servers.
Please install the libbeegfs-ib package on the server nodes.
BeeGFS will then enable RDMA support automatically if hardware and drivers are installed.
Verifying RDMA Connectivity
At runtime, you can check whether your RDMA devices have been discovered by using beegfs-ctl to list all registered services and their configured network interfaces in order of preference:
$ beegfs-ctl --listnodes --nodetype=storage --details
$ beegfs-ctl --listnodes --nodetype=meta --details
$ beegfs-ctl --listnodes --nodetype=client --details
The word “RDMA” will be appended to interfaces with RDMA support.
To check whether the clients are connecting to the servers via RDMA, or whether they are falling back to TCP because of configuration problems, use the following command to list established connections on a client:

$ beegfs-net

This command needs to be executed on a client node with a mounted BeeGFS file system, since it reads information from /proc/fs/beegfs.
In addition to the commands above, the log files of clients and servers also provide information on established connections and connection failures, provided the configured log level is high enough.
A typical source of trouble is having the ibacm service (/etc/init.d/ibacm) running on the machines. This service causes RDMA connection attempts to stall and should be disabled on all nodes.
The following RDMA-specific configuration variables exist for client and server daemons:

connRDMABufNum: Number of available buffers per connection. (Default = 70)

connRDMABufSize: Size of the buffers in bytes. (Default = 8192) RDMA memory cannot be swapped out, so using large buffers can negatively impact applications running concurrently with BeeGFS.

connMaxInternodeNum: The number of parallel connections that a node can establish to each of the other nodes. Connections are only established when needed and are dropped again after being idle for a while.
Small values of connRDMABufSize cause many round-trips to transfer a given amount of data, resulting in low latency but high CPU usage and lower bandwidth. High values result in higher latency but better bandwidth and lower CPU usage. The default values of 70 * 8192 bytes = 573,440 bytes (560 KiB) give reasonably low latency while still being able to saturate an FDR InfiniBand link.
The amount of memory needed per connection is connRDMABufSize * connRDMABufNum * 2. Each client may require up to connMaxInternodeNum times that amount of memory on every server it connects to. Likewise, each client itself may need the same amount of memory for each server it connects to.
In addition to client-server connections, servers may also open up to connMaxInternodeNum connections between each other. When sizing a BeeGFS installation, ensure that each server can use
connRDMABufSize * connRDMABufNum * 2 * connMaxInternodeNum * (#servers + #clients)
bytes of memory for RDMA without degradation of service. Clients do not connect to each other and thus require less memory:
connRDMABufSize * connRDMABufNum * 2 * connMaxInternodeNum * #servers
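As a hedged sketch of this sizing arithmetic, the formulas above can be evaluated directly in a shell. The parameter values, server count, and client count below are example assumptions, not recommendations:

```shell
# Assumed example values: default buffer parameters,
# connMaxInternodeNum=12, 4 servers, 200 clients.
BUF_SIZE=8192        # connRDMABufSize (default)
BUF_NUM=70           # connRDMABufNum (default)
MAX_CONN=12          # connMaxInternodeNum (assumed)
SERVERS=4
CLIENTS=200

# Per-server RDMA memory:
# connRDMABufSize * connRDMABufNum * 2 * connMaxInternodeNum * (#servers + #clients)
SERVER_BYTES=$((BUF_SIZE * BUF_NUM * 2 * MAX_CONN * (SERVERS + CLIENTS)))

# Per-client RDMA memory: same formula, but only towards the servers.
CLIENT_BYTES=$((BUF_SIZE * BUF_NUM * 2 * MAX_CONN * SERVERS))

echo "Per server: $((SERVER_BYTES / 1024 / 1024)) MiB"
echo "Per client: $((CLIENT_BYTES / 1024 / 1024)) MiB"
```

With these assumed numbers, each server must have roughly 2.7 GiB of unswappable memory available for RDMA buffers, while each client only needs about 52 MiB.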
In an active system, you can expect the maximum number of connections to be present for metadata servers and clients, while storage servers will not usually open connections to other storage servers.
Client Multi-Rail Support
The default behavior of the BeeGFS client is to use a single RDMA NIC for communication with beegfs-meta and beegfs-storage. When the client has multiple RDMA NICs, it is advantageous to configure multi-rail support.
In the past, this was made possible by configuring the client IPoIB devices in separate IPv4 subnets and binding beegfs-storage and/or beegfs-meta instances to IPoIB devices in the different subnets. While that type of configuration is still possible, BeeGFS 7.3.0 introduces explicit multi-rail support in the BeeGFS client via the connRDMAInterfacesFile setting in beegfs-client.conf.
Explicit multi-rail support provides the following features:

- Specification of which client RDMA NICs to use for BeeGFS RDMA traffic.
- Use of multiple RDMA NICs in a single IPoIB subnet.
- Dynamic load-balancing between RDMA NICs according to connection count.
- Selection of an RDMA NIC according to GPUDirect Storage Support NVFS device priority.
Multi-rail support will not work correctly if the client uses RDMA NICs configured in separate IPoIB subnets. Every RDMA NIC configured for client use must have IPoIB connectivity to every BeeGFS service in the cluster.
connRDMAInterfacesFile specifies the path to a file containing the names of devices the BeeGFS client
should use for outbound RDMA connections. The file format is one IPoIB device name listed per line.
connRDMAInterfacesFile = /etc/beegfs/client-rdma.conf
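For example, the referenced file would contain one device name per line (a sketch; ib0 and ib1 are example device names that depend on your hardware):

```
ib0
ib1
```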
This configuration would load balance RDMA communications of the BeeGFS client across ib0 and ib1 and would ignore any other RDMA NICs on that node. The interfaces configured for outbound RDMA are listed in the client log file.
This is different behavior than what is configured by connInterfacesFile, which specifies which network addresses are advertised for the BeeGFS client through beegfs-mgmtd. Multi-rail support does not depend upon connInterfacesFile.
When configuring a node's multiple IPoIB devices in the same IPv4 subnet, it may be necessary to configure the IP routing tables and rules for multi-homed support. The primary indicator of the need for this configuration is when the BeeGFS client cannot establish RDMA connections to all of the BeeGFS services. This is not the same task as enabling IP forwarding between NICs; it is a routing configuration to segregate traffic between NICs on the same IPv4 subnet. The article Improve Your Multi-Home Servers With Policy Routing discusses multi-homed IPv4 configuration in detail.
The following sysctl parameters are useful for multi-home:
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 2
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.arp_filter = 1
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.default.arp_ignore = 2
These changes may need to be performed on the client and/or server nodes.
Adjust the RDMA buffer parameters in the client and server configuration files:
connRDMABufNum = 12
connRDMABufSize = 65536
Install the additional vendor driver packages if required. The ib_qib module needs to be tuned, at least on the server side. Add the following line to a modprobe configuration file (e.g., /etc/modprobe.d/ib_qib.conf):

options ib_qib singleport=1 krcvqs=4 rcvhdrcnt=4096
The optimal value of krcvqs depends on the number of CPU cores. This value reserves the given number of receive queues for ibverbs. Please see the Intel/QLogic OFED release notes for more information.
On large clusters, you might need to adapt parameters on the servers to allow accepting a higher number of incoming RDMA connections. For example:
Add the following driver options to the ib_qib module configuration (e.g., in /etc/modprobe.d/ib_qib.conf):

options ib_qib lkey_table_size=18 max_qp_wrs=131072 max_qps=131072 qp_table_size=2048
Then also increase the map count (use sysctl to make this change persistent):
echo 1000000 > /proc/sys/vm/max_map_count
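To make this setting persistent across reboots, a sysctl drop-in file can be used (the file name below is an example):

```
# /etc/sysctl.d/90-beegfs.conf
vm.max_map_count = 1000000
```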
And increase the maximum number of file handles (use /etc/security/limits.conf to make this change persistent):
ulimit -n 262144
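A hedged example of the corresponding persistent entries in /etc/security/limits.conf (the wildcard domain is an assumption; restrict it to the relevant users or services as appropriate):

```
# /etc/security/limits.conf
*    soft    nofile    262144
*    hard    nofile    262144
```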
Adjust the RDMA buffer parameters in the client and server configuration files:
connRDMABufNum = 12
connRDMABufSize = 65536
Intel Omni-Path provides a mode called “Accelerated RDMA” to improve performance of large transfers, which is off by default. See Intel Omni-Path Performance Tuning Guide chapter “Accelerated RDMA” for information on how to enable this mode.
On large clusters, you might need to set the log_num_mtt option of the mlx driver to allow a higher number of RDMA connections. This is typically set in a modprobe configuration file (e.g., /etc/modprobe.d/mlx4_core.conf).
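A sketch of such an entry for the mlx4 driver (the value 24 is an example assumption; the appropriate value depends on installed memory and workload):

```
# /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_num_mtt=24
```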
The default settings for connRDMABufNum are fine for Mellanox DDR, QDR, and FDR.
For EDR use the following:
connRDMABufNum = 22
connRDMABufSize = 32768
In an RDMA-capable cluster, some BeeGFS communication (especially communication with the management service, which is not performance-critical) uses TCP/IP transfer. On some systems, the default “connected” IP-over-IB mode of InfiniBand and Omni-Path does not seem to work well and results in spurious problems. In this case, you should try to switch the IPoIB mode to “datagram” on all hosts:
$ echo datagram > /sys/class/net/ibX/mode