RDMA Support

RDMA support for InfiniBand, RoCE (RDMA over Converged Ethernet), and Omni-Path in BeeGFS is based on the OpenFabrics Enterprise Distribution ibverbs API (http://www.openfabrics.org).

Modern Linux distributions include mature OFED drivers that are suitable for use with BeeGFS. Vendor-specific OFED distributions are also supported, but usually not required.

Clients

Unlike in earlier versions, RDMA support is available by default. If you want to build against a third-party driver, however, you have to specify its include path in the file /etc/beegfs/beegfs-client-autobuild.conf like this: buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/openib/include.

To build the client without RDMA support, add BEEGFS_NO_RDMA=1 to buildArgs.

Make sure to rebuild the kernel module by running

# /etc/init.d/beegfs-client rebuild

Servers

Note

IP addresses are required for connection initiation on RDMA-enabled interfaces. Interfaces that don’t have an IP address configured will not be picked up by the servers.

Please install the libbeegfs-ib package. BeeGFS will then enable RDMA support automatically if hardware and drivers are installed.

Verifying RDMA Connectivity

At runtime, you can check whether your RDMA devices have been discovered by using beegfs-ctl to list all registered services and their configured network interfaces in order of preference:

$ beegfs-ctl --listnodes --nodetype=storage --details
$ beegfs-ctl --listnodes --nodetype=meta --details
$ beegfs-ctl --listnodes --nodetype=client --details

The word “RDMA” will be appended to interfaces with RDMA support.

To check whether the clients are connecting to the servers via RDMA or whether they are falling back to TCP because of configuration problems, use the following command to list established connections on a client:

$ beegfs-net

This command needs to be executed on a client node with a mounted BeeGFS file system, since it reads information from /proc/fs/beegfs/<clientID>/X_nodes.

In addition to the commands above, the log files also provide information on established connections and connection failures (if you are using at least logLevel=3). See /var/log/beegfs-X.log on clients and servers.

Common RDMA Issues

A typical source of trouble is having the ibacm service (/etc/init.d/ibacm) running on the machines. This service causes RDMA connection attempts to stall and should be disabled on all nodes.

Tuning

The following RDMA specific configuration variables exist for BeeGFS nodes (i.e., clients and server services). These settings are configured per node, and affect outbound connections from that node to other nodes:

connRDMABufNum

Number of available buffers per connection. (Default = 70)

connRDMABufSize

Size of the buffers in bytes. (Default = 8192) RDMA memory cannot be swapped out, so using large buffers can negatively impact applications running concurrently with BeeGFS.

connRDMAFragmentSize

Controls how contiguous memory is allocated per buffer. From a performance standpoint, it is optimal if buffers are less fragmented and occupy contiguous regions of memory. However, if the system runs low on memory, allocation may fail when fragmentation is disabled or set to a high value. On systems with sufficient memory this can be set to “none” to avoid fragmenting buffers entirely. Otherwise it can be set to a number of bytes that connRDMABufSize is divisible by; for example, a fragment size of 4096 and a buffer size of 8192 means each buffer is allocated as two regions of 4096 contiguous bytes. Alternatively, it can be set to “page” to use the Linux PAGE_SIZE as the fragmentation value. (Default = page)

connMaxInternodeNum

The number of parallel connections that a node can establish to each of the other nodes. Connections are only established when needed and are dropped again after they have been idle for a while.

When adjusting these settings, note that small values of connRDMABufSize will cause many round-trips to transfer a given amount of data, resulting in low latency but high CPU usage and lower bandwidth. High values will result in higher latency, but better bandwidth and lower CPU usage. The default values of 70 * 8192 = 573440 bytes (560 KiB) give reasonably low latency while still being able to saturate an FDR InfiniBand link.
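As a quick sanity check of the arithmetic above, the following Python sketch computes the per-direction buffer window from the default settings quoted in this section:

```python
# Buffer window implied by the beegfs-client.conf defaults quoted above.
conn_rdma_buf_num = 70     # connRDMABufNum (default)
conn_rdma_buf_size = 8192  # connRDMABufSize in bytes (default)

window = conn_rdma_buf_num * conn_rdma_buf_size
print(f"In-flight data per direction: {window} bytes ({window // 1024} KiB)")
# 70 * 8192 = 573440 bytes = 560 KiB
```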

Because metadata messages are usually small and do not require as much buffer space as connections to storage servers, it is also possible to configure different RDMA settings for client connections to metadata servers, see the connRDMAMeta* settings in beegfs-client.conf.

Client/Server Memory Requirements for RDMA Connections

The amount of memory needed per client connection can be calculated as connRDMABufSize * connRDMABufNum * 2. We multiply by two because each RDMA connection has both a send and a receive queue (a queue pair), and the buffer configuration applies to each of these queues. Clients can establish up to connMaxInternodeNum connections to each server service, so the per-client memory requirements of each server service depend on both the per-connection buffer settings and the maximum number of connections. These parameters are configured on a per-client basis using beegfs-client.conf. Note that these parameters also determine the client-side memory requirements, although these are typically much lower than on the server side, because a client generally connects to far fewer server services than the number of clients that may connect to a server service.

In addition to client connections to server services, metadata services may also open up to connMaxInternodeNum connections to other metadata and storage services. The number of connections to other services and the RDMA buffer configuration can also be configured on a per-metadata-service basis using beegfs-meta.conf.

A simple rule of thumb when sizing a BeeGFS installation is to ensure that each node type can use at least connRDMABufSize * connRDMABufNum * 2 * connMaxInternodeNum * (#server-services + #clients) bytes of memory for RDMA without degradation of service. Clients do not connect to each other and thus require less memory: connRDMABufSize * connRDMABufNum * 2 * connMaxInternodeNum * #server-services.
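To make the rule of thumb concrete, here is a short Python sketch for a hypothetical cluster; the node counts and the connMaxInternodeNum value below are invented purely for illustration:

```python
# Rule-of-thumb RDMA memory sizing (hypothetical cluster shape).
buf_size = 8192        # connRDMABufSize (default)
buf_num = 70           # connRDMABufNum (default)
max_conns = 12         # connMaxInternodeNum (example value)
server_services = 20   # total meta + storage services (example)
clients = 500          # client mounts (example)

per_conn = buf_size * buf_num * 2  # x2: send and receive queue

# Server-side rule of thumb: all server services plus all clients may connect.
servers = per_conn * max_conns * (server_services + clients)
# Clients only connect to server services, not to each other.
client_side = per_conn * max_conns * server_services

print(f"Per connection:       {per_conn / 2**20:.2f} MiB")
print(f"Server rule of thumb: {servers / 2**30:.2f} GiB")
print(f"Client rule of thumb: {client_side / 2**20:.2f} MiB")
```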

More precise sizing is possible by examining the maximum possible connections for each node type, since not all types of nodes connect to each other. In BeeGFS, clients connect to metadata and storage services, metadata services connect to other metadata and storage services, and storage services do not establish connections to other services. This means metadata services have incoming connections from clients, incoming and outgoing connections to other metadata services, and outgoing connections to storage services. Storage services have incoming connections from clients and metadata services. Clients have outgoing connections to metadata services and storage services.

Thus each node type may require the following number of RDMA connections:

Metadata

connMaxInternodeNum * (#clients + ((#meta - 1) * 2) + #storage)

Storage

connMaxInternodeNum * (#clients + #meta)

Client

connMaxInternodeNum * (#meta + #storage)

Then to determine the RDMA memory requirements for each node type:

connRDMABufSize * connRDMABufNum * 2 * <number of incoming/outgoing connections to that node type>

The result is an estimate of the minimum RDMA memory requirements for each type of node in BeeGFS. Because BeeGFS nodes establish connections on demand and drop them after some idle time (currently 70 minutes), it is unlikely that all these connections will be active at once. However, it is not unlikely for a specific node to see this many connections (for example, the metadata service that owns the root directory), so it is still wise to size individual servers for the worst-case scenario based on the types and number of server services they are running. Unused memory is not wasted, as it can simply be used for file system caching.
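The per-node-type formulas can be sketched in a few lines of Python; the cluster shape below is hypothetical, so substitute your own counts and configured values:

```python
# Illustrative per-node-type connection counts and RDMA memory estimates,
# using the formulas from this section (all counts are made-up examples).
clients, meta, storage = 500, 4, 16
max_conns = 12                 # connMaxInternodeNum (example value)
per_conn = 8192 * 70 * 2       # connRDMABufSize * connRDMABufNum * 2 (defaults)

conns = {
    "metadata": max_conns * (clients + (meta - 1) * 2 + storage),
    "storage":  max_conns * (clients + meta),
    "client":   max_conns * (meta + storage),
}

for node_type, n in conns.items():
    print(f"{node_type:8s}: {n:5d} connections, "
          f"{n * per_conn / 2**30:.2f} GiB RDMA memory")
```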

Note

Keep in mind the connMaxInternodeNum and RDMA buffer settings can be set differently on each BeeGFS node (especially across node types). Also remember when using the multi-mode option, there may be multiple BeeGFS server services (potentially for different node types) running on the same physical server. The memory requirements will need to be adjusted accordingly.

Client Multi-Rail Support

By default, the BeeGFS client uses a single RDMA NIC for communication with beegfs-meta and beegfs-storage. When the client has multiple RDMA NICs, it is advantageous to configure multi-rail support.

In the past, this was possible by configuring the client IPoIB devices in separate IPv4 subnets and binding beegfs-storage and/or beegfs-meta instances to IPoIB devices in the different subnets. While that type of configuration is still possible, BeeGFS 7.3.0 introduces explicit multi-rail support in the BeeGFS client via the connRDMAInterfacesFile setting in /etc/beegfs/beegfs-client.conf.

Explicit multi-rail support provides the following features:

  1. Specification of which client RDMA NICs to use for BeeGFS RDMA traffic.

  2. Client makes use of multiple RDMA NICs in a single IPoIB subnet.

  3. Dynamic load-balancing between RDMA NICs according to connection count.

  4. Support for selecting an RDMA NIC according to GPUDirect Storage Support NVFS device priority.

Multi-rail support will not work correctly if the client uses RDMA NICs configured in separate IPoIB subnets. Every RDMA NIC configured for client use must have IPoIB connectivity to every BeeGFS service in the cluster.

connRDMAInterfacesFile specifies the path to a file containing the names of devices the BeeGFS client should use for outbound RDMA connections. The file format is one IPoIB device name listed per line.

/etc/beegfs/beegfs-client.conf:

connRDMAInterfacesFile = /etc/beegfs/client-rdma.conf

/etc/beegfs/client-rdma.conf:

ib0
ib1

This configuration would load balance RDMA communications for BeeGFS client across ib0 and ib1 and would ignore any other RDMA NICs on that node. The interfaces configured for outbound RDMA are listed in /proc/fs/beegfs/<clientID>/client_info.

This behavior is different from what is configured by connInterfacesFile, which specifies the network addresses advertised for the BeeGFS client through beegfs-mgmtd. Multi-rail support does not depend on connInterfacesFile.

When configuring a node’s multiple IPoIB devices in the same IPv4 subnet, it may be necessary to configure the IP routing tables and rules for multi-homed support. The primary indicator of the need for this configuration is when the BeeGFS client cannot establish RDMA connections to all of the BeeGFS services. This is not the same task as enabling IP forwarding between NICs; it is a routing configuration that segregates traffic between NICs on the same IPv4 subnet. The article “Improve Your Multi-Home Servers With Policy Routing” discusses multi-homed IPv4 configuration in detail.

The following sysctl parameters are useful for multi-homed setups:

net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 2
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.arp_filter = 1
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.default.arp_ignore = 2

These changes may need to be performed on the client and/or server nodes.

Intel/QLogic TrueScale

Adjust the RDMA buffer parameters in /etc/beegfs/beegfs-client.conf to 12 buffers of 64 KiB each:

  • connRDMABufNum = 12

  • connRDMABufSize = 65536

Install the additional libipathverbs package.

The ib_qib module needs to be tuned at least on the server side. Add the following line to either /etc/modprobe.conf or /etc/modprobe.d/ib_qib.conf:

options ib_qib singleport=1 krcvqs=4 rcvhdrcnt=4096

The optimal value of krcvqs depends on the number of CPU cores. This value reserves the given number of receive queues for ibverbs. Please see Intel/QLogic OFED release notes for more details.

On large clusters, you might need to adapt parameters on the servers to allow accepting a higher number of incoming RDMA connections. For example:

Add the following driver options to the ib_qib line:

ib_qib lkey_table_size=18 max_qp_wrs=131072 max_qps=131072 qp_table_size=2048

Then also increase the map count (use /etc/sysctl.conf to make this change persistent):

echo 1000000 > /proc/sys/vm/max_map_count

And increase the maximum number of file handles (use /etc/security/limits.conf to make this change persistent):

ulimit -n 262144

Cornelis Omni-Path

Adjust the RDMA buffer parameters in /etc/beegfs/beegfs-client.conf to 12 buffers of 64 KiB each:

  • connRDMABufNum = 12

  • connRDMABufSize = 65536

Cornelis Omni-Path provides a mode called “Accelerated RDMA” to improve performance of large transfers, which is off by default. See Cornelis Omni-Path Performance Tuning Guide chapter “Accelerated RDMA” for information on how to enable this mode.

Mellanox InfiniBand

On large clusters, you might need to set the log_mtts_per_seg and log_num_mtt options for the mlx4 driver to allow a higher number of RDMA connections. This is typically set in /etc/modprobe.d/mlx4_core.conf.

The default settings for connRDMABufSize and connRDMABufNum are fine for Mellanox DDR, QDR, and FDR.

For EDR use the following:

  • connRDMABufNum = 22

  • connRDMABufSize = 32768
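For comparison, the per-direction buffer windows implied by the presets quoted in this document can be computed with a short Python sketch (default settings, the TrueScale/Omni-Path values, and the Mellanox EDR values above):

```python
# Per-direction buffer windows (connRDMABufNum x connRDMABufSize) for the
# presets mentioned in this document.
presets = {
    "default":             (70, 8192),
    "TrueScale/Omni-Path": (12, 65536),
    "Mellanox EDR":        (22, 32768),
}

for name, (num, size) in presets.items():
    window = num * size
    print(f"{name:20s}: {num:3d} x {size:6d} B = {window // 1024} KiB per direction")
# default: 560 KiB, TrueScale/Omni-Path: 768 KiB, Mellanox EDR: 704 KiB
```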

Additional Notes

In an RDMA-capable cluster, some BeeGFS communication (especially communication with the management service, which is not performance-critical) uses TCP/IP transfer. On some systems, the default “connected” IP-over-IB mode of InfiniBand and Omni-Path does not seem to work well and results in spurious problems. In this case, you should try to switch the IPoIB mode to “datagram” on all hosts:

$ echo datagram > /sys/class/net/ibX/mode
