RDMA Support

RDMA support for InfiniBand, RoCE (RDMA over Converged Ethernet), and Omni-Path in BeeGFS is based on the ibverbs API of the OpenFabrics Enterprise Distribution (OFED, http://www.openfabrics.org).

Modern Linux distributions include mature OFED drivers that are suitable for use with BeeGFS. Vendor-specific OFED distributions are also supported, but usually not required.

Clients

Unlike in earlier versions, RDMA support is enabled by default. If you want to build against a third-party OFED driver instead, specify its include path in the file /etc/beegfs/beegfs-client-autobuild.conf like this:

buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/openib/include

Afterwards, rebuild the kernel module by running

# /etc/init.d/beegfs-client rebuild

Servers

Note

IP addresses are required for connection initiation on RDMA-enabled interfaces. Interfaces that don’t have an IP address configured will not be picked up by the servers.

Please install the libbeegfs-ib package. BeeGFS will then enable RDMA support automatically if hardware and drivers are installed.

Verifying RDMA Connectivity

At runtime, you can check whether your RDMA devices have been discovered by using beegfs-ctl to list all registered services and their configured network interfaces in order of preference:

$ beegfs-ctl --listnodes --nodetype=storage --details
$ beegfs-ctl --listnodes --nodetype=meta --details
$ beegfs-ctl --listnodes --nodetype=client --details

The word “RDMA” will be appended to interfaces with RDMA support.

To check whether the clients are connecting to the servers via RDMA or whether they are falling back to TCP because of configuration problems, use the following command to list established connections on a client:

$ beegfs-net

This command needs to be executed on a client node, with a mounted BeeGFS file system, since it reads information from /proc/fs/beegfs/<clientID>/X_nodes.

In addition to the commands above, the log files also provide information on established connections and connection failures (if you are using at least logLevel=3). See /var/log/beegfs-X.log on clients and servers.

Tuning

A typical source of trouble is having the ibacm service (/etc/init.d/ibacm) running on the machines. This service causes RDMA connection attempts to stall and should be disabled on all nodes.
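As a rough sketch of how to check for this on one node, the snippet below reports whether ibacm is currently active. It assumes a systemd-based distribution where the service unit is named ibacm; on systems that still use the init script mentioned above, the init script has to be used instead. The variable name ibacm_state is only illustrative.

```shell
#!/bin/sh
# Sketch only: assumes systemd and a unit named "ibacm".
# Read-only check; it does not change any service state.
if command -v systemctl >/dev/null 2>&1 \
        && systemctl is-active --quiet ibacm 2>/dev/null; then
    ibacm_state=active
    echo "ibacm is active; as root, disable it with: systemctl disable --now ibacm"
else
    ibacm_state=inactive
    echo "ibacm is not active (or systemctl is not available on this node)"
fi
```

Running the check on every client and server node before debugging stalled connections rules out this cause early.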

The following RDMA-specific configuration variables exist for clients and server daemons:

  • connRDMABufNum: Number of available buffers per connection. (Default = 70)

  • connRDMABufSize: Size of the buffers in bytes. (Default = 8192) RDMA memory cannot be swapped out. Using large buffers can thus negatively impact applications running concurrently with BeeGFS.

  • connMaxInternodeNum: The number of parallel connections that a node can establish to each of the other nodes. Connections are only established when needed and are also dropped when they are idle for a while.

Small values of connRDMABufSize cause many round-trips to transfer a given amount of data, resulting in low latency but high CPU usage and lower bandwidth. High values result in higher latency but better bandwidth and lower CPU usage. The default values, 70 * 8192 = 573440 bytes (560 KiB), give reasonably low latency while still being able to saturate an FDR InfiniBand link.

The amount of memory needed per connection is connRDMABufSize * connRDMABufNum * 2. Each client may require up to connMaxInternodeNum times that memory on every server it connects to. Likewise, each client may need this much memory for each server it connects to.
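The per-connection formula can be checked with a few lines of shell arithmetic. This is only a sketch: the helper name rdma_conn_bytes is made up for illustration, and the buffer settings are the defaults and the vendor-specific values given elsewhere in this document.

```shell
#!/bin/sh
# Per-connection RDMA memory: connRDMABufSize * connRDMABufNum * 2.
# rdma_conn_bytes is a hypothetical helper, not part of BeeGFS.
rdma_conn_bytes() {
    # $1 = connRDMABufSize in bytes, $2 = connRDMABufNum
    echo $(( $1 * $2 * 2 ))
}

default_bytes=$(rdma_conn_bytes 8192 70)    # documented defaults
large_bytes=$(rdma_conn_bytes 65536 12)     # 12 buffers of 64 KiB
edr_bytes=$(rdma_conn_bytes 32768 22)       # 22 buffers of 32 KiB

echo "defaults:      $(( default_bytes / 1024 )) KiB per connection"
echo "12 x 64 KiB:   $(( large_bytes / 1024 )) KiB per connection"
echo "22 x 32 KiB:   $(( edr_bytes / 1024 )) KiB per connection"
```

With the defaults this yields 1120 KiB per connection; the larger-buffer settings need 1536 KiB and 1408 KiB respectively.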

In addition to client-server connections, servers may also open up to connMaxInternodeNum connections between each other. When sizing a BeeGFS installation, ensure that each server can use at least connRDMABufSize * connRDMABufNum * 2 * connMaxInternodeNum * (#servers + #clients) bytes of memory for RDMA without degradation of service. Clients do not connect to each other and thus require less memory: connRDMABufSize * connRDMABufNum * 2 * connMaxInternodeNum * #servers bytes.
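To get a feeling for the magnitudes involved, the two sizing formulas can be evaluated for an example cluster. The sketch below uses the default buffer settings; the value of connMaxInternodeNum and the node counts are assumptions chosen purely for illustration and must be replaced with the values of your installation.

```shell
#!/bin/sh
# Worst-case RDMA memory per node, following the sizing formulas above.
buf_size=8192       # connRDMABufSize (documented default)
buf_num=70          # connRDMABufNum (documented default)
max_internode=12    # connMaxInternodeNum (assumed; check your config files)
servers=10          # hypothetical: metadata plus storage servers
clients=100         # hypothetical number of clients

per_conn=$(( buf_size * buf_num * 2 ))
server_bytes=$(( per_conn * max_internode * (servers + clients) ))
client_bytes=$(( per_conn * max_internode * servers ))

# Integer division, so the MiB figures are rounded down.
echo "per server: $server_bytes bytes (about $(( server_bytes / 1048576 )) MiB)"
echo "per client: $client_bytes bytes (about $(( client_bytes / 1048576 )) MiB)"
```

For these example numbers, each server should be able to pin roughly 1.4 GiB for RDMA, while each client needs about 131 MiB.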

In an active system, it can be expected that the maximum number of connections listed here is present for metadata servers and clients, while storage servers will not usually open connections to other storage servers. As such, the following numbers of RDMA connections can be assumed on an average system:

From node    Number of connections

Metadata     connMaxInternodeNum * (#clients + #meta + #storage)
Storage      connMaxInternodeNum * #clients
Client       connMaxInternodeNum * (#meta + #storage)
Other        0
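The connection counts listed above can be evaluated in the same way. All numbers in this sketch (connMaxInternodeNum and the node counts) are hypothetical and only serve to show how the expressions are applied.

```shell
#!/bin/sh
# Expected RDMA connections per node type on an average system.
max_internode=12    # connMaxInternodeNum (assumed; check your config files)
meta=2              # hypothetical number of metadata servers
storage=8           # hypothetical number of storage servers
clients=100         # hypothetical number of clients

meta_conns=$(( max_internode * (clients + meta + storage) ))
storage_conns=$(( max_internode * clients ))
client_conns=$(( max_internode * (meta + storage) ))

echo "metadata server: $meta_conns connections"
echo "storage server:  $storage_conns connections"
echo "client:          $client_conns connections"
```

For this example, a metadata server holds 1320 connections, a storage server 1200, and a client 120. Multiplying each count by the per-connection memory gives the expected RDMA memory footprint per node type.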

Intel/QLogic TrueScale

Adjust the RDMA buffer parameters in /etc/beegfs/beegfs-client.conf:

  • connRDMABufNum = 12

  • connRDMABufSize = 65536 (64 KiB)

Install the additional libipathverbs package.

The ib_qib module needs to be tuned at least on the server side. Add the following line to either /etc/modprobe.conf or /etc/modprobe.d/ib_qib.conf:

options ib_qib singleport=1 krcvqs=4 rcvhdrcnt=4096

The optimal value of krcvqs depends on the number of CPU cores. This value reserves the given number of receive queues for ibverbs. Please see Intel/QLogic OFED release notes for more details.

On large clusters, you might need to adapt parameters on the servers to accept a higher number of incoming RDMA connections. For example, append the following driver options to the ib_qib options line (separated by spaces, like the other options on that line):

options ib_qib singleport=1 krcvqs=4 rcvhdrcnt=4096 lkey_table_size=18 max_qp_wrs=131072 max_qps=131072 qp_table_size=2048

Then also increase the map count (use sysctl to make this change persistent):

echo 1000000 > /proc/sys/vm/max_map_count

And increase the maximum number of file handles (use /etc/security/limits.conf to make this change persistent):

ulimit -n 262144
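The two commands above only last until the next reboot or logout. One possible way to make them persistent is sketched below, using drop-in files for sysctl and pam_limits. The file name 90-beegfs-rdma.conf and the ETC_ROOT staging prefix are assumptions for illustration; by default the files are written to a preview directory so they can be inspected first.

```shell
#!/bin/sh
# Sketch: stage persistent versions of the map count and file handle limits.
# ETC_ROOT is a hypothetical prefix; keep the default to preview the files,
# or run as root with ETC_ROOT=/etc to install them.
ETC_ROOT="${ETC_ROOT:-./etc-preview}"
mkdir -p "$ETC_ROOT/sysctl.d" "$ETC_ROOT/security/limits.d"

# Read by sysctl at boot; apply immediately with "sysctl --system" as root.
printf '%s\n' 'vm.max_map_count = 1000000' \
    > "$ETC_ROOT/sysctl.d/90-beegfs-rdma.conf"

# Applied by pam_limits at the next login; format: domain type item value.
printf '%s\n' '* soft nofile 262144' '* hard nofile 262144' \
    > "$ETC_ROOT/security/limits.d/90-beegfs-rdma.conf"

echo "wrote drop-in files below $ETC_ROOT"
```

The same two limits lines can also be added directly to /etc/security/limits.conf instead of using a drop-in file.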

Intel Omni-Path

Adjust the RDMA buffer parameters in /etc/beegfs/beegfs-client.conf:

  • connRDMABufNum = 12

  • connRDMABufSize = 65536 (64 KiB)

Intel Omni-Path provides a mode called “Accelerated RDMA” to improve performance of large transfers, which is off by default. See Intel Omni-Path Performance Tuning Guide chapter “Accelerated RDMA” for information on how to enable this mode.

Mellanox InfiniBand

On large clusters, you might need to set the log_mtts_per_seg and log_num_mtt options of the mlx4_core driver to allow a higher number of RDMA connections. These options are typically set in /etc/modprobe.d/mlx4_core.conf.

The default settings for connRDMABufSize and connRDMABufNum are fine for Mellanox DDR, QDR, and FDR.

For EDR use the following:

  • connRDMABufNum = 22

  • connRDMABufSize = 32768 (32 KiB)

Additional Notes

In an RDMA-capable cluster, some BeeGFS communication (especially communication with the management service, which is not performance-critical) uses TCP/IP transfer. On some systems, the default “connected” IP-over-IB mode of InfiniBand and Omni-Path does not seem to work well and results in spurious problems. In this case, you should try to switch the IPoIB mode to “datagram” on all hosts:

$ echo datagram > /sys/class/net/ibX/mode

See also