Network Tuning

Network Interfaces

By default, BeeGFS clients and servers try to connect via any available network interface that supports TCP or OFED ibverbs (preferring ibverbs interfaces) and switch to a different one if the primary network fails. If you do not want this behavior, use the configuration options connInterfacesFile and connNetFilterFile to allow only specific interfaces or to connect only to specific IP address ranges. The order of the entries in the interfaces file also sets their priority: interfaces listed earlier are preferred. For example, to prefer eth1 over eth0, put eth1 on the first line and eth0 on the second line.
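As a minimal sketch of the priority example above, the following writes an interfaces file that prefers eth1 over eth0 and a net filter file that restricts connections to one subnet. The directory, the file names, and the 192.168.10.0/24 range are hypothetical, and the one-subnet-per-line CIDR format of the net filter file is an assumption; check the comments in your BeeGFS configuration files for the exact format.

```shell
# Hypothetical output directory; on a real system these files would
# typically live under /etc/beegfs.
conf_dir="${CONF_DIR:-./beegfs-example}"
mkdir -p "$conf_dir"

# One interface per line; earlier lines have higher priority.
printf '%s\n' eth1 eth0 > "$conf_dir/interfaces.conf"

# Assumed format: one allowed subnet per line in CIDR notation.
printf '%s\n' 192.168.10.0/24 > "$conf_dir/netfilter.conf"

# The service configuration would then point at these files, e.g.:
#   connInterfacesFile = /etc/beegfs/interfaces.conf
#   connNetFilterFile  = /etc/beegfs/netfilter.conf
```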

Ethernet and TCP

Network tuning can be applied to clients and servers. It is often based on adjustments to the common TCP settings. The most important settings here include the TCP window scaling, buffer sizes, and timestamps. These settings can be changed by writing to the files in /proc/sys/net/ipv4 or by using sysctl. Network hardware providers typically list a number of recommended tuning settings on their websites. Your distribution might also come with different TCP implementations, which provide better optimizations for high-speed networks than the standard TCP implementation.
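Before changing anything, it helps to record the current values. The sketch below reads four standard Linux TCP settings from /proc/sys/net/ipv4; the choice of these four keys is an illustration, not a list of BeeGFS recommendations.

```shell
# Print the current value of a few TCP settings, one per line.
# A key that is not readable on this system is reported as such.
show_tcp_settings() {
    for key in tcp_window_scaling tcp_timestamps tcp_rmem tcp_wmem; do
        file="/proc/sys/net/ipv4/$key"
        if [ -r "$file" ]; then
            printf 'net.ipv4.%s = %s\n' "$key" "$(cat "$file")"
        else
            printf 'net.ipv4.%s = (not available)\n' "$key"
        fi
    done
}

show_tcp_settings

# To change a value at runtime (as root), for example:
#   sysctl -w net.ipv4.tcp_window_scaling=1
```

Values set with sysctl -w are lost on reboot; add them to /etc/sysctl.conf to make them persistent.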

For Ethernet, it is also very important to enable send and receive flow-control on the network cards (e.g., with ethtool) and on the switch. You also need to disable broadcast or storm control settings on your Ethernet switch to make sure they don’t interfere with the highly concurrent parallel file streams.
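On the host side, flow control can be enabled with the ethtool pause options. The following sketch only prints the commands by default; the interface names eth0 and eth1 and the DRY_RUN variable are hypothetical.

```shell
# DRY_RUN defaults to 1 so that nothing is changed unless requested.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Enable receive and transmit pause frames on one interface.
enable_flow_control() {
    run ethtool -A "$1" rx on tx on
}

# Print the commands for two placeholder interfaces.
for nic in eth0 eth1; do
    enable_flow_control "$nic"
done
```

Running the script as root with DRY_RUN=0 applies the settings, and ethtool -a eth0 shows the current pause parameters. The switch ports must be configured separately.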

In some Ethernet networks, TSO (TCP segmentation offload) and GSO (generic segmentation offload) can also cause problems with parallel file streams, such as significantly decreased throughput. Both of them can be disabled with ethtool, but you should only disable them if you have verified that they are causing problems.
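To check the current offload state before deciding whether to disable anything, the output of ethtool -k can be filtered. This is a sketch that runs on sample text so that it works without a real network card; the interface name eth0 is a placeholder.

```shell
# Extract the state ("on" or "off") of one feature from `ethtool -k`
# output read on standard input.
offload_state() {
    awk -v feature="$1:" '$1 == feature { print $2 }'
}

# Sample of the relevant output lines, used here instead of a real NIC.
sample='tcp-segmentation-offload: on
generic-segmentation-offload: off'
printf '%s\n' "$sample" | offload_state tcp-segmentation-offload

# On a real system:
#   ethtool -k eth0 | offload_state generic-segmentation-offload
#   ethtool -K eth0 tso off gso off    # only after verifying a problem
```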

Enabling Jumbo Frames is another configuration that should be considered for Ethernet networks. It increases the amount of data carried by each Ethernet frame and therefore helps BeeGFS achieve higher throughput. However, this configuration requires all elements on the routes between BeeGFS services (i.e., switches, routers, and network cards) to be configured to accept the larger frames.
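One way to verify that the whole path accepts larger frames is to send a ping that must not be fragmented. The sketch below computes the matching payload size; the MTU of 9000 is a commonly used jumbo frame size and is an assumption here, as are the names eth0 and storage01.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
ping_payload() {
    echo $(( $1 - 28 ))
}

mtu=9000   # assumed jumbo frame size
echo "payload for MTU $mtu: $(ping_payload "$mtu")"

# On a real system (as root for the first command):
#   ip link set dev eth0 mtu 9000
#   ping -M do -c 3 -s 8972 storage01   # -M do forbids fragmentation
```

If the ping fails with the large payload but succeeds with a small one, some element on the path is not configured for the larger frames.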

Neighbor Table Sizes for Address Resolution (ARP)

In networks with a large number of interfaces, the default neighbor table sizes for ARP (Address Resolution Protocol) lookups are typically too small, causing a “Neighbour table overflow” error. This is especially relevant for the BeeGFS management service, which communicates with all other BeeGFS services and will shut down in such cases due to a critical communication error that prevents it from monitoring registered BeeGFS services correctly.

To prevent a neighbor table overflow, raise the threshold value net.ipv4.neigh.default.gc_thresh1 in the file /etc/sysctl.conf (or use the sysctl tool). Its default value of 128 must be raised if the system has more than 128 interfaces that can be used by BeeGFS. This is the case, for example, for a system composed of 129 nodes with 1 network interface (IP address) each, 65 nodes with 2 interfaces each, or 43 nodes with 3 interfaces each.

In other words, the gc_thresh1 value should be higher than the number of all IPs that are used by BeeGFS. So, if you have 200 clients with 2 IP addresses each and 10 servers with 3 IP addresses each, then gc_thresh1 should be at least 200 * 2 + 10 * 3 = 430. Since there might also be communication between BeeGFS hosts and other machines from the network or the Internet, it is a good idea to round the value up, e.g., to 512 in this example.

In addition, the other gc_thresh threshold values should also be raised. You could double them, e.g., setting gc_thresh2=1024 and gc_thresh3=2048.
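The arithmetic above can be sketched as a small script that prints lines suitable for /etc/sysctl.conf. Rounding up to the next power of two and making each threshold twice the previous one are assumptions chosen for illustration; any sufficiently large values work.

```shell
# Smallest power of two that is strictly greater than the argument.
next_pow2() {
    p=1
    while [ "$p" -le "$1" ]; do
        p=$((p * 2))
    done
    echo "$p"
}

clients=200; ips_per_client=2
servers=10;  ips_per_server=3

# Number of all IPs used by BeeGFS.
total_ips=$((clients * ips_per_client + servers * ips_per_server))

thresh1=$(next_pow2 "$total_ips")
thresh2=$((thresh1 * 2))
thresh3=$((thresh2 * 2))

echo "net.ipv4.neigh.default.gc_thresh1 = $thresh1"
echo "net.ipv4.neigh.default.gc_thresh2 = $thresh2"
echo "net.ipv4.neigh.default.gc_thresh3 = $thresh3"
```

The printed lines can be added to /etc/sysctl.conf and loaded with sysctl -p.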

From the ARP man page:

gc_thresh1 (since Linux 2.2)

The minimum number of entries to keep in the ARP cache. The garbage collector will not run if there are fewer than this number of entries in the cache. Defaults to 128.

gc_thresh2 (since Linux 2.2)

The soft maximum number of entries to keep in the ARP cache. The garbage collector will allow the number of entries to exceed this for 5 seconds before collection will be performed. Defaults to 512.

gc_thresh3 (since Linux 2.2)

The hard maximum number of entries to keep in the ARP cache. The garbage collector will always run if there are more than this number of entries in the cache. Defaults to 1024.

Finally, as this problem affects any process that communicates with too many hosts, you might want to increase these threshold values on other machines as well, not only on the machine running beegfs-mgmtd.

Firewalls / Network Address Translation (NAT)

TCP connections are only established from clients to servers or between servers, but never from a server to a client.

TCP ports used by the services can be found in the corresponding configuration files (/etc/beegfs/beegfs-...conf) or by querying the management service, e.g., for management service ports:

$ beegfs-ctl --listnodes --nodetype=management --nicdetails

All BeeGFS services use fixed TCP ports. The only exceptions are the beegfs-ctl and beegfs-fsck tools. In general, beegfs-ctl does not need to be able to run on the compute nodes, but it is helpful for users if it can, e.g., to check statistics (beegfs-ctl --userstats) or to query quota information (beegfs-ctl --getquota).

By default, the client also establishes TCP connections to a userspace helper service (beegfs-helperd), usually running on the same machine, for DNS lookups and logging.

These are the default TCP/UDP port numbers of the BeeGFS services:

Service      Binary           TCP    UDP
Management   beegfs-mgmtd     8008   8008
Metadata     beegfs-meta      8005   8005
Storage      beegfs-storage   8003   8003
Client       beegfs-client    8004   8004
Helper       beegfs-helperd   8006   -

In general, it is not required that all BeeGFS services of the same type use the same TCP port. For example, some metadata services can use port 8005 while other metadata services that connect to the same management service, and are thus part of the same file system namespace, use different ports. However, by default, all services of the same type use the same port.
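As a sketch of how the default ports translate into firewall rules, the following prints iptables commands that accept inbound traffic on the server ports. It assumes that iptables is the firewall in use and that the services run on their default ports; nothing is applied, and each host only needs the rules for the services it actually runs.

```shell
# Print iptables rules that accept inbound traffic on one port
# for both TCP and UDP. The rules are only printed, not applied.
allow_port() {
    for proto in tcp udp; do
        echo "iptables -A INPUT -p $proto --dport $1 -j ACCEPT"
    done
}

# Default ports of the server-side services listed above.
allow_port 8008   # beegfs-mgmtd
allow_port 8005   # beegfs-meta
allow_port 8003   # beegfs-storage
```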