Network Tuning¶
Network Interfaces¶
By default, BeeGFS clients and servers try to connect via any available
network interface that supports TCP or OFED ibverbs (preferring ibverbs
interfaces) and switch to a different one if the primary network fails. If you do not want to
allow this behavior, the configuration options connInterfacesFile and connNetFilterFile can be
used to allow only specific interfaces or to connect only to specific IP address ranges. By
listing an allowed interface higher than the others in the interfaces file, you can also assign
priority to specific interfaces. For example, to prefer eth1 over eth0, put eth1 on the first
line and eth0 on the second line.
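As a minimal sketch of the two files described above, the following snippet writes an interfaces file that prefers eth1 over eth0 and a net filter file that allows a single subnet. The scratch directory, the file names, and the subnet 192.168.10.0/24 are placeholders chosen for illustration. The one-entry-per-line layout and the CIDR notation are assumptions that should be checked against the comments in your BeeGFS configuration files.

```shell
# Scratch directory for illustration; on a real system the files would
# typically be placed next to the BeeGFS configuration files (assumption).
conf_dir="${BEEGFS_SKETCH_DIR:-$(mktemp -d)}"

# Interfaces file: one interface name per line, highest priority first.
printf '%s\n' eth1 eth0 > "$conf_dir/connInterfacesFile"

# Net filter file: one allowed IP range per line (placeholder subnet).
printf '%s\n' 192.168.10.0/24 > "$conf_dir/connNetFilterFile"

# Print the lines that would go into the service or client configuration file.
echo "connInterfacesFile = $conf_dir/connInterfacesFile"
echo "connNetFilterFile  = $conf_dir/connNetFilterFile"
```

The affected services usually have to be restarted before a changed configuration takes effect.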
Ethernet and TCP¶
Network tuning can be applied to clients and servers. It is often based on adjustments to the common
TCP settings. The most important settings here include TCP window scaling, buffer sizes, and
timestamps. These settings can be changed by writing to the files in /proc/sys/net/ipv4 or by using
sysctl. Network hardware providers typically list a number of recommended tuning settings on their
websites. Your distribution might also come with different TCP implementations, which provide
better optimizations for high-speed networks than the standard TCP implementation.
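Before changing anything, it can help to look at the current values. The sketch below reads a few tunables from /proc/sys/net/ipv4. Mapping "window scaling, buffer sizes, and timestamps" to the files tcp_window_scaling, tcp_rmem, tcp_wmem, and tcp_timestamps is an assumption based on the standard Linux names. The helper function show_tcp_setting and the variable PROC_NET_IPV4 are hypothetical names introduced for this example.

```shell
# Directory holding the IPv4 TCP tunables; can be overridden for testing.
proc_dir="${PROC_NET_IPV4:-/proc/sys/net/ipv4}"

# Print "name = value" for one tunable, or a note if it cannot be read.
show_tcp_setting() {
    if [ -r "$proc_dir/$1" ]; then
        # Multi-value files such as tcp_rmem are tab-separated.
        echo "$1 = $(tr '\t' ' ' < "$proc_dir/$1")"
    else
        echo "$1 = (not readable on this system)"
    fi
}

for name in tcp_window_scaling tcp_timestamps tcp_rmem tcp_wmem; do
    show_tcp_setting "$name"
done

# Changing a value requires root, for example (the value is a placeholder):
#   sysctl -w net.ipv4.tcp_window_scaling=1
#   echo 1 > /proc/sys/net/ipv4/tcp_window_scaling
```

Values set this way are lost on reboot unless they are also added to /etc/sysctl.conf.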
For Ethernet, it is also very important to enable send and receive flow-control on the network cards
(e.g., with ethtool) and on the switch. You also need to disable broadcast or storm control settings
on your Ethernet switch to make sure they don’t interfere with the highly concurrent parallel file
streams.
In some Ethernet networks, TSO (TCP segmentation offload) and GSO (generic segmentation offload) can
also cause problems with parallel file streams, such as significantly decreased throughput. Both of
them can be disabled with ethtool, but you should only disable them if you have verified that they
are causing problems.
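The flow-control and offload settings mentioned above can be inspected and changed with ethtool roughly as follows. This is a sketch, not a recommended configuration: the interface name eth0 is a placeholder, and the run helper and the APPLY and DISABLE_OFFLOAD variables are hypothetical names used here so that the commands are only printed unless explicitly requested.

```shell
# Interface name is a placeholder; override it with IFACE=<name>.
iface="${IFACE:-eth0}"

# Print each command by default; set APPLY=1 (as root) to execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Show, then enable, send and receive flow control (pause frames).
run ethtool -a "$iface"
run ethtool -A "$iface" rx on tx on

# Show the current offload settings.
run ethtool -k "$iface"

# Disable TSO and GSO only after verifying that they cause problems.
if [ "${DISABLE_OFFLOAD:-0}" = "1" ]; then
    run ethtool -K "$iface" tso off gso off
fi
```

Not every network card or driver supports every option, so ethtool may reject some of these settings.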
The activation of Jumbo Frames is another configuration that should be considered for Ethernet networks. It increases the amount of data carried by each Ethernet frame and therefore helps BeeGFS achieve higher throughput. However, this configuration requires all elements of the routes between BeeGFS services (i.e., switches, routers, network cards) to be configured to accept the larger frames.
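A quick way to check that every hop accepts the larger frames is to send a ping that must not be fragmented. The sketch below assumes an MTU of 9000 bytes, which is a common jumbo frame size but is not specified here; the interface name and the peer address are placeholders as well. The commands are printed rather than executed, because changing the MTU requires root.

```shell
# Placeholder values; adjust them to the local network.
iface="${IFACE:-eth0}"
mtu="${MTU:-9000}"
peer="${PEER:-192.168.10.2}"

# Largest ICMP payload that fits in one frame:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
payload=$((mtu - 28))

# Command to set the MTU on each host (run as root).
echo "ip link set dev $iface mtu $mtu"

# Path check: -M do forbids fragmentation, so the ping only succeeds
# if every hop accepts frames of the configured size.
echo "ping -M do -c 3 -s $payload $peer"
```

If the ping fails with the large payload but works with the default size, at least one device on the path is still using a smaller MTU.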
Neighbor Table Sizes for Address Resolution (ARP)¶
In networks with a large number of interfaces, the default neighbor table sizes for ARP (Address Resolution Protocol) lookups are typically too small, causing a “Neighbour table overflow” error. This is especially relevant for the BeeGFS management service, which communicates with all other BeeGFS services and will shut down in such cases due to a critical communication error that prevents it from monitoring registered BeeGFS services correctly.
To prevent a neighbor table overflow, raise the threshold value net.ipv4.neigh.default.gc_thresh1
in the file /etc/sysctl.conf (or use the sysctl tool). Its default value of 128 must be raised if
the system has more than 128 interfaces that can be used by BeeGFS. Examples are a system composed
of 129 nodes with 1 network interface (IP address) each, 65 nodes with 2 interfaces each, or
43 nodes with 3 interfaces each.
In other words, the gc_thresh1 value should be higher than the number of all IPs that are used by
BeeGFS. So, if you have 200 clients with 2 IP addresses each and 10 servers with 3 IP addresses
each, then gc_thresh1 should be at least 200 * 2 + 10 * 3 = 430. Since there might also be
communication between BeeGFS hosts and other machines from the network or the Internet, it would be
a good idea to round up the value, e.g., to something like 512 in this example.
In addition, the other gc_thresh threshold values should also be raised. You could double them,
e.g., setting gc_thresh2=1024 and gc_thresh3=2048.
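The sizing rule above can be worked through in a few lines of shell arithmetic. The sketch uses the numbers from the example (200 clients with 2 IP addresses, 10 servers with 3 IP addresses). Rounding up to the next power of two is only one possible way to add headroom and is an assumption of this example, not a BeeGFS requirement.

```shell
# Example numbers from the text above.
clients=200; client_ips=2
servers=10;  server_ips=3

# Number of IP addresses used by BeeGFS; gc_thresh1 must be higher than this.
needed=$((clients * client_ips + servers * server_ips))

# Round up to the next power of two strictly above the required number.
gc1=1
while [ "$gc1" -le "$needed" ]; do
    gc1=$((gc1 * 2))
done

# Double the value for each of the higher thresholds.
gc2=$((gc1 * 2))
gc3=$((gc2 * 2))

# Lines for /etc/sysctl.conf; load them as root with: sysctl -p
echo "net.ipv4.neigh.default.gc_thresh1 = $gc1"
echo "net.ipv4.neigh.default.gc_thresh2 = $gc2"
echo "net.ipv4.neigh.default.gc_thresh3 = $gc3"
```

With these inputs, the script prints the threshold values 512, 1024, and 2048.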
From the ARP man page:
- gc_thresh1 (since Linux 2.2)
The minimum number of entries to keep in the ARP cache. The garbage collector will not run if there are fewer than this number of entries in the cache. Defaults to 128.
- gc_thresh2 (since Linux 2.2)
The soft maximum number of entries to keep in the ARP cache. The garbage collector will allow the number of entries to exceed this for 5 seconds before collection will be performed. Defaults to 512.
- gc_thresh3 (since Linux 2.2)
The hard maximum number of entries to keep in the ARP cache. The garbage collector will always run if there are more than this number of entries in the cache. Defaults to 1024.
Finally, as this problem affects any process that communicates with too many hosts, you might want
to increase these threshold values on other machines as well, not only on the machine running
beegfs-mgmtd.
Firewalls / Network Address Translation (NAT)¶
TCP connections are only established from clients to servers or between servers, but never from a server to a client.
TCP ports used by the services can be found in the corresponding configuration files
(/etc/beegfs/beegfs-...conf) or by querying the management service, e.g., for management service
ports:
$ beegfs-ctl --listnodes --nodetype=management --nicdetails
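To look up the ports in the configuration files instead, a simple search is usually enough. The sketch below lists every uncommented setting whose key contains "Port" in the BeeGFS configuration files of a directory. The key pattern is an assumption about how the port options are named, and list_port_settings and BEEGFS_CONF_DIR are hypothetical names introduced for this example.

```shell
# Directory with the BeeGFS configuration files.
conf_dir="${BEEGFS_CONF_DIR:-/etc/beegfs}"

# List every uncommented setting whose key contains "Port".
# The key pattern is an assumption about how port options are named.
list_port_settings() {
    for f in "$1"/beegfs-*.conf; do
        # Skip unreadable files (and the unexpanded pattern if nothing matches).
        [ -r "$f" ] || continue
        grep -H -E '^[[:space:]]*[A-Za-z]*Port[A-Za-z]*[[:space:]]*=' "$f"
    done
    return 0
}

list_port_settings "$conf_dir"
```

Each output line is prefixed with the name of the file in which the setting was found.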
All BeeGFS services use fixed TCP ports. The only exceptions are the beegfs-ctl and beegfs-fsck
tools.
In general, it is not required that beegfs-ctl can run on the compute nodes, but it is helpful
for users if this is possible, e.g., to be able to check statistics (beegfs-ctl --userstats) or
to query quota information (beegfs-ctl --getquota).
By default, the client also establishes TCP connections to a userspace helper service
(beegfs-helperd), usually running on the same machine, for DNS lookups and logging.
These are the default TCP/UDP port numbers of the BeeGFS services:
Service | Binary | TCP | UDP |
---|---|---|---|
Management | beegfs-mgmtd | 8008 | 8008 |
Metadata | beegfs-meta | 8005 | 8005 |
Storage | beegfs-storage | 8003 | 8003 |
Client | beegfs-client | 8004 | 8004 |
Helper | beegfs-helperd | 8006 | – |
In general, it is not required that all BeeGFS services of the same type use the same TCP port. For example, some metadata services can use port 8005, while other metadata services that connect to the same management service, and are thus part of the same file system namespace, use different ports. However, by default, all services of the same type use the same port.
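When a firewall runs on the BeeGFS hosts, the default ports from the table have to be opened on the machines that run the corresponding service. The following sketch prints matching firewalld commands for one service type. firewalld is used only as an example and is an assumption about the environment; print_firewall_rules is a hypothetical helper name, and the ports must be adapted if the defaults were changed.

```shell
# Print firewall-cmd lines that open the default ports of one service type.
# Port numbers are the defaults from the table above.
print_firewall_rules() {
    case "$1" in
        mgmtd)   tcp=8008; udp=8008 ;;
        meta)    tcp=8005; udp=8005 ;;
        storage) tcp=8003; udp=8003 ;;
        client)  tcp=8004; udp=8004 ;;
        helperd) tcp=8006; udp="" ;;   # the helper service has no UDP port
        *) echo "unknown service: $1" >&2; return 1 ;;
    esac
    echo "firewall-cmd --permanent --add-port=$tcp/tcp"
    [ -z "$udp" ] || echo "firewall-cmd --permanent --add-port=$udp/udp"
    echo "firewall-cmd --reload"
}

# Example: rules for a host that runs a metadata service.
print_firewall_rules meta
```

The printed commands have to be run as root on the respective host; review them before applying.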