General Questions¶
‘Access denied’ error on the client, even with correct permissions¶
Please check if you have SELinux enabled on the client machine. If it is enabled, disabling it should solve your problem. SELinux can be disabled by setting SELINUX=disabled in the configuration file /etc/selinux/config. Afterwards, you might need to reboot your client for the new setting to become effective.
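One way to make that change from the shell is with sed. The snippet below operates on a throwaway copy of the file purely for illustration; on a real client you would edit /etc/selinux/config itself (as root) and then reboot:

```shell
# Demonstration on a throwaway copy; on a real client, edit
# /etc/selinux/config directly (as root) and then reboot.
cfg=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$cfg"   # sample content

# Switch whatever mode is currently set (enforcing/permissive) to disabled
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$cfg"

grep '^SELINUX=' "$cfg"   # -> SELINUX=disabled
```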
Client refuses to mount because of an ‘unknown storage target’¶
Scenario¶
While testing BeeGFS, you removed the storage directory of a storage server, but kept the storage directory of the management server. Now the BeeGFS client refuses to mount and prints an error about an unknown storage target to the log file.
What happened to your file system¶
When you start a new beegfs-storage daemon with a given storage directory, the daemon initializes this directory by assigning an ID to the storage target path and registering this target ID at the management server. When you delete this directory, the storage server creates a new directory on the next startup with a new ID and registers that ID at the management server as well. (The storage server cannot know what happened to the old directory, e.g. whether you just moved the data to another machine, so it has to assign a new ID.)
When the client starts, it performs a sanity check by querying all registered target IDs from the management server and checks whether all of them are accessible. If you removed a storage directory, this check fails and thus the client refuses to mount. (Note: This sanity check can be disabled, but it is definitely a good thing in this case and saves you from more trouble.)
Now you have two alternative options:
Solution A¶
Simply remove the storage directories of all BeeGFS services to start with a clean new file system:
Stop all the BeeGFS server daemons, i.e. beegfs-mgmtd, beegfs-meta, beegfs-storage:
# systemctl stop beegfs\*
Delete (rm -rf) all their storage directories. The paths to the server storage directories can be looked up in the server configuration files:
storeMgmtdDirectory in configuration file /etc/beegfs/beegfs-mgmtd.conf
storeMetaDirectory in configuration file /etc/beegfs/beegfs-meta.conf
storeStorageDirectory in configuration file /etc/beegfs/beegfs-storage.conf
Restart the daemons.
# systemctl start beegfs\*
Now you have a fresh new file system without any of the previously registered target IDs.
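The steps of Solution A can be sketched as a dry-run script. It only prints the commands it would run, so nothing is deleted until you remove the echo; the config keys and file paths are the ones listed above, and the "key = value" parsing is an assumption about the usual BeeGFS config file format:

```shell
# Dry-run sketch of Solution A: prints the commands instead of running them.
# get_dir extracts the value of a "key = value" line from a config file.
get_dir() {  # usage: get_dir <key> <conffile>
    awk -F'=' -v k="$1" '$1 ~ k { gsub(/[ \t]/, "", $2); print $2 }' "$2"
}

wipe() {  # print the removal command for the directory named by <key> in <file>
    dir=$(get_dir "$1" "$2")
    [ -n "$dir" ] && echo "rm -rf $dir"
}

echo "systemctl stop beegfs\\*"
wipe storeMgmtdDirectory   /etc/beegfs/beegfs-mgmtd.conf
wipe storeMetaDirectory    /etc/beegfs/beegfs-meta.conf
wipe storeStorageDirectory /etc/beegfs/beegfs-storage.conf
echo "systemctl start beegfs\\*"
```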
Solution B¶
Unregister the invalid target ID from the management server.
For this, you would first use the beegfs-ctl tool (part of the beegfs-utils package on a client) to list the registered target IDs:
$ beegfs-ctl --listtargets --longnodes
Then check the contents of the file targetNumID in your storage directory on the storage server to find out which target ID is the current one that you want to keep.
For all other target IDs in the list that are assigned to this storage server but no longer valid, use this command to unregister them from the management daemon:
$ beegfs-ctl --unmaptarget <targetID>
Afterwards, your client will no longer complain about the missing storage targets.
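Selecting the stale IDs can be scripted. The listing below is an assumed example of what the beegfs-ctl output might look like (the real column layout can differ, so adjust the awk filter accordingly), and the ID to keep is the one read from the targetNumID file:

```shell
# Sketch: collect the target IDs that should be unmapped.
listing='TargetID   NodeID
========   ======
     101        1
     102        1'   # assumed sample output of --listtargets --longnodes

keep=102             # the ID found in the targetNumID file

# every numeric TargetID except the one to keep
stale=$(printf '%s\n' "$listing" | awk -v keep="$keep" \
    '$1 ~ /^[0-9]+$/ && $1 != keep { print $1 }')

for t in $stale; do
    echo "beegfs-ctl --unmaptarget $t"   # drop "echo" to actually unmap
done
```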
Note
There are options in the server config files to disallow initialization of new storage directories and registration of new servers or targets. They are not set by default, but should be set for production environments. See storeAllowFirstRunInit and sysAllowNewServers.
Too many open files on beegfs-storage server¶
This usually happens when a user application leaks open files, e.g. it creates a lot of files and forgets to close them due to a bug in the application. (Note that open files will automatically be closed by the kernel when an application ends, so this problem is usually temporary.)
There are per-process limits and system-wide limits (accounting for all processes on a machine together) to control how many files can be kept open at the same time on a host.
To prevent applications from opening too many files at once and to make sure that such application problems do not affect the servers, it makes sense to reduce the per-process limit for normal applications of normal users to a reasonably low value, e.g. 1024 via the nofile setting in /etc/security/limits.conf.
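With the syntax of limits.conf, such a cap could look like the line below (the value 1024 is the example from above; note that the * wildcard does not apply to the root user):

```
# /etc/security/limits.conf: lower per-process open-file cap for normal users
*    hard    nofile    1024
```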
If your applications actually need to open a lot of files at the same time and you need to raise the limit for the beegfs-storage service, here are the steps to do this:
You can check the current limit for the maximum number of open files through the /proc file system, e.g. for running beegfs-storage processes on a machine:
$ for i in `pidof beegfs-storage`; do cat /proc/$i/limits | grep open; done
Max open files            50000                50000                files
The beegfs-storage and beegfs-meta processes can try to increase their own limits through the configuration option tuneProcessFDLimit, but this will be subject to the hard limits that were defined for the system. If the beegfs-storage service fails to increase its own limit, it will print a message line to its log file (/var/log/beegfs-storage.log). Set the following in /etc/beegfs/beegfs-storage.conf to let the beegfs-storage service try to increase its own limit to 10 million files:
tuneProcessFDLimit=10000000
You can increase the system-wide limits (the limits that account for all processes together) to 20 million at runtime by using the following commands:
# sysctl -w fs.file-max=20000000
fs.file-max = 20000000
# sysctl -w fs.nr_open=20000000
fs.nr_open = 20000000
Make the changes from the previous step persistent across reboots by adding the following lines to /etc/sysctl.conf (or a corresponding file in the subdirectory /etc/sysctl.d):
fs.file-max = 20000000
fs.nr_open = 20000000
Add the following line to /etc/security/limits.conf (or a corresponding file in the subdirectory /etc/security/limits.d) to increase the per-process limit to 10 million. If this server is not only used for BeeGFS, but also for other applications, you might want to set this only for processes owned by root.
* - nofile 10000000
Now you need to close your current shell and open a new shell on the system to make the new settings effective. You can then restart the beegfs-storage process from the new shell and look at its limits:
$ for i in `pidof beegfs-storage`; do cat /proc/$i/limits | grep open; done
Max open files            10000000             10000000             files
What needs to be done when a server hostname has changed¶
Scenario: hostname or $HOSTNAME reports a different name than during the BeeGFS installation, and the BeeGFS servers refuse to start up. The logs state that the nodeID has changed and that startup was therefore refused.
Note that by default, node IDs are generated based on the hostname of a server. As IDs are not allowed to change, see here for information on how to manually set your ID back to the previous value: Setting node or target IDs.
Change log level during runtime¶
You can use beegfs-ctl to change the log level of any service. As in the config files, the levels range from 1 to 5, with 5 being the most verbose.
$ beegfs-ctl --genericdebug --nodetype=meta --nodeid=1 "setloglevel 5"
Client won’t unmount due to filesystem being busy¶
Run the command below to identify which processes are keeping the mount point busy.
$ lsof /mnt/beegfs
If any processes are using the mount point, you will see a list like the following:
COMMAND  PID    USER  FD   TYPE  DEVICE  SIZE/OFF  NODE  NAME
bash     21354  john  cwd  DIR   0,18    1         2     /mnt/beegfs
bash     21355  mary  cwd  DIR   0,18    1         2     /mnt/beegfs
Terminate the processes you found. If that doesn't work, use the following command to kill them:
$ fuser -k /mnt/beegfs
Wait 5 seconds.
Retry the unmount / client stop.
If you still get the same error message, you might still be able to identify the processes by running lsof (without a path argument) and checking the referenced paths. If a process is holding a handle on a path like /mnt/beegfs/mydir1, that path will appear in the new list as /mydir1 (the former mount point removed from the path).
What happens with BeeGFS in a split brain scenario?¶
In a split brain scenario, only the nodes of one system partition remain online: the partition where the management service is running. The nodes of the separated partition deny access to files until they are reconnected and can resume their communication with the management service. In practice, their services stall and only start producing I/O errors if the system remains split for too long.
This behavior is intentional, because in typical BeeGFS use cases, users cannot be allowed to access data that might be out of date, especially if write access were allowed in both partitions. In that case, the same file could be modified in both partitions while they were split, making it impossible to synchronize them later once the split was over.
Why do BeeGFS clients using RDMA log “beegfs: enabling unsafe global rkey” in the kernel?¶
Should I be concerned about this message?
No, this message is expected behavior on BeeGFS clients utilizing Remote Direct Memory Access (RDMA) when the kernel does not support “DMA keys” and only supports global rkeys.
What does this message mean?
Remote Direct Memory Access (RDMA) allows data to be transferred directly between the memory of two computers on a network without involving the processor, cache, or operating system of either computer. In RDMA, a ‘remote key’ or ‘rkey’ is used to permit peer RDMA READ/WRITE access to a specific region of memory that is registered with a protection domain using the shared rkey. This message indicates that the Linux kernel has logged the creation of a protection domain with the global rkey option enabled, which is a common practice in Linux RDMA software modules to optimize performance for small I/O requests. The ‘unsafe’ in the message merely highlights that the usage is not the default secure option but doesn’t necessarily imply any vulnerability or risk, especially since BeeGFS employs stringent security practices to mitigate any potential risks associated with using global rkey.
In BeeGFS, each connection has a separate protection domain. This means that when BeeGFS uses a global rkey, it still only provides access to memory registered for a particular connection, which greatly reduces the exposure of the global rkey while optimizing RDMA performance. It is also strongly recommended to enable Connection Based Authentication to prevent unauthorized clients or servers from interacting with the file system.
How can I get rid of this message?
Because this warning is logged by the kernel, it is not possible for BeeGFS to suppress the message when global rkeys are used. One alternative to global rkeys is a “DMA key”, though this is not necessarily safer, because both provide access to all RDMA-registered memory of a particular protection domain. However, opting for a “DMA key” will prevent the logging of “enabling unsafe global rkey”.
Historically, BeeGFS would automatically use the DMA key option if it was supported by the kernel, but support for DMA keys was removed in kernel 4.9 and later. DMA keys are now only supported when using the NVIDIA (Mellanox) OFED drivers. To allow users to continue using the “DMA key” option, BeeGFS 7.4.3 and 7.2.13 introduced a new client-side connRDMAKeyType option that can be set to “dma” to enable the use of DMA keys when that OFED is installed.
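Assuming the standard client config location, enabling the option looks like this (only effective with the NVIDIA OFED drivers installed, as described above):

```
# /etc/beegfs/beegfs-client.conf (BeeGFS 7.4.3 / 7.2.13 or later)
connRDMAKeyType = dma
```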
If you wish to use the inbox RDMA drivers with kernel 4.9 or later, the only way to suppress the message is to adjust the kernel log level to only log kernel errors and above. Please note that adjusting the kernel log level may result in other useful log messages being omitted, which can make future system troubleshooting difficult.
Loading the BeeGFS client module fails for “could not insert ‘beegfs’: Invalid argument”?¶
When attempting to load the BeeGFS client module, you may encounter an error that reads: “could not insert ‘beegfs’: Invalid argument”. This FAQ provides troubleshooting steps to resolve this issue. Begin by examining the output of dmesg leading up to the error and match it to one of the sections below.
beegfs: disagrees about version of symbol¶
The BeeGFS client module must be compiled to work with the exact combination of kernel and RDMA driver versions currently in use by each Linux installation it is used with. While various symbols might be flagged by the error, in nearly all cases this class of errors indicates the client module was built for a different combination of versions than is actively loaded on the system. A frequent scenario is when the NVIDIA (formerly Mellanox) OFED drivers are loaded, but the module was built for the inbox RDMA drivers provided by the kernel.
To correct:
Run ofed_info to check for the installation of the Mellanox OFED drivers.
If you are using the “beegfs-client” package, modify /etc/beegfs/beegfs-client-autobuild.conf to append OFED_INCLUDE_PATH=<PATH> to the buildArgs=-j8 line, modifying <PATH> to point to the appropriate headers (for detailed information, refer to the help comments in the file).
If you are using the “beegfs-dkms-client” package, refer to the documentation for the BeeGFS DKMS Client and Handling Third-party OFED Installations.
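As a sketch, the modified line in the autobuild config could look like the following; the include path shown is only a placeholder and must point to your actual OFED headers:

```
# /etc/beegfs/beegfs-client-autobuild.conf
buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/openib/include
```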