General Questions

‘Access denied’ error on the client, even with correct permissions

Please check if you have SELinux enabled on the client machine. If it is enabled, disabling it should solve your problem. SELinux can be disabled by setting SELINUX=disabled in the configuration file /etc/selinux/config. Afterwards, you might need to reboot your client for the new setting to take effect.
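For reference, the relevant setting in the config file looks like this; running `setenforce 0` as root switches the running system to permissive mode immediately, which is a quick way to verify that SELinux is the cause:

```
# /etc/selinux/config
SELINUX=disabled
```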

Client refuses to mount because of an ‘unknown storage target’

Scenario

While testing BeeGFS, you removed the storage directory of a storage server, but kept the storage directory of the management server. Now the BeeGFS client refuses to mount and prints an error about an unknown storage target to the log file.

What happened to your file system

When you start a new beegfs-storage daemon with a given storage directory, the daemon initializes this directory by assigning an ID to the storage target path and registering this target ID with the management server. When you delete this directory, the storage server creates a new directory on the next startup, with a new ID, and also registers that ID with the management server. (The storage server cannot know what happened to the old directory, e.g. whether you simply moved the data to another machine, so it has to assign a new ID here.)

When the client starts, it performs a sanity check by querying all registered target IDs from the management server and checks whether all of them are accessible. If you removed a storage directory, this check fails and thus the client refuses to mount. (Note: This sanity check can be disabled, but it is definitely a good thing in this case and saves you from more trouble.)
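If you nevertheless need to disable the check, the corresponding client config option is sysMountSanityCheckMS (an assumption based on the standard client configuration; setting it to 0 disables the mount-time check):

```
# /etc/beegfs/beegfs-client.conf — 0 disables the mount sanity check
# (not recommended in this situation):
sysMountSanityCheckMS = 0
```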

You now have two options:

Solution A

Simply remove the storage directories of all BeeGFS services to start with a clean new file system:

  1. Stop all the BeeGFS server daemons, i.e. beegfs-mgmtd, beegfs-meta, beegfs-storage:

    # systemctl stop beegfs\*
    
  2. Delete (rm -rf) all their storage directories. The paths to the server storage directories can be looked up in the server configuration files:

    • storeMgmtdDirectory in configuration file /etc/beegfs/beegfs-mgmtd.conf

    • storeMetaDirectory in configuration file /etc/beegfs/beegfs-meta.conf

    • storeStorageDirectory in configuration file /etc/beegfs/beegfs-storage.conf

  3. Restart the daemons.

    # systemctl start beegfs\*
    

Now you have a fresh new file system without any of the previously registered target IDs.

Solution B

Unregister the invalid target ID from the management server. First, use the beegfs-ctl tool (part of the beegfs-utils package on a client) to list the registered target IDs:

$ beegfs-ctl --listtargets --longnodes

Then check the contents of the file targetNumID in the storage directory on the storage server to find out which target ID is the current one that you want to keep. For all other target IDs in the list that are assigned to this storage server but no longer valid, use this command to unregister them from the management daemon:

$ beegfs-ctl --unmaptarget <targetID>

Afterwards, your client will no longer complain about the missing storage targets.
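For illustration, the check of the targetNumID file might look like this (the storage directory path and the ID value are examples):

```
$ cat /data/beegfs/storage/targetNumID
3
```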

Note

There are options in the server config files to disallow initialization of new storage directories and registration of new servers or targets. They are not set by default, but should be set for production environments. See storeAllowFirstRunInit and sysAllowNewServers.
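For example (the mapping of these options to the individual config files follows the usual BeeGFS layout; verify it against the comments in your installed files):

```
# /etc/beegfs/beegfs-meta.conf and /etc/beegfs/beegfs-storage.conf:
storeAllowFirstRunInit = false

# /etc/beegfs/beegfs-mgmtd.conf:
sysAllowNewServers = false
```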

Too many open files on beegfs-storage server

This usually happens when a user application leaks open files, e.g. it creates a lot of files and forgets to close them due to a bug in the application. (Note that open files will automatically be closed by the kernel when an application ends, so this problem is usually temporary.)

There are per-process limits and system-wide limits (accounting for all processes on a machine together) to control how many files can be kept open at the same time on a host.
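Both kinds of limits can be inspected directly, a minimal sketch:

```shell
# Per-process limit for the current shell (soft limit):
ulimit -n

# System-wide limits, accounting for all processes on the host together:
cat /proc/sys/fs/file-max   # total number of open files allowed on the host
cat /proc/sys/fs/nr_open    # upper bound for any single process's nofile limit
```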

To prevent applications from opening too many files at once and to make sure that such application problems do not affect servers, it makes sense to reduce the per-process limit for normal applications of normal users to a reasonably low value, e.g. 1024, via the nofile setting in /etc/security/limits.conf.
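Such a cap could look like this (the values are illustrative; note that the * wildcard in limits.conf is typically not applied to the root user, so daemons started by root keep their own limit):

```
# /etc/security/limits.conf — limit ordinary users' processes to 1024 open files
*    soft    nofile    1024
*    hard    nofile    1024
```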

If your applications actually need to open a lot of files at the same time and you need to raise the limit in the beegfs-storage service, here are the steps to do this:

  1. You can check the current limit for the maximum number of open files through the /proc file system, e.g. for running beegfs-storage processes on a machine:

    $ for i in `pidof beegfs-storage`; do cat /proc/$i/limits | grep open; done
    Max open files            50000             50000             files
    
  2. The beegfs-storage and the beegfs-meta processes can try to increase their own limits through the configuration option tuneProcessFDLimit, but this will be subject to the hard limits that were defined for the system. If the beegfs-storage service fails to increase its own limit, it will print a message line to its log file (/var/log/beegfs-storage.log). Set the following in /etc/beegfs/beegfs-storage.conf to let the beegfs-storage service try to increase its own limit to 10 million files:

    tuneProcessFDLimit=10000000
    
  3. You can increase the system-wide limits (the limits that account for all processes together) to 20 million at runtime by using the following commands:

    # sysctl -w fs.file-max=20000000
    fs.file-max = 20000000
    # sysctl -w fs.nr_open=20000000
    fs.nr_open = 20000000
    
  4. Make the changes from the previous step persistent across reboots by adding the following lines to /etc/sysctl.conf (or a corresponding file in the subdir /etc/sysctl.d):

    fs.file-max = 20000000
    fs.nr_open = 20000000
    
  5. Add the following line to /etc/security/limits.conf (or a corresponding file in the subdir /etc/security/limits.d) to increase the per-process limit to 10 million. If this server is not only used for BeeGFS, but also for other applications, you might want to set this only for processes owned by root.

    *                -      nofile          10000000
    
  6. Now you need to close your current shell and open a new shell on the system to make the new settings effective. You can then restart the beegfs-storage process from the new shell and look at its limits:

    $ for i in `pidof beegfs-storage`; do cat /proc/$i/limits | grep open; done
    Max open files            10000000             10000000             files
    

What needs to be done when a server hostname has changed

Scenario: hostname or $HOSTNAME reports a different name than during the BeeGFS installation, and the BeeGFS servers refuse to start up. The logs state that the nodeID has changed and that startup was therefore refused.

Note that by default, node IDs are generated based on the hostname of a server. Because IDs are not allowed to change, see Setting node or target IDs for information on how to manually set your ID back to the previous value.
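As a hypothetical sketch (the path and hostname are examples; the authoritative procedure is described under Setting node or target IDs), the previous string ID can be written to the nodeID file in the service's storage directory before the daemon is started again:

```
# echo old-hostname > /data/beegfs/meta/nodeID
# systemctl start beegfs-meta
```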

Change log level during runtime

You can use beegfs-ctl to change the log level of any service. As in the config files, the levels range from 1 to 5 with 5 being the most verbose.

$ beegfs-ctl --genericdebug --nodetype=meta --nodeid=1 "setloglevel 5"

Client won’t unmount due to filesystem being busy

  1. Run the command below to identify which processes are keeping the mount point busy.

    $ lsof /mnt/beegfs
    
  2. If any processes are accessing the mount point, you will see a list like the following:

    COMMAND   PID  USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
    bash    21354  john  cwd    DIR   0,18        1    2 /mnt/beegfs
    bash    21355  mary  cwd    DIR   0,18        1    2 /mnt/beegfs
    
  3. Stop the listed processes. If that doesn’t work, use the following command to kill them:

    $ fuser -k /mnt/beegfs
    
  4. Wait 5 seconds.

  5. Retry the unmount / client stop.

  6. If you still get the same error message, you might still be able to identify the processes by running lsof without a path argument and checking the referenced paths. If a process is holding a handle on a path like /mnt/beegfs/mydir1, that path will appear in the new list as /mydir1 (the former mount point stripped from the path).

What happens with BeeGFS in a split brain scenario?

In a split brain scenario, only the nodes of one system partition remain online: the partition where the management service is running. The nodes of the separated partition will deny access to files until they are reconnected and can resume their communication with the management service. In practice, their services stall and only start producing I/O errors if the system remains split for too long.

This behavior is intentional, because in typical BeeGFS use cases, users cannot be allowed to access data that might be out of date, especially if write access were allowed in both partitions. In that case, the same file could end up being modified in both partitions while they were split, making it impossible to synchronize them later once the split is over.

Why do BeeGFS clients using RDMA log “beegfs: enabling unsafe global rkey” in the kernel?

Should I be concerned about this message?

No, this message is expected behavior on BeeGFS clients utilizing Remote Direct Memory Access (RDMA) when the kernel does not support “DMA keys” and only supports global rkeys.

What does this message mean?

Remote Direct Memory Access (RDMA) allows data to be transferred directly between the memory of two computers on a network without involving the processor, cache, or operating system of either computer. In RDMA, a ‘remote key’ or ‘rkey’ is used to permit peer RDMA READ/WRITE access to a specific region of memory that is registered with a protection domain using the shared rkey. This message indicates that the Linux kernel has logged the creation of a protection domain with the global rkey option enabled, which is a common practice in Linux RDMA software modules to optimize performance for small I/O requests. The ‘unsafe’ in the message merely highlights that the usage is not the default secure option but doesn’t necessarily imply any vulnerability or risk, especially since BeeGFS employs stringent security practices to mitigate any potential risks associated with using global rkey.

In BeeGFS, each connection has a separate protection domain. This means that when BeeGFS uses a global rkey, it still only provides access to memory registered for a particular connection. This greatly limits the exposure of using a global rkey while optimizing RDMA performance. It is also strongly recommended to enable Connection Based Authentication to prevent unauthorized clients or servers from interacting with the file system.

How can I get rid of this message?

Because this warning is logged by the kernel, it is not possible for BeeGFS to suppress the message when global rkeys are used. One alternative to global rkeys is a “DMA key”, though this is not necessarily safer, because both provide access to all RDMA-registered memory of a particular protection domain. However, opting for a “DMA key” does prevent the logging of “enabling unsafe global rkey”.

Historically, BeeGFS would automatically use the DMA key option if it was supported by the kernel, but support for DMA keys was removed in kernel 4.9 and later. DMA keys are now only supported when using the NVIDIA (Mellanox) OFED drivers. To allow users to continue using the “DMA key” option, a new client-side connRDMAKeyType option was introduced in BeeGFS 7.4.2 and 7.2.12, which can be set to “dma” to enable the use of DMA keys when the NVIDIA OFED drivers are installed.
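The option is set in the client config file, for example:

```
# /etc/beegfs/beegfs-client.conf (requires BeeGFS 7.4.2 / 7.2.12 or later
# and the NVIDIA OFED drivers):
connRDMAKeyType = dma
```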

If you wish to use the inbox RDMA drivers with kernel 4.9 or later, the only way to suppress the message is to adjust the kernel log level to only log kernel errors and above. Please note that adjusting the kernel log level may cause other useful log messages to be omitted, which may make future system troubleshooting more difficult.

Loading the BeeGFS client module fails for “could not insert ‘beegfs’: Invalid argument”?

When attempting to load the BeeGFS client module, you may encounter an error that reads: “could not insert ‘beegfs’: Invalid argument”. This FAQ provides troubleshooting steps to resolve this issue. Begin by examining the output of dmesg leading up to the error and match it to one of the sections below.

beegfs: disagrees about version of symbol

The BeeGFS client module must be compiled to work with the exact combination of kernel and RDMA driver versions currently in use by each Linux installation it is used with. While various symbols might be flagged by the error, in nearly all cases this class of errors indicates the client module was built for a different combination of versions than is actively loaded on the system. A frequent scenario is when the NVIDIA (formerly Mellanox) OFED drivers are loaded, but the module was built for the inbox RDMA drivers provided by the kernel.

To correct:

  • Run ofed_info to check whether the Mellanox OFED drivers are installed.

  • If you are using the “beegfs-client” package, modify /etc/beegfs/beegfs-client-autobuild.conf to append OFED_INCLUDE_PATH=<PATH> to the buildArgs=-j8 line, adjusting <PATH> to point to the appropriate headers (for details, refer to the help comments in the file).

  • If you are using the “beegfs-dkms-client” package, refer to the documentation for the BeeGFS DKMS Client and Handling Third-party OFED Installations.
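For instance, the resulting line could look like this (the include path is an example for the NVIDIA OFED drivers; verify it against your installation):

```
# /etc/beegfs/beegfs-client-autobuild.conf
buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include
```

Afterwards, the client module must be rebuilt before it is reloaded (e.g. via `/etc/init.d/beegfs-client rebuild`, if available on your installation).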