BeeGFS combines multiple storage servers to provide a highly scalable shared network file system
with striped file contents. This way, it allows users to overcome the tight performance limitations
of single servers, single network interconnects, a limited number of hard drives, etc. In such a
system, high throughput demands of large numbers of clients can easily be satisfied, but even a
single client can benefit from the aggregated performance of all the storage servers in the system.
This is made possible by a separation of metadata and file contents. While storage servers are
responsible for storing stripes of the actual contents of user files, metadata servers coordinate
file placement and striping among the storage servers and inform the clients about
certain file details when necessary. When accessing file contents, BeeGFS clients directly contact
the storage servers to perform file I/O and communicate with multiple servers simultaneously, giving
your applications truly parallel access to the file data. To keep the metadata access latency
(e.g., directory lookups) at a minimum, BeeGFS also allows you to distribute the metadata across
multiple servers so that each of the metadata servers stores a part of the global file system
namespace.
On the server side, BeeGFS runs as normal user-space daemons without any special requirements on the
operating system. The BeeGFS client is implemented as a Linux kernel module which provides a normal
mount-point so that your applications can directly access the BeeGFS storage system and do not need
to be modified to take advantage of BeeGFS. The module can be installed on all supported Linux
kernels without the need for any patches.
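Because the client module provides a normal mount point, applications simply use standard file system calls and tools. As a minimal sketch, assuming the client mounts BeeGFS at /mnt/beegfs (the actual mount point depends on your client configuration):
$ df -h /mnt/beegfs                         # shows the aggregated capacity of all storage targets
$ cp /data/input.dat /mnt/beegfs/project/   # ordinary tools and applications work unmodified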
The following picture shows the system architecture and roles within a BeeGFS instance.
System Architecture Overview: Parallelism and Scale-Out¶
In the picture above, all services are running on different hosts to show which services generally
exist in a BeeGFS storage cluster. However, it is also possible to run any combination of BeeGFS
services (client and server components) together on the same machines. Multiple instances of the
same service can also run on one machine by using Multi Mode. When BeeGFS is used
completely without separate dedicated storage servers, we call this a “converged setup”, as shown in
the picture below.
Besides the three basic roles in BeeGFS (clients, metadata service, storage service), there are two
additional system services that are part of BeeGFS. The management service serves as a registry and
watchdog for clients and servers. It is essential for the operation of the system, but it is not
directly involved in file operations and thus not critical for performance.
The optional monitoring service collects performance data from the servers and feeds it to a time
series database (for example InfluxDB). From this, real time monitoring and statistical analysis
of the system is possible with tools like Grafana.
BeeGFS supports user and group quota tracking and enforcement of used disk space and inode
count on the storage targets. The quota facilities of the underlying file systems are employed to
gather quota information, which is reported to the management node at regular intervals. If a user
or group exceeds the configured limit, write operations are blocked until space is freed.
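As a sketch of how this looks in practice, quota usage and limits can be inspected and set with beegfs-ctl; the user name and limit values below are examples, and the exact option names should be verified against your BeeGFS version:
$ beegfs-ctl --setquota --uid jdoe --sizelimit=50G --inodelimit=1M   # set limits for a user
$ beegfs-ctl --getquota --uid jdoe                                   # show current usage and limits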
The metadata nodes can be configured to emit a stream of messages, which allows external programs to
monitor the actions executed on the filesystem by the clients. This includes create, delete, move,
and other actions on files and directories. This information can, for example, be used to feed the
database of a policy engine.
Sketch of an example of how BeeGFS file modification events can be used.¶
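A minimal sketch of how such an event stream could be consumed, assuming the metadata configuration option sysFileEventLogTarget and the beegfs-event-listener helper shipped with recent BeeGFS versions (names from memory; check the documentation of your release):
# In /etc/beegfs/beegfs-meta.conf on the metadata servers:
sysFileEventLogTarget = unix:/run/beegfs/eventlog
# On the same host, read the events from the socket and forward them to your policy engine:
$ beegfs-event-listener /run/beegfs/eventlog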
Storage targets are chosen depending on the free space available when new files are created, in
order to balance the disk space usage. Targets are automatically placed in different capacity pools
depending on their free space. There are three pools: Normal, Low, and Emergency.
The target chooser algorithm prefers targets from the Normal pool, and falls back to the Low
and Emergency pools only when the requested striping cannot be achieved otherwise. The disk
space limits are the same for all targets and can be configured on the management daemon.
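As a sketch, the pool limits are set in the management daemon's configuration; the option names below match recent 7.x releases but should be verified against your version, and the values are only examples:
# /etc/beegfs/beegfs-mgmtd.conf
tuneStorageSpaceLowLimit       = 1T    # targets with less free space move to the Low pool
tuneStorageSpaceEmergencyLimit = 20G   # targets with less free space move to the Emergency pool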
Storage pools allow the cluster admin to group targets and mirror buddy groups together in different
classes. For instance, there can be one pool consisting of fast, but small solid state drives, and
another pool for bulk storage, using big but slower spinning disks. Pools can have descriptive
names, making it easy to remember which pool to use without looking up the storage targets in the
pool. The SSD pool could be named “fast” and the other “bulk”.
When creating a file, an application or user can choose which storage pool the file
contents are stored in. Since the concept of storage pools is orthogonal to the file system’s
directory structure, even files in the same directory can be in different storage pools.
The most intuitive way to expose the storage pools to the user, however, is using a single directory
per pool, giving the user the opportunity to assign files to pools just by sorting them into
different directories. More fine-grained control can be exerted using the command line tool
beegfs-ctl, which allows data placement down to the file level. Usually, the user will want to
move the current working project to the fast pool and keep it there until it is finished.
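As a sketch of this workflow, with example pool and target IDs (the pool ID returned on your system will differ, and the directory path is only illustrative):
$ beegfs-ctl --addstoragepool --desc=fast --targets=1,2     # create a pool from two SSD targets
$ beegfs-ctl --liststoragepools                             # show all pools and their targets
$ beegfs-ctl --setpattern --storagepoolid=2 /mnt/beegfs/projects/current   # new files below this directory go to the pool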
A storage pool consists of one or multiple storage targets. If storage mirroring is enabled, both
targets of a mirror buddy group must be in the same pool, and the mirror buddy group itself must
also be in the same storage pool.
Quota settings can be configured per storage pool.
Mirroring is not a replacement for backups. If files are accidentally deleted or overwritten
by a user or process, mirroring won’t help you to bring the old file back. You are still
responsible for doing regular backups of your important bits.
BeeGFS provides support for metadata and file contents mirroring. Mirroring capabilities are
integrated into the normal BeeGFS services, so that no separate services or third-party tools are
needed. Both types of mirroring (metadata mirroring and file contents mirroring) can be used
independently of each other. Mirroring also provides some high availability features.
Storage and metadata mirroring with high availability is based on so-called buddy groups. In
general, a buddy group is a pair of two targets that internally manage data replication between each
other. The buddy group approach allows one half of all servers in a system to fail while all data is
still accessible. It can also be used to put buddies in different failure domains or different fire
domains, e.g., different racks or different server rooms.
Storage Buddy Mirroring: 4 Servers with 1 Target per Server¶
Storage server buddy mirroring can also be used with an odd number of storage servers. This works,
because BeeGFS buddy groups are composed of individual storage targets, independent of their
assignment to servers, as shown in the following example graphic with three servers and two storage
targets per server. (In general, a storage buddy group could even be composed of two targets that
are attached to the same server.)
Storage Buddy Mirroring: 3 Servers with 2 Targets per Server¶
Note that this is not possible with metadata servers, since there are no metadata targets in BeeGFS:
an even number of metadata servers is needed so that every metadata server can belong to a buddy
group.
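A sketch of how buddy groups are defined with beegfs-ctl, using example target and node IDs (there is also an automatic mode, and activating metadata mirroring involves additional steps such as stopping clients; consult the documentation of your version):
$ beegfs-ctl --addmirrorgroup --nodetype=storage --primary=101 --secondary=201 --groupid=1
$ beegfs-ctl --addmirrorgroup --nodetype=meta --primary=1 --secondary=2 --groupid=1
$ beegfs-ctl --mirrormd    # activate metadata mirroring for the root directory and below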
In normal operation, one of the storage targets (or metadata servers) in a buddy group is considered
to be the primary, whereas the other is the secondary. Modifying operations will always be sent to
the primary first, which takes care of the mirroring process. File contents and metadata are
mirrored synchronously, i.e., the client operation completes only after both copies of the data
have been transferred to the servers.
If the primary storage target or metadata server of a buddy group is unreachable, it will get marked
as offline, and a failover to the secondary will be issued. In this case, the former secondary will
become the new primary. Such a failover is transparent and happens without any loss of data for
running applications. The failover will happen after a short delay, to guarantee consistency of the
system while the change information is propagated to all nodes. This short delay also avoids
unnecessary resynchronizations if a service is simply restarted, e.g., in case of a rolling update.
For more information on possible target states, see Target States.
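As an illustration, the current reachability and consistency state of each storage target can be listed with beegfs-ctl:
$ beegfs-ctl --listtargets --nodetype=storage --state
The output shows each target together with its reachability state (e.g., Online or Offline) and its consistency state (e.g., Good or Needs-resync).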
Please note that targets and servers that belong to a buddy group are also still available to store
unmirrored data, so it is easily possible to have a filesystem which only mirrors a certain subset
of the data.
A failover of a buddy group can only happen if the BeeGFS Management service is running. That means
that no failover will occur if the node with the beegfs-mgmtd service crashes. Therefore, it is
recommended to run the beegfs-mgmtd service on a different machine. However, a dedicated server for
the beegfs-mgmtd service is not necessary.
A file is striped across multiple storage targets; the data can additionally be mirrored for
resiliency. The number of storage targets used for striping and the chunk size are both
configurable. The file system administrator can set the stripe pattern (raid0 or buddymirror, see
Mirroring) using beegfs-ctl.
Chunks of file data are distributed across multiple storage targets. If the file is
buddy-mirrored, each chunk is duplicated onto two targets. Each target can store chunks of
unmirrored files and, if it is part of a buddy group, additionally chunks of mirrored files.¶
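For example, a stripe pattern could be configured and inspected as follows; the directory, file name, and values are only illustrative:
$ beegfs-ctl --setpattern --numtargets=4 --chunksize=1m /mnt/beegfs/bigfiles   # applies to newly created files below this directory
$ beegfs-ctl --getentryinfo /mnt/beegfs/bigfiles/output.dat                    # show the stripe pattern of an existing file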
Metadata is distributed across the metadata nodes on a per directory basis. The content of a
directory is always stored on one metadata node. A metadata node is chosen at random for each
directory.
Root Node: One of the metadata servers in BeeGFS will become a special root metadata node,
which serves as an initial coordinator and as a guide to the system for clients. (beegfs-ctl
can be used to find out which node has been chosen to be the root
node). The management daemon saves the root node ID once a root node has been chosen, so the
election of the root node happens only during the very first startup of the system.
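Since the root directory's metadata resides on the root metadata node, one way to identify it (assuming the default mount point /mnt/beegfs) is to query the entry information of the file system root:
$ beegfs-ctl --getentryinfo /mnt/beegfs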
When a file is deleted while it is still open by a process, its inode is moved to the disposal
directory on the metadata node. Later, when the file is closed, its inode and chunks are
finally erased. So, it is normal to see files in the disposal directories, as long as processes
still hold them open.
If a disposal file cannot be deleted because it is still being used by some process, you will see a
message like the one below.
[1-573EC7CC-C] File still in use
This is nothing to worry about: just wait for the process to end, and the disposal file will be
deleted. If you want to identify such a process, run the command below on your client nodes:
$ lsof | grep "(deleted)"
If the process is aborted, the Linux kernel automatically closes all files the process was using,
and the BeeGFS client module keeps trying to send the close operation to the metadata server. The
client also does this if the network was disconnected at the time of the close operation.
In case the client node crashes or is rebooted, the disposal files will be removed by the metadata
server after about 30 minutes, once the client node has been marked as dead.
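Leftover disposal entries can also be listed and cleaned up explicitly with beegfs-ctl; the mode shown below exists in current releases, but check beegfs-ctl --help for the exact options of your version:
$ beegfs-ctl --disposeunused --printstats            # show how many disposal files exist
$ beegfs-ctl --disposeunused --printfiles --dispose  # list them and remove those no longer in use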
Some network file systems have a limitation regarding the number of groups that a user can be part
of for the access permission check to work correctly. BeeGFS does not have such a limitation, so a
user can be in as many groups as the operating system supports.