Architecture¶

Overview¶

BeeGFS combines multiple storage servers to provide a highly scalable shared network file system with striped file contents. This way, it allows users to overcome the tight performance limitations of single servers, single network interconnects, a limited number of hard drives, etc. In such a system, high throughput demands of large numbers of clients can easily be satisfied, but even a single client can benefit from the aggregated performance of all the storage servers in the system.

This is made possible by a separation of metadata and file contents. While storage servers are responsible for storing stripes of the actual contents of user files, metadata servers do the coordination of file placement and striping among the storage servers and inform the clients about certain file details when necessary. When accessing file contents, BeeGFS clients directly contact the storage servers to perform file I/O and communicate with multiple servers simultaneously, giving your applications truly parallel access to the file data. To keep the metadata access latency (e.g., directory lookups) at a minimum, BeeGFS also allows you to distribute the metadata across multiple servers so that each of the metadata servers stores a part of the global file system namespace.

On the server side, BeeGFS runs as normal user-space daemons without any special requirements on the operating system. The BeeGFS client is implemented as a Linux kernel module which provides a normal mount-point so that your applications can directly access the BeeGFS storage system and do not need to be modified to take advantage of BeeGFS. The module can be installed on all supported Linux kernels without the need for any patches.

The following picture shows the system architecture and roles within a BeeGFS instance.

BeeGFS System Architecture — System Architecture Overview: Parallelism and Scale-Out¶

In the picture above, all services are running on different hosts to show which services generally exist in a BeeGFS storage cluster. However, it is also possible to run any combination of BeeGFS services (client and server components) together on the same machines. This also applies to multiple instances of the same service by using Multi Mode. When BeeGFS is used completely without separate dedicated storage servers, we call this a “converged setup”, as shown in the picture below.

BeeGFS Converged System Setup¶

Besides the three basic roles in BeeGFS (clients, metadata service, storage service) there are two additional system services that are part of BeeGFS: The management service, which serves as registry and watchdog for clients and servers. It is essential for the operation of the system, but is not directly involved in file operations and thus not critical for performance. The optional monitoring service collects performance data from the servers and feeds it to a time series database (for example InfluxDB). From this, real time monitoring and statistical analysis of the system is possible with tools like Grafana.

Features¶

Quota Support¶

BeeGFS supports user and group quota tracking and enforcement of used disk space and inode count on the storage targets. The quota facilities of the underlying filesystems is employed to gather quota information, which are reported to the management node at regular intervals. If a user or group exceeds the configured limit, write operations are blocked until space is freed.

Filesystem Event Monitoring¶

The metadata nodes can be configured to emit a stream of messages, which allows external programs to monitor the actions executed on the filesystem by the clients. This includes create, delete, move, and other actions on files and directories. This information, can for example, be used to feed the database of a policy engine.

BeeGFS File Modification Events — Sketch of an example of how BeeGFS file modification events can be used.¶

Capacity Pools¶

Storage targets are chosen depending on the free space available when new files are created, in order to balance the disk space usage. Targets are automatically placed in different capacity pools depending on their free space. There are three pools: Normal, Low, and Emergency. The target chooser algorithm prefers targets from the Normal pool, and falls back to the Low and Emergency pools only when the requested striping cannot be achieved otherwise. The disk space limits are the same for all targets and can be configured on the management daemon.

Storage Pools¶

Storage pools allow the cluster admin to group targets and mirror buddy groups together in different classes. For instance, there can be one pool consisting of fast, but small solid state drives, and another pool for bulk storage, using big but slower spinning disks. Pools can have descriptive names, making it easy to remember which pool to use without looking up the storage targets in the pool. The SSD pool could be named “fast” and the other “bulk”.

When a file is created by an application or user, they can choose which storage pool the file contents are stored in. Since the concept of storage pools is orthogonal to the file system’s directory structure, even files in the same directory can be in different storage pools.

The most intuitive way to expose the storage pools to the user is using a single directory per pool, giving the user the opportunity to assign files to pools just by sorting them into different directories. More fine-grained control is possible using the command line tool beegfs, which allows data placement down to the file level. Usually though, users will just want to move their current project data to the fast pool and keep it there until they are finished.

A storage pool consists of one or more storage targets. If storage mirroring is enabled, both targets of a mirror buddy group must be in the same pool, and the mirror buddy group itself must also be in the same storage pool.

Quota settings can be configured per Storage Pool.

Mirroring¶

Warning

Mirroring is not a replacement for backups. If files are accidentally deleted or overwritten by a user or process, mirroring won’t help you to bring the old file back. You are still responsible for doing regular backups of your important bits.

BeeGFS provides support for metadata and file contents mirroring. Mirroring capabilities are integrated into the normal BeeGFS services, so that no separate services or third-party tools are needed. Both types of mirroring (metadata mirroring and file contents mirroring) can be used independently of each other. Mirroring also provides some high availability features.

Storage and metadata mirroring with high availability is based on so-called buddy groups. In general, a buddy group is a pair of two targets that internally manage data replication between each other. The buddy group approach allows one half of all servers in a system to fail while all data is still accessible. It can also be used to put buddies in different failure domains or different fire domains, e.g., different racks or different server rooms.

../_images/mirroring_1.png — Storage Buddy Mirroring: 4 Servers with 1 Target per Server¶

Storage server buddy mirroring can also be used with an odd number of storage servers. This works, because BeeGFS buddy groups are composed of individual storage targets, independent of their assignment to servers, as shown in the following example graphic with three servers and two storage targets per server. (In general, a storage buddy group could even be composed of two targets that are attached to the same server.)

../_images/mirroring_2.png — Storage Buddy Mirroring: 3 Servers with 2 Targets per Server¶

Note that this is not possible with metadata servers since there are no metadata targets in BeeGFS. An even number of metadata servers is needed so that every metadata server can belong to a buddy group.

In normal operation, one of the storage targets (or metadata servers) in a buddy group is considered to be the primary, whereas the other is the secondary. Modifying operations will always be sent to the primary first, which takes care of the mirroring process. File contents and metadata are mirrored synchronously, i.e. the client operation completes after both copies of the data were transferred to the servers.

If the primary storage target or metadata server of a buddy group is unreachable, it will get marked as offline, and a failover to the secondary will be issued. In this case, the former secondary will become the new primary. Such a failover is transparent and happens without any loss of data for running applications. The failover will happen after a short delay, to guarantee consistency of the system while the change information is propagated to all nodes. This short delay also avoids unnecessary resynchronizations if a service is simply restarted, e.g., in case of a rolling update. For more information on possible target states see here: Target States

Please note that targets and servers that belong to a buddy group are also still available to store unmirrored data, so it is easily possible to have a filesystem which only mirrors a certain subset of the data.

A failover of a buddy group can only happen if the BeeGFS Management service is running. That means that no failover will occur if the node with the beegfs-mgmtd service crashes. Therefore, it is recommended to have the beegfs-mgmtd service running on a different machine. However, it is not necessary to have a dedicated server for the beegfs-mgmtd service, as explained here.

Setup of mirroring is described in Mirroring.

Architecture Details¶

File Striping¶

A file is striped across multiple targets. It is also possible to stripe the file across multiple targets and mirror the data for resiliency. It is possible to change the number of storage targets which should be used for the striping and the chunk size. The file system administrator can set the stripe pattern (raid0 and mirrored, see Mirroring) using the beegfs.

../_images/striping.png — Chunks of file data are distributed across multiple storage targets. If the file is buddy-mirrored, each chunk will be duplicated onto two targets. Each target can store chunks of unmirrored files, and chunks of mirrored files if part of a buddy group.¶

Metadata Distribution¶

Metadata is distributed across the metadata nodes on a per directory basis. The content of a directory is always stored on one metadata node. A metadata node is chosen at random for each directory.

Root Node: One of the metadata servers in BeeGFS will become a special root metadata node, which serves as an initial coordinator and as a guide to the system for clients. To determine which node has been chosen as the root node run beegfs entry info --mount=none / and look which “META NODE” owns the entry. The management daemon saves the root node ID when a root node was chosen, so the election of the root node only happens during the very first system startup.

Deletion of files in use¶

When a file is deleted while it is still open by a process, its inode is moved to the directory disposal on the metadata node. Later, when the file is closed, its inode and chunks are finally erased. So, it is normal to see files in the disposal directories, as long as processes still hold them open.

$ beegfs entry dispose-unused --print-files

If a disposal file cannot be deleted because it is still being used by some process, you will see a message like the one below.

[1-573EC7CC-C] File still in use

Nothing to worry about. Just wait for the process end and that disposal file will be deleted. If you want to identify such process, please run the command below on your client nodes:

$ lsof | grep "(deleted)"

In case the process is aborted, the Linux Kernel will automatically close all files the process was using, and the BeeGFS client module will keep on trying to send the close operation to the metadata server. The client would do that also if the network was disconnected at the time of the close operation.

In case the client node crashes or is rebooted, the disposal files will be removed by the metadata server after about 30 minutes when the client node is marked as dead.

Limitations¶

Maximum number of groups per user¶

Some network file systems have a limitation regarding the number of groups that a user can be part of for the access permission check to work correctly. BeeGFS does not have such a limitation, so a user can be in as many groups as the operating system supports.