Remote Storage Targets¶
Introduction¶
By default, all files in BeeGFS are striped across one or more storage targets using a proprietary chunk file format optimized for high-performance parallel access. BeeGFS also supports synchronizing files with one or more Remote Storage Targets. Any S3-compatible external storage provider can be used as a remote target, including on-premises solutions or cloud providers. In addition to the core file system services (Management, Metadata, and Storage), using Remote Storage Targets requires deploying an additional Remote service that coordinates synchronizing files using one or more Sync services.
The Remote and Sync services can be deployed on the same physical server as other core file system services, or on dedicated servers, depending on the available hardware and requirements of a particular environment. This allows remote targets to be added to existing BeeGFS deployments where the servers running core services may not have been sized with the remote targets feature in mind. This also makes it possible to avoid exposing the servers running core BeeGFS services directly to the internet, and treat the Remote/Sync servers as “gateway” nodes.
```
                   ┌────────────────┐
                   │ Remote Service ◄───┐
┌──────────────┐   └────────▲───────┘   │
│              │            │           │   ┌─────────────┐
│    BeeGFS    ◄────────────┤           ├──►│ S3 Provider │
│              │            │           │   └─────────────┘
└──────────────┘   ┌────────▼───────┐   │
                   │  Sync Service  ◄───┘
                   └────────────────┘
```
Files and directories in BeeGFS can be associated with one or more Remote Storage Targets. The remote target configuration for each entry is stored by the BeeGFS metadata service alongside all the other configuration for the entry. The `beegfs` command-line tool is used to synchronize entries with a remote target, either pushing (uploading) or pulling (downloading) them.
Limitations and Known Issues¶
Because BeeGFS Remote uses the file size to determine how to split files into multiple work requests, when the underlying file system for storage targets is ZFS the `beegfs entry refresh` command must be run against new/modified files before using `beegfs remote push` (see the example below). This is because ZFS does not immediately update file sizes when files are closed, which can lead to discrepancies between the size recorded by the BeeGFS metadata server and the actual file size.
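For example, on ZFS-backed storage targets a push of a newly written file can be preceded by a refresh (the path is a placeholder):

```bash
# Refresh the file size recorded by the metadata service, then push.
beegfs entry refresh /mnt/beegfs/data/results.bin
beegfs remote push /mnt/beegfs/data/results.bin
```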
Requirements¶
Capacity¶
When a path in BeeGFS needs to be synchronized, a job is created in the Remote service. Depending on the size of the file, one or more work requests will be generated and assigned to the Sync service(s) that handle actually moving the data. The system is designed so both the Remote and Sync services can be restarted and resume synchronizing data from where they left off while minimizing the amount of data that must be retransferred.
The Remote service needs to store information about pending, active, and historical jobs on a per-path basis, but until a path is synchronized it will not have an entry in the Remote database. While capacity requirements vary depending on the job, jobs generally only require a few KiB each, and the number of historical jobs retained for each path is capped by default at four. This means a file system containing 1 billion files, retaining at most four jobs per file with jobs averaging 3 KiB, would require roughly 11 TiB. Note this is a rough worst-case estimate and the actual requirements would likely be less due to compression.
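As a quick sanity check of that estimate (the 3 KiB average job size is illustrative):

```bash
# 1 billion files x 4 retained jobs x ~3 KiB per job, converted from KiB to TiB.
echo $(( 1000000000 * 4 * 3 / 1024 / 1024 / 1024 ))   # ~11 (TiB)
```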
The Sync service only needs to store information about work requests it is currently assigned and thus requires significantly less space. However, this also makes the capacity requirements harder to estimate since they depend on the number of active jobs that may be present at any one time. For example, a system synchronizing tens of thousands of files concurrently generally only requires 200-300 MiB, but to allow for bursts of activity where potentially millions of files are being synchronized it would be better to allocate 100-200 GiB per Sync service. Note that again this is a rough worst-case estimate and the actual requirements would likely be less.
Getting Started¶
Prerequisites¶
On all servers that will run Remote or Sync services:
1. Add the BeeGFS package repositories to your package manager.
2. Configure TLS. The exact steps will depend on how you choose to configure TLS for the BeeGFS management service. Unless you opted to disable TLS or want to use different TLS certificates for the Remote/Sync services, you can simply copy the same certificate and key files used by the management service and install them under `/etc/beegfs` on all Remote/Sync servers (see the sketch after this list). This is what the default configuration files expect and is the easiest way to configure TLS uniformly.
3. Install the BeeGFS client and mount BeeGFS at `/mnt/beegfs`. For more details refer to the quick start or manual installation guides.
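For example, the TLS step might look like the following on each Remote/Sync server; the certificate and key file names and the management host are placeholders and should match how TLS was configured for your management service:

```bash
# Copy the certificate and key already used by the management service
# (file names are examples; adjust to match your management configuration).
scp mgmt-node:/etc/beegfs/cert.pem /etc/beegfs/cert.pem
scp mgmt-node:/etc/beegfs/key.pem  /etc/beegfs/key.pem
chmod 600 /etc/beegfs/key.pem

# Confirm the BeeGFS client mount is in place.
findmnt /mnt/beegfs
```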
Install/Configure the Sync Service(s)¶
On all servers that will run the Sync service:

1. Install the `beegfs-sync` package using your package manager. For example: `dnf install beegfs-sync`.
2. Most of the service's configuration is centrally managed by the Remote service and automatically pushed to all Sync servers. This means the default `/etc/beegfs/beegfs-sync.toml` file may not need to be modified, however there are some settings to be aware of (see the default file for more details, and the sketch after these steps):
   - The `[server]` section controls how the Remote service connects to the Sync service. By default the service listens on all available addresses using port 9011, but you can optionally limit what `address` the service listens on and adjust the TLS configuration.
   - The `[manager]` section is used to configure where the Sync service keeps track of active and pending work requests. It uses the `journal-db` to track and resume ongoing work requests after a restart, and the `job-db` to optimize looking up jobs and work requests.
   - The `[remote]` section controls how the Sync service connects to the Remote service. Because the address is synchronized, only the TLS configuration may need customization.
3. Start and enable the service to ensure it automatically restarts after a reboot: `systemctl enable --now beegfs-sync`.
4. Verify the service finished startup and is serving gRPC requests: `journalctl -u beegfs-sync`.
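The following is a minimal sketch of settings you might override; the values (and whether `address` includes the port) are illustrative assumptions, so confirm key names and formats against the comments in the shipped default file:

```bash
# Write an example snippet to a scratch file for reference, then merge the
# relevant keys into /etc/beegfs/beegfs-sync.toml by hand.
cat <<'EOF' > /tmp/beegfs-sync-example.toml
[server]
# Listen on a specific address instead of all interfaces (default port 9011).
address = "10.0.0.21:9011"

[manager]
# Databases used to track and resume work requests across restarts.
journal-db = "/var/lib/beegfs/sync/journal"
job-db     = "/var/lib/beegfs/sync/jobs"
EOF
```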
Install/Configure the Remote Service¶
On a single server that will run the Remote service:
1. Install the `beegfs-remote` package using your package manager. For example: `dnf install beegfs-remote`.
2. Using your preferred text editor, edit the `/etc/beegfs/beegfs-remote.toml` file. Available settings are documented in the file; below is an overview of the minimum configuration required to run the Remote service (see also the sketch after these steps):
   - The `[management]` section controls how Remote connects to the management service. Update the `address` to the IP address/hostname and port where the management service is listening for gRPC traffic. If you are using connection-based authentication, download the same shared secret as used for your BeeGFS management service to `/etc/beegfs/conn.auth` (otherwise set `auth-disable = true`). If needed, adjust the TLS configuration used to connect to the management service.
   - The `[server]` section controls how the `beegfs` tool and Sync services connect to the Remote service. By default the service listens on all available addresses using port 9010, but you can optionally limit what `address` the service listens on and adjust the TLS configuration.
   - The `[job]` section controls where the Remote service keeps track of historical, active, and pending data synchronization jobs for each path. This information is stored using BadgerDB, and the path where the database is stored can be customized with `path-db`.
   - Add `[[worker]]` sections for each Sync node. Only the configuration for a single Sync node should be specified in each `[[worker]]` section.
   - Add `[[remote-storage-target]]` sections for each remote target. Example configuration for common S3 providers is documented in the default file. Note the `id` is how entries in BeeGFS are associated with one or more remote targets and must be unique for each remote target.
3. Start and enable the service to ensure it automatically restarts after a reboot: `systemctl enable --now beegfs-remote`.
4. Verify the service finished startup and was able to connect to all Sync nodes by running: `journalctl -u beegfs-remote`. If you now check the Sync node logs you should see "successfully applied new configuration".
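The following is a minimal sketch of the minimum configuration described above. All values are illustrative; the management gRPC port and the keys inside `[[worker]]` are assumptions, so confirm everything against the shipped default file:

```bash
# Write an example snippet to a scratch file for reference, then merge the
# relevant sections into /etc/beegfs/beegfs-remote.toml by hand.
cat <<'EOF' > /tmp/beegfs-remote-example.toml
[management]
# gRPC address of the BeeGFS management service (hostname:port).
address = "mgmt-node.example.com:8010"
# Set to true only if connection based authentication is disabled;
# otherwise place the shared secret at /etc/beegfs/conn.auth.
auth-disable = false

[job]
# Where the Remote service stores its BadgerDB job database.
path-db = "/var/lib/beegfs/remote/jobs"

# One [[worker]] section per Sync node (key name assumed; see the default file).
[[worker]]
address = "sync-node-1.example.com:9011"

# One [[remote-storage-target]] section per remote target. The id is how
# entries in BeeGFS are associated with this target and must be unique.
# Copy the provider-specific S3 keys (endpoint, bucket, credentials, ...)
# from the examples in the default file.
[[remote-storage-target]]
id = 1
EOF
```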
Using Remote Storage Targets¶
Warning
Uploading and downloading data using remote push/pull commands may incur charges depending on the storage provider used for remote targets. These charges typically include ingress/egress fees based on the amount of data transferred and the number of API requests made to the storage provider. While the Remote Storage Target feature is designed to minimize unnecessary requests, it is highly recommended that you first perform a test run with a small dataset that accurately represents your typical data. This will help you evaluate the potential cost of syncing data with a particular storage provider before transferring large amounts of data.
To interact with Remote Storage Targets use the `beegfs` tool provided with the `beegfs-tools` package. The following section outlines common tasks, but help is also available as part of the `beegfs` tool by appending `--help` to any command.
Commands for interacting with remote targets can be found under `beegfs remote`. For example, to get the list of available remote targets run: `beegfs remote list`. Similar to storage pools, remote targets can be configured on a per-file or per-directory basis. When configured on a directory they will be automatically inherited by entries created under that directory. Unlike storage pools, remote targets can be updated on both files and directories at any time. Use the standard commands for interacting with entries to check and apply remote target configuration (a combined example follows the list):

- Inspect remote target configuration: `beegfs entry info <path>`
- Set remote target configuration: `beegfs entry set --remote-targets=<id>[,<id>]... <path>`
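A typical sequence might look like the following, assuming a remote target with ID 1 already exists and using a placeholder directory:

```bash
# List the remote targets known to the Remote service.
beegfs remote list

# Associate a directory with remote target 1; new entries created underneath
# will inherit this configuration.
beegfs entry set --remote-targets=1 /mnt/beegfs/projects

# Confirm the configuration was applied.
beegfs entry info /mnt/beegfs/projects
```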
Once entries have remote target(s) set they can be pushed (uploaded) to those target(s) by running: `beegfs remote push <path>`. If the specified `<path>` is a directory then all entries under that directory will be recursively pushed to whichever target(s) they are associated with. It is also possible to perform a one-time push to a particular remote target by specifying the `--remote-target` flag when pushing entries.
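For example (paths and target IDs are placeholders):

```bash
# Recursively push everything under the directory to its configured target(s).
beegfs remote push /mnt/beegfs/projects

# One-time push of a single file to a specific remote target.
beegfs remote push --remote-target=2 /mnt/beegfs/projects/report.dat
```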
To pull (download) entries from a remote target run: `beegfs remote pull --remote-target=<id> --remote-path=<string> <path>`. For a pull, the remote target ID and the remote path (typically the object name) must always be specified along with the path inside BeeGFS where the file will be pulled. Note that when files are pulled into BeeGFS they are not automatically associated with the original remote target ID, but rather inherit their remote target configuration from the parent directory where they are downloaded.
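For example, to pull an object from remote target 1 into a placeholder location in BeeGFS:

```bash
# The remote path is typically the object name in the bucket.
beegfs remote pull --remote-target=1 --remote-path=results/run-42.tar /mnt/beegfs/incoming/run-42.tar
```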
When entries are pushed or pulled the command will return as soon as job(s) are scheduled for those path(s). Jobs will be queued behind other outstanding jobs and run asynchronously in the background. To check the synchronization status use `beegfs remote status <path>`, where the path can be a single file or a directory, in which case the status of all files under that directory is returned recursively.
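For example (placeholder paths):

```bash
# Status of a single file.
beegfs remote status /mnt/beegfs/projects/report.dat

# Recursive status of everything under a directory.
beegfs remote status /mnt/beegfs/projects
```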
If a job encounters an unrecoverable error and retries are exhausted, for example if the remote target is not serving requests, then the job will move to a failed status and require user intervention to proceed. If a job fails, use the `--debug` flag on the status command to investigate any issues with the remote target and take corrective measures before canceling the failed job with `remote job cancel` and retrying the original push/pull operation.
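A recovery sequence might look like the following; the path is a placeholder and the exact arguments accepted by the job cancel command should be confirmed with `--help`:

```bash
# Investigate why the job failed.
beegfs remote status --debug /mnt/beegfs/projects/report.dat

# After fixing the underlying issue (e.g. refreshed credentials), cancel the
# failed job and retry the original operation.
beegfs remote job cancel /mnt/beegfs/projects/report.dat
beegfs remote push /mnt/beegfs/projects/report.dat
```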
In some cases corrective measures may require updating the remote target configuration on the Remote service, for example if the provided credentials have expired. To update remote target configuration it is currently necessary to first reconfigure/restart the Remote service and then restart each of the Sync services.
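For example, after editing the remote target configuration:

```bash
# Restart the Remote service first so it picks up the new configuration...
systemctl restart beegfs-remote

# ...then restart each Sync service so the updated configuration is re-applied.
systemctl restart beegfs-sync   # run on every Sync server
```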
FAQs¶
Does enabling Remote Storage Targets have any impact on file system performance?¶
When Remote Storage Targets are assigned to entries in BeeGFS, an additional hidden extended attribute is created on the associated inode (or dentry for inlined inodes) that is used to store the list of targets and other RST-specific configuration for that entry. This minimizes the impact enabling Remote functionality has on the core file system when remote targets are not in use, at the cost of a slight performance penalty when performing certain metadata operations on entries with remote targets configured. For example, you may observe slightly lower performance creating entries in a directory assigned to remote target(s), as there is an additional extended attribute that must be read from the parent directory and created for each child entry.
Tip
If you are concerned that assigning remote targets to entries is impacting performance, you can entirely remove the remote configuration from entries with `beegfs entry set --remote-targets=none`.
While the performance impact is minimal, it is also possible to use all Remote functionality without assigning remote targets to entries by simply specifying the target you want to interact with using the `--remote-target` flag when running `remote` commands such as push, pull, and status. Even if you have remote targets assigned to entries, if you know the remote target you want to interact with, specifying the target will optimize command execution as it allows the `beegfs` tool to skip reading remote targets from each entry's metadata.
Assigning remote targets to entries is helpful if you want to associate certain directory trees or files with specific remote targets so you can simply use `beegfs remote push` to ensure everything is synced with the correct targets. If you are mostly syncing data to a single target, or only have a few directories synced with different targets, you may find that configuring remote targets on individual entries is not needed. The feature is designed and intended to be used with or without configuring remote targets on individual entries, depending on the requirements of a particular environment.