Remote Storage Targets

Introduction

By default, all files in BeeGFS are striped across one or more storage targets using a proprietary chunk file format optimized for high-performance parallel access. BeeGFS also supports synchronizing files with one or more Remote Storage Targets. Any S3-compatible external storage provider can be used as a remote target, including on-premises solutions and cloud providers. In addition to the core file system services (Management, Metadata, and Storage), using Remote Storage Targets requires deploying an additional Remote service that coordinates file synchronization using one or more Sync services.

The Remote and Sync services can be deployed on the same physical servers as other core file system services, or on dedicated servers, depending on the available hardware and requirements of a particular environment. This allows remote targets to be added to existing BeeGFS deployments where the servers running core services may not have been sized with the remote targets feature in mind. It also makes it possible to avoid exposing the servers running core BeeGFS services directly to the internet and to treat the Remote/Sync servers as “gateway” nodes.

                   ┌────────────────┐
                   │ Remote Service ◄───┐
┌──────────────┐   └────────▲───────┘   │
│              │            │           │   ┌─────────────┐
│    BeeGFS    ◄────────────┤           ├──►│ S3 Provider │
│              │            │           │   └─────────────┘
└──────────────┘   ┌────────▼───────┐   │
                   │  Sync Service  ◄───┘
                   └────────────────┘

Files and directories in BeeGFS can be associated with one or more Remote Storage Targets. The remote target configuration for each entry is stored by the BeeGFS metadata service alongside all the other configuration for the entry. The beegfs command-line tool is used to synchronize entries with a remote target and can be used to push (upload) entries or pull (download) entries.

Limitations and Known Issues

  • Because BeeGFS Remote uses the file size to determine how to split files into multiple work requests, when storage targets use ZFS as their underlying file system the beegfs entry refresh command must be run against new/modified files before using beegfs remote push. This is because ZFS does not immediately update file sizes when files are closed, which can lead to discrepancies between the size recorded by the BeeGFS metadata service and the actual file size.
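
For example, a simple workaround on ZFS-backed storage targets (assuming beegfs entry refresh accepts paths in the same way as other entry subcommands, and using an illustrative path) is to refresh the entry immediately before pushing it:

    beegfs entry refresh /mnt/beegfs/dataset/results.bin
    beegfs remote push /mnt/beegfs/dataset/results.bin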

Requirements

Capacity

When a path in BeeGFS needs to be synchronized, a job is created in the Remote service. Depending on the size of the file, one or more work requests will be generated and assigned to the Sync service(s) that actually move the data. The system is designed so both the Remote and Sync services can be restarted and resume synchronizing data from where they left off while minimizing the amount of data that must be retransferred.

The Remote service needs to store information about pending, active, and historical jobs on a per-path basis, but until a path is synchronized it will not have an entry in the Remote database. While capacity requirements vary depending on the job, jobs generally only require a few KiB each, and the number of historical jobs retained for each path is capped at four by default. This means a file system containing 1 billion files, each retaining at most four jobs averaging 3 KiB, would require roughly 11 TiB. Note this is a rough worst-case estimate and the actual requirements would likely be lower due to compression.
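
For reference, the arithmetic behind that estimate: 1,000,000,000 paths × 4 retained jobs × ~3 KiB per job ≈ 12 × 10⁹ KiB, or roughly 11 TiB before any compression.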

The Sync service only needs to store information about work requests it is currently assigned and thus requires significantly less space. However, this also makes the capacity requirements harder to estimate since they depend on the number of active jobs that may be present at any one time. For example, a system synchronizing tens of thousands of files concurrently generally only requires 200-300 MiB, but to allow for bursts of activity where potentially millions of files are being synchronized it would be better to allocate 100-200 GiB per Sync service. Note that again this is a rough worst-case estimate and the actual requirements would likely be less.

Getting Started

Prerequisites

On all servers that will run Remote or Sync services:

  1. Add the BeeGFS package repositories to your package manager.

  2. Configure TLS. The exact steps will depend on how you choose to configure TLS for the BeeGFS management service.

    • Unless you opted to disable TLS or want to use different TLS certificates for Remote/Sync services, you could just copy the same certificate and key files used by the management service and install them under /etc/beegfs on all Remote/Sync servers (a sketch follows this list). This is what the default configuration files expect and is the easiest way to get TLS configured uniformly.

  3. Install the BeeGFS client and mount BeeGFS at /mnt/beegfs. For more details refer to the quick start or manual installation guides.
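
For example, assuming the management service keeps its certificate and key under /etc/beegfs (the hostname and file names below are purely illustrative), copying them to a Remote/Sync server could look like:

    # Run on each Remote/Sync server; adjust host and file names to your environment.
    scp root@mgmtd.example.com:/etc/beegfs/cert.pem /etc/beegfs/
    scp root@mgmtd.example.com:/etc/beegfs/key.pem /etc/beegfs/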

Install/Configure the Sync Service(s)

On all servers that will run a Sync service:

  1. Install the beegfs-sync package using your package manager.

    • For example: dnf install beegfs-sync.

  2. Most of the service’s configuration is centrally managed by the Remote service and automatically pushed to all Sync servers. This means the default /etc/beegfs/beegfs-sync.toml file may not need to be modified; however, there are some settings to be aware of (see the default file for more details, and the sketch after this list):

    • The [server] section controls how the Remote service connects to the Sync service. By default the service listens on all available addresses using port 9011, but you can optionally limit which addresses the service listens on and adjust the TLS configuration.

    • The [manager] section is used to configure where the Sync service keeps track of active and pending work requests. It uses the journal-db to track and resume ongoing work requests after a restart, and the job-db to optimize looking up jobs and work requests.

    • The [remote] section controls how the Sync service connects to the Remote service. Because the address is synchronized from the Remote service, only the TLS configuration may need customization.

  3. Start and enable the service to ensure it automatically restarts after a reboot: systemctl enable --now beegfs-sync.

  4. Verify the service finished startup and is serving gRPC requests: journalctl -u beegfs-sync.
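
For orientation, here is a minimal sketch of /etc/beegfs/beegfs-sync.toml covering the sections described above. Only the section names and the journal-db/job-db keys are taken from the documentation above; the paths shown are assumptions, and the commented default file shipped with the package remains authoritative:

    [server]
    # Defaults: listen on all addresses on port 9011; TLS settings are documented in the default file.

    [manager]
    journal-db = "/var/lib/beegfs/sync/journal.db"  # example path; tracks and resumes ongoing work requests
    job-db = "/var/lib/beegfs/sync/job.db"          # example path; speeds up job and work request lookups

    [remote]
    # The Remote service address is synchronized automatically; typically only TLS settings need attention here.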

Install/Configure the Remote Service

On a single server that will run the Remote service:

  1. Install the beegfs-remote package using your package manager.

    • For example: dnf install beegfs-remote.

  2. Using your preferred text editor, edit the /etc/beegfs/beegfs-remote.toml file. Available settings are documented in the file; here is an overview of the minimum configuration required to run the Remote service (a sketch follows this list):

    • The [management] section controls how Remote connects to the management service. Update the address to the IP address/hostname and port where the management service is listening for gRPC traffic. If you are using connection-based authentication, download the same shared secret used by your BeeGFS management service to /etc/beegfs/conn.auth (otherwise set auth-disable = true). If needed, adjust the TLS configuration used to connect to the management service.

    • The [server] section controls how the beegfs tool and Sync services connect to the Remote service. By default the service listens on all available addresses using port 9010, but you can optionally limit which addresses the service listens on and adjust the TLS configuration.

    • The [job] section controls where the Remote service keeps track of historical, active, and pending data synchronization jobs for each path. This information is stored using BadgerDB, and the path where the database is stored can be customized with path-db.

    • Add [[worker]] sections for each Sync node. Only the configuration for a single Sync node should be specified in each [[worker]] section.

    • Add [[remote-storage-target]] sections for each remote target. Example configuration for common S3 providers is documented in the default file. Note the id is how entries in BeeGFS are associated with one or more remote targets and must be unique for each remote target.

  3. Start and enable the service to ensure it automatically restarts after a reboot: systemctl enable --now beegfs-remote.

  4. Verify the service finished startup and was able to connect to all Sync nodes by running: journalctl -u beegfs-remote.

    • If you now check the Sync node logs, you should see “successfully applied new configuration”.
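
For orientation, here is a minimal sketch of /etc/beegfs/beegfs-remote.toml touching the sections described above. Only the section names, the address, path-db, auth-disable, and id keys, and the default ports are taken from the documentation above; all other key names, paths, and values are illustrative assumptions, and the commented default file shipped with the package remains authoritative:

    [management]
    address = "mgmtd.example.com:8010"  # example; use the address/port where your management service serves gRPC
    # auth-disable = true               # only if connection authentication is disabled

    [server]
    # Defaults: listen on all addresses on port 9010; TLS settings are documented in the default file.

    [job]
    path-db = "/var/lib/beegfs/remote/jobs"  # example path for the BadgerDB job database

    [[worker]]
    # One section per Sync node; the Sync service listens on port 9011 by default.
    address = "sync01.example.com:9011"      # example; key name may differ, see the default file

    [[remote-storage-target]]
    id = 1  # associates BeeGFS entries with this target; must be unique per target
    # S3 endpoint, bucket, and credential settings go here; see the provider examples in the default file.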

Using Remote Storage Targets

Warning

Uploading and downloading data using remote push/pull commands may incur charges depending on the storage provider used for remote targets. These charges typically include ingress/egress fees based on the amount of data transferred and the number of API requests made to the storage provider. While the Remote Storage Target feature is designed to minimize unnecessary requests, it is highly recommended that you first perform a test run with a small dataset that accurately represents your typical data. This will help you evaluate the potential cost of syncing data with a particular storage provider before transferring large amounts of data.

To interact with Remote Storage Targets use the beegfs tool provided with the beegfs-tools package. The following section outlines common tasks, but help is also available as part of the beegfs tool by appending --help to any command.

Commands for interacting with remote targets can be found under beegfs remote. For example, to get the list of available remote targets run: beegfs remote list. Similar to storage pools, remote targets can be configured on a per-file or per-directory basis. When configured on a directory, they are automatically inherited by entries created under that directory. Unlike storage pools, remote targets can be updated on both files and directories at any time. Use the standard commands for interacting with entries to check and apply remote target configuration:

  • Inspect remote target configuration: beegfs entry info <path>

  • Set remote target configuration: beegfs entry set --remote-targets=<id>[,<id>]... <path>
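
For example, to associate a project directory (an illustrative path) with the remote target whose ID is 1 and then verify the configuration:

    beegfs entry set --remote-targets=1 /mnt/beegfs/projects/alpha
    beegfs entry info /mnt/beegfs/projects/alpha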

Once entries have remote target(s) set, they can be pushed (uploaded) to those target(s) by running: beegfs remote push <path>. If the specified <path> is a directory, all entries under that directory will be recursively pushed to whichever target(s) they are associated with. It is also possible to perform a one-time push to a particular remote target by specifying the --remote-target flag when pushing entries.
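
For example (paths and target IDs are illustrative):

    # Recursively push everything under a directory to whichever targets the entries are associated with:
    beegfs remote push /mnt/beegfs/projects/alpha
    # One-time push of a single file to a specific target, regardless of the entry's configuration:
    beegfs remote push --remote-target=2 /mnt/beegfs/scratch/results.tar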

To pull (download) entries from a remote target run: beegfs remote pull --remote-target=<id> --remote-path=<string> <path>. For a pull, the remote target ID and the remote path (typically the object name) must always be specified along with the path inside BeeGFS where the file will be placed. Note when files are pulled into BeeGFS they are not automatically associated with the original remote target ID, but rather inherit their remote target configuration from the parent directory where they are downloaded.
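
For example, to pull an object named datasets/input.tar (an illustrative object name) from remote target 1 into BeeGFS:

    beegfs remote pull --remote-target=1 --remote-path=datasets/input.tar /mnt/beegfs/incoming/input.tar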

When entries are pushed or pulled, the command returns as soon as job(s) are scheduled for those path(s). Jobs are queued behind other outstanding jobs and run asynchronously in the background. To check the synchronization status use beegfs remote status <path>, where <path> can be a single file, or a directory to recursively return the status of all files under that directory.
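
For example (illustrative paths):

    # Status of a single file:
    beegfs remote status /mnt/beegfs/incoming/input.tar
    # Recursive status of everything under a directory:
    beegfs remote status /mnt/beegfs/projects/alpha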

If a job encounters an unrecoverable error and retries are exhausted, for example if the remote target is not serving requests, the job will move to a failed status and require user intervention to proceed. If a job fails, use the --debug flag on the status command to investigate any issues with the remote target and take corrective measures before canceling the failed job with remote job cancel and retrying the original push/pull operation.
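
A typical recovery sequence might look like the following; the path is illustrative, and the exact arguments accepted by remote job cancel are not covered here, so consult --help:

    # Investigate why the job failed:
    beegfs remote status --debug /mnt/beegfs/projects/alpha/results.bin
    # After addressing the underlying issue (for example, restoring access to the remote target),
    # cancel the failed job and retry the original operation:
    beegfs remote job cancel /mnt/beegfs/projects/alpha/results.bin
    beegfs remote push /mnt/beegfs/projects/alpha/results.bin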

In some cases corrective measures may require updating the remote target configuration on the Remote service, for example if the provided credentials have expired. To update remote target configuration, it is currently necessary to first reconfigure and restart the Remote service, then restart each of the Sync services.
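
For example, after editing the relevant [[remote-storage-target]] section in /etc/beegfs/beegfs-remote.toml:

    # On the Remote node:
    systemctl restart beegfs-remote
    # Then on each Sync node:
    systemctl restart beegfs-sync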

FAQs

Does enabling Remote Storage Targets have any impact on file system performance?

When Remote Storage Targets are assigned to entries in BeeGFS, an additional hidden extended attribute is created on the associated inode (or dentry for inlined inodes) that stores the list of targets and other RST-specific configuration for that entry. This minimizes the impact enabling Remote functionality has on the core file system when remote targets are not in use, at the cost of a slight performance penalty when performing certain metadata operations on entries with remote targets configured. For example, you may observe slightly lower performance when creating entries in a directory assigned to remote target(s), as there is an additional extended attribute that must be read from the parent directory and created for each child entry.

Tip

If you are concerned that assigning remote targets to entries is impacting performance, you can entirely remove remote configuration from entries with beegfs entry set --remote-targets=none.

While the performance impact is minimal, it is also possible to use all Remote functionality without assigning remote targets to entries by simply specifying the target you want to interact with using the --remote-target flag when running remote commands such as push, pull, and status. Even if you have remote targets assigned to entries, if you know the remote target you want to interact with, specifying it will optimize command execution, as it allows the beegfs tool to skip reading remote targets from each entry’s metadata.

Assigning remote targets to entries is helpful if you want to associate certain directory trees or files with specific remote targets so you can simply use beegfs remote push to ensure everything is synced with the correct targets. If you are mostly syncing data to a single target, or have only a few directories synced with different targets, you may find that configuring remote targets on individual entries is not needed. The feature is designed to be used with or without configuring remote targets on individual entries, depending on the requirements of a particular environment.