Metadata Node Tuning

This page presents some tips and recommendations on how to improve the performance of BeeGFS metadata servers. As usual, the optimal settings depend on your particular hardware and usage scenarios, so you should use these settings only as a starting point for your tuning efforts. Benchmarking tools such as mdtest can help you identify the best settings for your BeeGFS metadata servers.

Some settings suggested here are non-persistent and will be reverted after the next reboot. To keep them permanently, you can add the corresponding commands to /etc/rc.local (as seen in the example below), use /etc/sysctl.conf, or create udev rules to reapply them automatically when the machine boots.

#!/bin/bash
# virtual memory tunables
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 20 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes
echo 1 > /proc/sys/vm/zone_reclaim_mode

# transparent huge pages (recommended setting for RHEL 7.x and other recent
# distributions, see the Virtual Memory Settings section below)
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/defrag

# IO scheduler and request queue settings for all metadata target block devices
devices=(sda sdb)
for dev in "${devices[@]}"
do
  echo deadline > /sys/block/${dev}/queue/scheduler
  echo 128 > /sys/block/${dev}/queue/nr_requests
  echo 128 > /sys/block/${dev}/queue/read_ahead_kb
  echo 256 > /sys/block/${dev}/queue/max_sectors_kb
done

Metadata access typically consists of many small random reads and writes. Since small random writes are inefficient on RAID-5 and RAID-6 (due to the involved read-modify-write overhead), it is generally recommended to store metadata on a RAID-1 or RAID-10 volume.

As with storage targets, partitions should be stripe aligned as described in the Partition Alignment Guide.

Low-latency devices like SSDs or NVMe drives are recommended as storage devices for metadata targets. See also: Disk Space Requirements.

Extended Attributes

BeeGFS metadata is stored as extended file attributes (xattrs) on the underlying file system. One metadata file is created for each file that a user creates. BeeGFS metadata files have a size of 0 bytes (i.e., no normal file contents). Access to extended attributes is possible with the getfattr tool.
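
As an illustration, all extended attributes of a metadata file on the underlying file system can be dumped with getfattr as follows (the file path is just a placeholder for a file inside the metadata target directory):

# getfattr -d -m - <metadata file>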

If the inodes of the underlying file system are sufficiently large, xattrs can be inlined into the inode of the underlying file system, and additional data blocks are not required, reducing the disk usage. With xattrs inlined, access latencies are reduced as seeking to an extra data block is not required.

For backups of metadata, make sure the backup tool supports extended file attributes and the corresponding options are set. See Metadata Daemon Backup for details.

Hardware RAID

Partition Alignment & RAID Settings of Local File System

To get the maximum performance out of your storage devices, it is important to set each partition offset according to their respective native alignment. Check the Partition Alignment Guide for a walk-through about partition alignment and creation of a RAID-optimized local file system.

A very simple and commonly used method to achieve alignment without the challenges of partition alignment is to avoid partitioning completely and instead create the file system directly on the device, as shown in the following sections.

Metadata Server Throughput Tuning

In general, the BeeGFS metadata service can use any standard Linux file system. However, ext4 is recommended, because it handles small file workloads (common on a BeeGFS metadata server) significantly faster than other local Linux file systems.

The default Linux kernel settings are rather optimized for single disk scenarios with low IO concurrency, so there are various settings that need to be tuned to get the maximum performance out of your metadata servers.

Formatting Options

When formatting the ext4 partition, it is important to include options that minimize access times for large directories (-Odir_index), to create large inodes that allow storing BeeGFS metadata as extended attributes directly inside the inodes for maximum performance (-I 512), to reserve a sufficient number of inodes (-i 2048), and to use a large journal (-J size=400):

# mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/sdX

If you also use ext4 for your storage server targets, you may want to reserve less space for inodes and keep more space free for file contents by using -i 16384 or higher for those storage targets.

As metadata size increases with the number of targets per file, you should use -I 1024 if you are planning to stripe across more than 4 targets per file by default or if you are planning to use ACLs or store client-side extended attributes.

Because ext4 has a fixed number of inodes, it is possible to run out of them even while free disk space is still available. Thus, it is important to carefully select the number of inodes with respect to your needs if your metadata disk is small. You can check the number of available inodes by using df -ih after formatting. If you need to avoid such a limit, use a different file system (e.g., xfs) instead of ext4.
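
For example, the inode usage of a formatted and mounted metadata target can be checked like this (the mountpoint is a placeholder):

# df -ih <mountpoint>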

By default, ext4 does not allow user space processes to store extended attributes. If the beegfs-meta daemon is set to use extended attributes, the underlying file system has to be mounted with the option user_xattr. This option may also be stored permanently in the superblock:

# tune2fs -o user_xattr /dev/sdX

In systems expected to have directories with a large number of entries (over 10 million), the option large_dir must be set along with dir_index. This option increases the capacity of the directory index and is available in Linux kernel 4.13 or newer. Nevertheless, having a large number of entries in a single directory is not a good practice performance-wise. Therefore, end users should be encouraged to distribute their files across multiple subdirectories, even if the option large_dir is being used.
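
As a sketch based on the formatting command above, large_dir can be enabled together with the other options at mkfs time (this requires a sufficiently recent e2fsprogs; adjust the remaining options to your setup):

# mkfs.ext4 -i 2048 -I 512 -J size=400 -O dir_index,filetype,large_dir /dev/sdX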

Mount Options

To avoid the overhead of updating last access timestamps, the metadata partition can be mounted with the noatime option. This has no influence on the last access timestamps that users see in a BeeGFS mount.

The command below shows typical mount options for BeeGFS metadata servers with a RAID controller.

# mount -onoatime,nodiratime /dev/sdX <mountpoint>
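
To make these mount options persistent across reboots, a corresponding /etc/fstab entry can be used, for example as sketched below (device and mountpoint are placeholders; user_xattr is only needed here if it was not already stored in the superblock via tune2fs):

/dev/sdX  <mountpoint>  ext4  noatime,nodiratime,user_xattr  0 0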

IO Scheduler

The deadline scheduler typically yields the best results for metadata access.

# echo deadline > /sys/block/sdX/queue/scheduler

To avoid high latencies, the size of the request queue should not be set too high; the default value of 128 is good.

# echo 128 > /sys/block/sdX/queue/nr_requests

When tuning your metadata servers, keep in mind that this is often not so much about throughput, but rather about latency and a certain amount of fairness: there are probably also interactive users on your cluster who want to see the results of their ls and other commands within an acceptable time. This means, for instance, that you probably do not want to set a high value for /sys/block/sdX/queue/iosched/read_expire on the metadata servers, to make sure that users are not waiting too long for their operations to complete.
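
The current deadline tunables of a device can be inspected (and adjusted) through the iosched directory, for example:

# cat /sys/block/sdX/queue/iosched/read_expire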

Virtual Memory Settings

Transparent huge pages can cause performance degradation under high load and even stability problems on various kinds of systems. For RHEL 6.x and derivatives, it is highly recommended to disable transparent huge pages, unless huge pages are explicitly requested by an application through madvise:

# echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag

For RHEL 7.x and other distributions, it is recommended to have transparent huge pages enabled:

# echo always > /sys/kernel/mm/transparent_hugepage/enabled
# echo always > /sys/kernel/mm/transparent_hugepage/defrag
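
The currently active transparent huge page policy can be verified as follows; the active value is shown in square brackets:

# cat /sys/kernel/mm/transparent_hugepage/enabled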

ZFS

Software RAID implementations demand more powerful machines than traditional systems with RAID controllers, especially if features like data compression and checksums are enabled. Therefore, using ZFS as the underlying file system of metadata targets will require more CPU power and RAM than a more traditional BeeGFS installation. It also increases the importance of disabling features like CPU frequency scaling.

It is also recommended to be economical with the options enabled in ZFS, e.g., a feature like deduplication uses a lot of resources and can have a significant impact on performance, while not providing any benefit on a metadata target.

Another important factor that impacts performance in such systems is the version of the ZFS packages used. For example, ZFS version 0.7.1 had some performance issues, while higher throughput was observed with ZFS version 0.6.5.11.

Module Parameters

After loading the ZFS module, please set the module parameters below before creating any ZFS storage pool.

IO Scheduler

Set the IO scheduler used by ZFS. Both noop and deadline, which implement simple scheduling algorithms, are good options, as the metadata daemon is run by a single Linux user.

# echo deadline > /sys/module/zfs/parameters/zfs_vdev_scheduler

Data Aggregation Limit

ZFS is able to aggregate small IO operations that handle neighboring or overlapping data into larger operations in order to reduce the number of IOPs. The option zfs_vdev_aggregation_limit sets the maximum amount of data that can be aggregated before the IO operation is finally performed on the disk.

# echo 262144 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
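
Since values written to /sys/module/... are lost on reboot, the two parameters from this section can also be applied persistently via a modprobe configuration file, e.g. by creating a file such as /etc/modprobe.d/zfs.conf (the file name is an arbitrary choice) with the following line:

options zfs zfs_vdev_scheduler=deadline zfs_vdev_aggregation_limit=262144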

Creating ZFS Pools for Metadata Targets

Basic options like the pool type, cache, and log devices must be defined at the creation time of the pool, as seen in the example below. The mount point is optional, but it is good practice to define it with the option -m, in order to control where the metadata target directory will be located.

# zpool create -m /data/meta001 meta001 mirror sda sdb
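
After creating the pool, its layout and the resulting mount point can be verified, for example with:

# zpool status meta001
# zfs list meta001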

Data Protection

RAID-Z pool types are not recommended for metadata targets due to the performance impact of parity updates. The recommended type is mirror, unless data safety is considered more important than metadata performance; in that case, RAID-Z1 should be used.

Data Compression

Data compression is a feature that should be enabled, because it reduces the amount of space used by BeeGFS metadata. The CPU overhead caused by the compression functions is compensated for by the decrease in the amount of data involved in the IO operations. Please use the compression algorithm lz4, which is known to have a good balance between compression ratio and performance.

# zfs set compression=lz4 meta001

Extended Attributes

As explained earlier, BeeGFS metadata is stored as extended attributes of files, and therefore this feature must be enabled, as follows. Please note that the option xattr should be set to sa and not to the default on. The default mechanism stores extended attributes as separate hidden files, while the sa mechanism inlines them with the actual files they belong to, making the management of extended attributes much more efficient.

# zfs set xattr=sa meta001

Deduplication

Deduplication is a space-saving feature that works by keeping a single copy of multiple identical files stored in the ZFS file system. This feature has a significant performance impact and should be disabled if the system has plenty of storage space.

# zfs set dedup=off meta001

Unnecessary Properties

The BeeGFS metadata service does not need the underlying file system to update access times. So, this property may be disabled in ZFS pools, as follows.

# zfs set atime=off meta001
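
The properties set in the sections above can be verified in a single command, for example:

# zfs get compression,xattr,dedup,atime meta001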

System BIOS & Power Saving

To allow the Linux kernel to correctly detect the system properties and enable corresponding optimizations (e.g. for NUMA systems), it is very important to keep your system BIOS updated.

The dynamic CPU clock frequency scaling feature for power saving, which is typically enabled by default, has a high impact on latency. Thus, it is recommended to turn off dynamic CPU frequency scaling. Ideally, this is done in the machine BIOS, where you will often find a general setting like “Optimize for performance”.

If frequency scaling is not disabled in the machine BIOS, recent Intel CPU generations require the parameter intel_pstate=disable to be added to the kernel boot command line, which is typically defined in the grub boot loader configuration file. After changing this setting, the machine needs to be rebooted.

If the Intel pstate driver is disabled or not applicable to a system, frequency scaling can be changed at runtime, e.g., via:

# echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null

You can check whether CPU frequency scaling is disabled by using the following command on an idle system. If it is disabled, you should see the full clock frequency in the "cpu MHz" lines for all CPU cores, and not a reduced value.

# cat /proc/cpuinfo
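
To show only the frequency lines for all cores, the output can be filtered, for example:

# grep "cpu MHz" /proc/cpuinfo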

Concurrency Tuning

Worker Threads

Storage servers, metadata servers, and clients allow you to control the number of worker threads by setting the value of tuneNumWorkers (in /etc/beegfs/beegfs-X.conf). In general, a higher number of workers allows for more parallelism (e.g. a server will work on more client requests in parallel). For smaller clusters in the range of 100-200 compute nodes, you should set at least tuneNumWorkers=64 in beegfs-meta.conf. For larger clusters, tuneNumWorkers=128 is more appropriate.
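
For example, for a smaller cluster, the corresponding line in /etc/beegfs/beegfs-meta.conf would look like this:

tuneNumWorkers = 64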

Parallel Network Requests

Each metadata server establishes multiple connections to each of the other servers to enable more parallelism on the network level by having multiple requests in flight to the same server. The tuning option connMaxInternodeNum in /etc/beegfs/beegfs-meta.conf can be used to configure the number of simultaneous connections. The information provided in the client tuning guide also applies to metadata servers: Parallel Network Requests.
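
A corresponding excerpt from /etc/beegfs/beegfs-meta.conf could look as follows; the value shown here is only illustrative and should be chosen based on the client tuning guide:

connMaxInternodeNum = 8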

Metadata RAID tuning

Advantages and disadvantages of software RAID vs hardware RAID

  • Software RAID allows monitoring the performance of member devices: if one drive is slower than the others, one will notice higher latencies, utilization, etc. for that drive in tools such as ‘iostat’.

  • While a feature to monitor performance of individual member drives of a RAID volume is typically not present in internal hardware RAID controllers, beegfs-ctl --storagebench can be used to compare performance of different storage targets and thus identify suspicious targets.

  • Software RAID doesn’t have a battery-backed cache, which can impact performance for small synchronous writes. However, there are also cases where a hardware RAID controller delivers higher throughput for streaming I/O.

  • Replacing a disk with software RAID requires extra commands to add the new disk, while it can be as simple as swapping a drive in the case of hardware RAID.

  • Hardware RAID controllers typically do not pass the TRIM/DISCARD command through to SSDs, while Linux software RAID supports it for RAID levels 0, 1, and 10.

Write-intent bitmap

In order to speed up or avoid resynchronization, a write-intent bitmap should be set up. Hardware RAID vendors might call this feature a ‘write journal’. The bitmap divides the whole RAID array into bitmap-chunks, and each of these chunks is covered by a bit. Before writing to a bitmap-chunk, the corresponding bit is set, and once the write completes, the bit is cleared. If a RAID member device fails for some reason, but it is possible to re-add it without completely replacing it, the bitmap allows resyncing only those areas that have changed between the failure and the re-adding of the device.

  • The bitmap may be added at any time, as it is located between the RAID superblock and the real data (a sparse area, reserved for the bitmap and other metadata).

  • The bitmap-chunk size should not be smaller than 64MiB, as smaller sizes reduce write performance. Recent mdadm versions have a default of 64MiB, but older versions had a much smaller default.

  • Example:

    $ mdadm /dev/md/storage --grow --bitmap=internal --bitmap-chunk=128M

Stripe cache size

For RAID5/RAID6 devices, one should increase the stripe cache size to >=8192

  • The default is 256, and with such a small size, read-modify-writes usually reduce write performance

  • Depending on CPU and memory performance, values >8192 might reduce performance

  • The required memory is: 4KiB * number-of-raid-devices * stripe-cache-size (e.g., 4KiB * 12 devices * 8192 = 384MiB)

  • Example:

    $ echo 8192 > /sys/block/md127/md/stripe_cache_size
    

RAID metadata

  • Also see ‘man md’

  • Main RAID metadata is stored in the md super block. The current version is 1.2 and that is also the default for current mdadm versions.

  • The version 1.2 superblock has an offset of 4KiB from the start of a device

  • Between the superblock and the real data there is another offset, used for alignment and further metadata, e.g. the write-intent bitmap.

  • The default data offset has changed several times over the past few years; the current default is 128MiB.

  • Very recent mdadm versions allow specifying the data offset with the option --data-offset=

  • RAID metadata may be examined with the commands

    $ mdadm --detail /dev/md/storage
    $ mdadm --examine /dev/sdX    # member device
    

Device name persistence

In order to keep device names persistent, the md device should be accessed via an alias device name

  • Not good: /dev/md1

  • Good: /dev/md/storage

  • The device can already be created with this alias

    $ mdadm --create /dev/md/storage --level=6 --chunk=128 -n12 --assume-clean /dev/sd[b,e-o]
    
  • The device should be added with that name to mdadm.conf

    $ mdadm --examine --scan >>mdadm.conf
    

Location of mdadm.conf

  • RHEL and SLES: /etc/mdadm.conf

  • Debian: /etc/mdadm/mdadm.conf