Metadata Node Tuning¶
This page presents some tips and recommendations on how to improve the performance of BeeGFS
metadata servers. As usual, the optimal settings depend on your particular hardware and usage
scenarios, so you should use these settings only as a starting point for your tuning efforts.
Benchmarking tools such as mdtest can help you identify the best settings for your BeeGFS metadata servers.
Some settings suggested here are non-persistent and will be reverted after the next reboot. To keep them permanently, you can add the corresponding commands to /etc/rc.local (as shown in the example below), put them into /etc/sysctl.conf, or create udev rules that reapply them automatically when the machine boots.
#!/bin/bash
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 20 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes
echo 1 > /proc/sys/vm/zone_reclaim_mode

echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/defrag

devices=(sda sdb)
for dev in "${devices[@]}"
do
    echo deadline > /sys/block/${dev}/queue/scheduler
    echo 128 > /sys/block/${dev}/queue/nr_requests
    echo 128 > /sys/block/${dev}/queue/read_ahead_kb
    echo 256 > /sys/block/${dev}/queue/max_sectors_kb
done
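Alternatively, the vm settings from the script above can be persisted via sysctl. A minimal sketch of a drop-in file (the file name /etc/sysctl.d/90-beegfs-meta.conf is only an example):
# /etc/sysctl.d/90-beegfs-meta.conf (example file name)
vm.dirty_background_ratio = 5
vm.dirty_ratio = 20
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 262144
vm.zone_reclaim_mode = 1
The file can be applied without a reboot via sysctl --system.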
Metadata access typically consists of many small random reads and writes. Since small random writes are inefficient on RAID-5 and RAID-6 (due to the involved read-modify-write overhead), it is generally recommended to store metadata on a RAID-1 or RAID-10 volume.
As for storage targets, partitions should be stripe-aligned as described in the Partition Alignment Guide.
Low-latency devices such as SSDs or NVMe drives are recommended as storage devices for metadata targets. See also: Disk Space Requirements.
Extended Attributes¶
BeeGFS metadata is stored as extended file attributes (xattrs) on the underlying file system. One metadata file is created for each file that a user creates. BeeGFS metadata files have a size of 0 bytes (i.e., no regular file contents). Extended attributes can be inspected with the getfattr tool.
If the inodes of the underlying file system are large enough, the xattrs can be inlined directly into the inode, so no additional data blocks are required. This reduces disk usage and also access latency, since no seek to an extra data block is needed.
For backups of metadata, make sure the backup tool supports extended file attributes and the corresponding options are set. See Metadata Daemon Backup for details.
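For example, the attributes of a metadata file can be listed directly on the metadata server; -m - makes getfattr show attributes from all namespaces, not only user.* (the path is a placeholder for a file on your metadata target):
# getfattr -d -m - <path to a file on the metadata target>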
Hardware RAID¶
Partition Alignment & RAID Settings of Local File System¶
To get the maximum performance out of your storage devices, it is important to set each partition offset according to the device's native alignment. Check the Partition Alignment Guide for a walk-through of partition alignment and the creation of a RAID-optimized local file system.
A very simple and commonly used way to achieve alignment without the challenges of partition alignment is to avoid partitioning entirely and instead create the file system directly on the device, as shown in the following sections.
Metadata Server Throughput Tuning¶
In general, the BeeGFS metadata service can use any standard Linux file system. However, ext4 is recommended, because it handles small-file workloads (common on a BeeGFS metadata server) significantly faster than other local Linux file systems.
The default Linux kernel settings are rather optimized for single disk scenarios with low IO concurrency, so there are various settings that need to be tuned to get the maximum performance out of your metadata servers.
Formatting Options¶
When formatting the ext4 partition, it is important to include options that minimize access times for large directories (-Odir_index), to create large inodes that allow storing BeeGFS metadata as extended attributes directly inside the inodes for maximum performance (-I 512), to reserve a sufficient number of inodes (-i 2048), and to use a large journal (-J size=400):
# mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/sdX
If you also use ext4 for your storage server targets, you may want to reserve less space for inodes and keep more space free for file contents by using -i 16384 or higher for those storage targets.
As metadata size increases with the number of targets per file, you should use -I 1024 if you are planning to stripe across more than 4 targets per file by default, or if you are planning to use ACLs or store client-side extended attributes.
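In that case, the format command from above could be adjusted as follows (a sketch; /dev/sdX is a placeholder for your metadata device):
# mkfs.ext4 -i 2048 -I 1024 -J size=400 -Odir_index,filetype /dev/sdX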
Because ext4 has a fixed number of available inodes, it is possible to run out of inodes even if free disk space is still available. Thus, it is important to carefully select the number of inodes with respect to your needs if your metadata disk is small. You can check the number of available inodes with df -ih after formatting. If you need to avoid such a limit, use a different file system (e.g. xfs) instead of ext4.
By default, ext4 does not allow user space processes to store extended attributes. If the beegfs-meta daemon is set to use extended attributes, the underlying file system has to be mounted with the option user_xattr. This option may also be stored permanently in the superblock:
# tune2fs -o user_xattr /dev/sdX
On systems expected to have directories with a very large number of entries (over 10 million), the option large_dir must be set in addition to dir_index. This option increases the capacity of the directory index and is available in Linux kernel 4.13 or newer. Nevertheless, having a very large number of entries in a single directory is not good practice performance-wise. Therefore, end users should be encouraged to distribute their files across multiple subdirectories, even if large_dir is being used.
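If your e2fsprogs version supports it, the feature can be enabled at format time or added to an existing file system, e.g. (a sketch; verify large_dir support in your mkfs.ext4/tune2fs version first):
# mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,large_dir,filetype /dev/sdX
# tune2fs -O large_dir /dev/sdX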
Mount Options¶
To avoid the overhead of updating last access timestamps, the metadata partition can be mounted with the noatime option. This has no influence on the last access timestamps that users see in a BeeGFS mount. Disable last access timestamps by adding noatime to your mount options.
The command below shows typical mount options for BeeGFS metadata servers with a RAID controller.
# mount -onoatime,nodiratime /dev/sdX <mountpoint>
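To make the mount options persistent, a corresponding /etc/fstab entry can be used. A sketch (device and mount point are placeholders; user_xattr is only needed if it is not already stored in the superblock):
/dev/sdX  <mountpoint>  ext4  noatime,nodiratime,user_xattr  0 0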
IO Scheduler¶
The deadline scheduler typically yields the best results for metadata access.
# echo deadline > /sys/block/sdX/queue/scheduler
In order to avoid latencies, the size of the request queue should not be too high; the default value of 128 is a good choice.
# echo 128 > /sys/block/sdX/queue/nr_requests
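Scheduler and queue settings can be reapplied automatically at boot with a udev rule, for example (a sketch; the file name and the sd[a-z] device match are assumptions, and on kernels that use blk-mq the scheduler name is mq-deadline instead of deadline):
# /etc/udev/rules.d/99-beegfs-meta.rules (example file name)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="128"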
When tuning your metadata servers, keep in mind that this is often not so much about throughput, but rather about latency and a certain amount of fairness: there are probably also interactive users on your cluster who want to see the results of their ls and other commands within an acceptable time, so you should try to accommodate them. This means, for instance, that you probably do not want to set a high value for /sys/block/sdX/queue/iosched/read_expire on the metadata servers, to make sure that users will not be waiting too long for their operations to complete.
Virtual Memory Settings¶
Transparent huge pages can cause performance degradation under high load and even stability problems on various kinds of systems. For RHEL 6.x and derivatives, it is highly recommended to disable transparent huge pages, unless huge pages are explicitly requested by an application through madvise:
# echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag
For RHEL 7.x and other distributions, it is recommended to have transparent huge pages enabled:
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
# echo always > /sys/kernel/mm/transparent_hugepage/defrag
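The currently active mode can be verified at runtime; the selected value is shown in square brackets:
# cat /sys/kernel/mm/transparent_hugepage/enabled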
ZFS¶
Software RAID implementations demand more powerful machines than traditional systems with RAID controllers, especially if features like data compression and checksums are enabled. Therefore, using ZFS as the underlying file system of metadata targets requires more CPU power and RAM than a more traditional BeeGFS installation. It also increases the importance of disabling features like CPU frequency scaling.
It is also recommended to be economical with the features enabled in ZFS, e.g. a feature like deduplication uses a lot of resources and can have a significant impact on performance, while not providing any benefit on a metadata target.
Another important factor that impacts performance in such systems is the version of the ZFS packages used. For example, ZFS version 0.7.1 had some performance issues, while higher throughput was observed with ZFS version 0.6.5.11.
Module Parameters¶
After loading the ZFS module, please set the module parameters below before creating any ZFS storage pool.
IO Scheduler¶
Set the IO scheduler used by ZFS. Both noop and deadline, which implement simple scheduling algorithms, are good options, as the metadata daemon runs as a single Linux user.
# echo deadline > /sys/module/zfs/parameters/zfs_vdev_scheduler
Data Aggregation Limit¶
ZFS is able to aggregate small IO operations that handle neighboring or overlapping data into larger operations in order to reduce the number of IOPS. The option zfs_vdev_aggregation_limit sets the maximum amount of data that can be aggregated before the IO operation is finally performed on the disk.
# echo 262144 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
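To apply both parameters automatically whenever the zfs module is loaded, they can be placed in a modprobe configuration file. A sketch (the file name is an example; note that zfs_vdev_scheduler no longer exists in newer OpenZFS releases, so check which parameters your ZFS version supports):
# /etc/modprobe.d/zfs.conf (example file name)
options zfs zfs_vdev_scheduler=deadline zfs_vdev_aggregation_limit=262144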
Creating ZFS Pools for Metadata Targets¶
Basic options like the pool type, cache, and log devices must be defined at pool creation time, as seen in the example below. The mount point is optional, but it is good practice to define it with the option -m, in order to control where the metadata target directory will be located.
# zpool create -m /data/meta001 meta001 mirror sda sdb
Data Protection¶
RAID-Z pool types are not recommended for metadata targets due to the performance impact of parity updates. The recommended type is mirror, unless data safety is considered more important than metadata performance, in which case RAID-Z1 should be used.
Data Compression¶
Data compression is a feature that should be enabled because it reduces the amount of space used by BeeGFS metadata. The CPU overhead caused by the compression functions is compensated by the reduced amount of data involved in the IO operations. Please use the lz4 compression algorithm, which is known to offer a good balance between compression ratio and performance.
# zfs set compression=lz4 meta001
Extended Attributes¶
As explained earlier, BeeGFS metadata is stored as extended attributes of files, and therefore this feature must be enabled, as follows. Please note that the option xattr should be set to sa and not to the default on. The default mechanism stores extended attributes as separate hidden files, while the sa mechanism inlines them into the files they belong to, making the handling of extended attributes much more efficient.
# zfs set xattr=sa meta001
Deduplication¶
Deduplication is a space-saving feature that works by keeping only a single copy of identical data blocks stored in the ZFS file system. This feature has a significant performance impact and should be disabled if the system has plenty of storage space.
# zfs set dedup=off meta001
Unnecessary Properties¶
The BeeGFS metadata service does not rely on the last access time of its files, so this property may be disabled in the ZFS pool, as follows.
# zfs set atime=off meta001
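The resulting property settings can be verified with zfs get, for example:
# zfs get compression,xattr,dedup,atime meta001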
System BIOS & Power Saving¶
To allow the Linux kernel to correctly detect the system properties and enable corresponding optimizations (e.g. for NUMA systems), it is very important to keep your system BIOS updated.
The dynamic CPU clock frequency scaling feature for power saving, which is typically enabled by default, has a high impact on latency. Thus, it is recommended to turn off dynamic CPU frequency scaling. Ideally, this is done in the machine BIOS, where you will often find a general setting like “Optimize for performance”.
If frequency scaling is not disabled in the machine BIOS, recent Intel CPU generations require the parameter intel_pstate=disable to be added to the kernel boot command line, which is typically defined in the grub boot loader configuration file. After changing this setting, the machine needs to be rebooted.
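A sketch of how this is typically done on a grub2-based system (file locations and the command to regenerate the configuration differ between distributions, e.g. update-grub on Debian/Ubuntu):
# vi /etc/default/grub      (append intel_pstate=disable to GRUB_CMDLINE_LINUX)
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot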
If the Intel pstate driver is disabled or not applicable to a system, frequency scaling can be changed at runtime, e.g., via:
# echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null
You can check if CPU frequency scaling is disabled by using the following command on an idle system. If it is disabled, you should see the full clock frequency in the “CPU MHz” line for all CPU cores and not a reduced value.
# cat /proc/cpuinfo
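Alternatively, if the cpufreq subsystem is active, the governor of all cores can be checked directly:
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c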
Concurrency Tuning¶
Worker Threads¶
Storage servers, metadata servers, and clients allow you to control the number of worker threads by setting the value of tuneNumWorkers (in /etc/beegfs/beegfs-X.conf). In general, a higher number of workers allows for more parallelism (e.g. a server will work on more client requests in parallel). For smaller clusters in the range of 100-200 compute nodes, you should set at least tuneNumWorkers=64 in beegfs-meta.conf. For larger clusters, tuneNumWorkers=128 is more appropriate.
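For example, for a larger cluster the corresponding line in /etc/beegfs/beegfs-meta.conf would be the following, followed by a restart of the beegfs-meta service:
tuneNumWorkers = 128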
Parallel Network Requests¶
Each metadata server establishes multiple connections to each of the other servers to enable more parallelism on the network level by having multiple requests in flight to the same server. The tuning option connMaxInternodeNum in /etc/beegfs/beegfs-meta.conf can be used to configure the number of simultaneous connections. The information provided in the client tuning guide also applies to metadata servers: Parallel Network Requests.
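A sketch of the corresponding entry in /etc/beegfs/beegfs-meta.conf (the value is only an illustration; start from the shipped default and benchmark before raising it):
connMaxInternodeNum = 16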
Metadata RAID tuning¶
Advantages and disadvantages of software RAID vs hardware RAID¶
Software RAID allows monitoring the performance of the individual member devices. So if one drive is slower than the others, you will notice higher latencies, utilization, etc. in tools such as 'iostat' for that drive (see the example below).
A feature to monitor the performance of individual member drives of a RAID volume is typically not present in internal hardware RAID controllers. However, beegfs-ctl --storagebench can be used to compare the performance of different storage targets and thus identify suspicious targets.
Software RAID doesn't have a battery-backed cache, which can impact performance for small synchronous writes. However, also for streaming I/O, there are cases where a hardware RAID controller delivers higher throughput.
Replacing a disk with software RAID requires extra commands to add the new disk, while it can be as simple as swapping a drive in the case of hardware RAID.
Hardware RAID controllers do not support the TRIM/DISCARD command for SSDs. Software RAID supports this for RAID levels 0, 1, and 10.
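For example, latencies and utilization of individual RAID member devices can be watched with iostat from the sysstat package (device names are placeholders):
# iostat -xm 5 sdb sdc sdd sde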
Write-intent bitmap¶
In order to speed up resynchronization, or to avoid it entirely, a write-intent bitmap should be set up. Hardware RAID vendors might call this feature a 'write journal'. The bitmap divides the whole RAID array into bitmap chunks, and each of these chunks is covered by one bit. Before a bitmap chunk is written to, its bit in the bitmap is set, and once the write completes, the bit is cleared. If a RAID member device fails for some reason but can be re-added without being completely replaced, the bitmap allows resyncing only those areas that have changed between the failure and the re-adding of the device.
The bitmap may be added at any time, as it is located between the RAID superblock and the real data (a sparse area reserved for the bitmap and other metadata).
Example:
The bitmap-chunk size should not be smaller than 64MiB, as smaller sizes reduce write performance. Recent mdadm versions default to 64MiB, but older versions had a much smaller default.
$ mdadm /dev/md/storage --grow --bitmap=internal --bitmap-chunk=128M
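Whether the bitmap is active can be checked afterwards, e.g. via the bitmap line in /proc/mdstat or in the detailed array output:
$ cat /proc/mdstat
$ mdadm --detail /dev/md/storage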
Stripe cache size¶
For RAID-5/RAID-6 devices, the stripe cache size should be increased to at least 8192.
The default is 256, and with such a small size, read-modify-writes usually reduce write performance.
Depending on CPU and memory performance, values above 8192 might reduce performance again.
The required memory is: 4kiB * number of RAID devices * stripe_cache_size.
Example:
$ echo 8192 > /sys/block/md127/md/stripe_cache_size
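This value is not persistent across reboots. One way to reapply it automatically is a udev rule like the following sketch (the file name is an example; only RAID-5/6 arrays expose this attribute, so verify the rule on your system before relying on it):
# /etc/udev/rules.d/99-md-stripe-cache.rules (example file name)
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", ATTR{md/stripe_cache_size}="8192"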
RAID metadata¶
See also 'man md'.
The main RAID metadata is stored in the md superblock. The current version is 1.2, which is also the default for current mdadm versions.
A version 1.2 superblock is located at an offset of 4kiB from the start of a device.
Between the superblock and the real data there is another offset, used for alignment and further metadata, e.g. the write-intent bitmap.
The default data offset has changed several times over the past few years; the current default is 128MiB.
Very recent mdadm versions allow specifying the data offset with the option --data-offset=.
RAID metadata may be examined with the following commands:
$ mdadm --detail /dev/md/storage
$ mdadm --examine /dev/sdX     # member device
Device name persistence¶
In order to keep device names persistent, the md device should be accessed via an alias device name.
Not good: /dev/md1
Good: /dev/md/storage
The device can already be created with this alias:
$ mdadm --create /dev/md/storage --level=6 --chunk=128 -n12 --assume-clean /dev/sd[b,e-o]
The device should be added with that name to mdadm.conf
$ mdadm --examine --scan >>mdadm.conf
Location of mdadm.conf¶
RHEL and SLES:
/etc/mdadm.conf
Debian:
/etc/mdadm/mdadm.conf