BeeGFS Hive Index¶

In today’s world where enormous amounts of data are generated every day, especially in supercomputing datacenters, managing and querying metadata at scale is a significant challenge. Additionally, accessing internal metadata of the BeeGFS parallel filesystem requires custom mechanisms.

BeeGFS Hive Index addresses these challenges by creating a hierarchical metadata index that extracts filesystem metadata along with BeeGFS-specific attributes and stores them efficiently in a local database. It provides a simple, high-performance interface to query detailed metadata, including file names, timestamps (access, modification, creation), permissions, file size, extended attributes, and BeeGFS-specific information like stripe pattern, storage targets, and owner metadata node ID, all without affecting general filesystem operations.

Since filesystems are continuously updated with ongoing operations, the index must be updated to reflect the latest changes. Users can rescan a specific subdirectory manually to update its metadata or perform a recursive rescan to update the entire directory tree, ensuring that the index stays in sync with the filesystem.

Description¶

BeeGFS Hive Index enables users to efficiently search through billions of files and directories while minimizing the impact on filesystem performance. Its multithreaded design allows parallel scanning of multiple directories, ensuring fast and efficient metadata extraction. The index supports POSIX user permissions, directory tree attributes, and hierarchical representation.

By default, BeeGFS Hive Index extracts BeeGFS-specific metadata for regular files using BeeGFS ioctls. However, these additional ioctls can increase the filesystem load during index creation. To mitigate this, users have the option to disable BeeGFS-specific metadata extraction.

Only the root user can create the index, as the process requires scanning the entire directory hierarchy, which may include files and directories owned by different users. Therefore, it is recommended that an administrator perform the index creation. Maintaining a single index directory per BeeGFS filesystem helps ensure consistency with ongoing filesystem changes.

Since the index preserves user permissions, it can be shared among users and administrators. Even with a shared index, users can query only their own data, ensuring access control.

BeeGFS Hive Index supports two indexing methods:

In-Tree Index Stored within the BeeGFS filesystem.
Out-Tree Index Stored at a different location, such as another local filesystem.

Warning

Creating an index or rescanning directories to update the index should not run in parallel across multiple instances, as this can cause database corruption or an inconsistent index state.

In-Tree index¶

In the In-Tree Index method, the index is created directly inside the BeeGFS filesystem. Each subdirectory will contain a separate .bdm.db database file, which holds metadata specific to that subdirectory. The database stores detailed metadata information about the files within the directory, such as filenames, timestamps, file size, extended attributes, and more.

However, creating the index inside BeeGFS can potentially interfere with ongoing filesystem operations, especially during high-traffic periods. To minimize the impact, it is recommended to schedule metadata scanning during periods of low filesystem activity if the index needs to reside within the BeeGFS filesystem.

Out-Of-Tree Index¶

The Out-Of-Tree Index method allows users to create the BeeGFS filesystem index at an external location, such as a local filesystem or an NFS share.

During the index creation process, a directory hierarchy is generated at the chosen index location, mirroring the structure of the source directory. Each directory in the index stores metadata information about its files and subdirectories. The index hierarchy preserves the user permissions of the source directory, enabling faster metadata queries while maintaining access control.

In BeeGFS, metadata can be distributed across multiple metadata nodes. With the Out-Of-Tree Index, metadata is stored locally, making queries significantly faster without affecting critical filesystem operations.

../_images/hive_index_out-of-tree-index.png

In-Tree vs Out-Of-Tree Index¶

In-Tree Index:

Pros: No additional storage needed as index database files are stored inside the BeeGFS file system itself.

Cons: Indexing operations may impact filesystem performance, especially during high traffic periods.

Out-Tree Index:

Pros: As index databases are stored outside the file system, running queries on the indexed data will not affect the on-going important file system operations.

Cons: Requires dedicated storage space to store the indexed metadata of the file system.

BeeGFS Hive Index Database¶

BeeGFS Hive Index database (.bdm.db stored inside each subdirectory) holds three types of records as listed below:

Entries Table: Stores detailed metadata information about files and
links (filename, size, blks, mtime, ctime, atime, extended attributes, also BeeGFS specific metadata like stripe pattern, storage targets, metadata owner ID, the file’s entryID, the parent’s entryID, etc).
Directory Summary Table: Stores information like total files inside
the directory, total subdirectories inside the directory, total size of the directory (sum of all file’s sizes inside the directory) , etc.
Tree Summary Table (optional): Stores a summary of the complete
directory hierarchy. This information is present only at the top level directory. The table contains information like total files in the directory tree, total subdirectories, maxfilesize, minfilesize.

Step-By-Step Guide for BeeGFS Hive Index¶

Prerequisites¶

The following packages need to be installed on the node that runs BeeGFS Hive Index:

libpcre3

coreutils

SQLite

python3

python3-distutils

Installation¶

Visit the BeeGFS download page and follow the step-by-step directions to add the BeeGFS package repositories to your Linux distribution’s package manager. Using your distribution’s package manage install the beegfs-hive-index package.

Initial configuration¶

BeeGFS Hive Index commands use a configuration file which must be present at

/etc/beegfs/index/config

A sample configuration file will be installed during the package installation at /etc/beegfs/index/config.example which can be modified and renamed to /etc/beegfs/index/config

Following are the important parameters from the config file:

# Number of threads for running index operations
Threads=10

Number of threads to run the Hive Index commands. You should set this count as per the number of cores available on the system where the Hive Index will be created.

# absolute path to bee executables
# single path string
Executable=/opt/beegfs

The absolute path where the BeeGFS Hive Index binaries are installed. Default path is /opt/beegfs.

# Source directory path:Index directory path
IndexPaths=/mnt/beegfs:/mnt/index

Colon separated path to BeeGFS Filesystem directory for which index will be created and absolute path for the index directory. (<mount-point>:<index-directory-path>), for example IndexPaths=/mnt/beegfs:/work/index.

These two paths allow Hive Index commands to run without explicitly specifying absolute paths for source directory or Index directory.

If you want to create the index inside the file system directory itself, only specify the source directory path in IndexPaths, for example IndexPaths=/mnt/beegfs.

# Filesystem mount point path
MountPath=/mnt/beegfs

BeeGFS mount point.

BeeGFS Hive Index Utilities¶

All BeeGFS Hive Index operations can be performed through a beegfs tool using beegfs index.

Operations like creating an index, rescanning the index, listing the files from the index and finding files from the index are the subcommands to beegfs index.

BeeGFS Hive Index Creation¶

You can create an index using In-Tree (Inside Filesystem) or Out-Of-Tree(Outside Filesystem) options as per the requirements.

For creating the index use the subcommand beegfs index create.

In-Tree Indexing (Inside the Filesystem)

If you want to store the index inside the BeeGFS filesystem itself, specify the same path for both the source directory and the index directory: /etc/beegfs/index/config

IndexPaths=/mnt/beegfs:/mnt/beegfs

Out-Of-Tree Indexing (Outside the Filesystem)

If you want to store the index outside the BeeGFS filesystem, specify a different path outside filesystem for the index directory /etc/beegfs/index/config

IndexPaths=/mnt/beegfs:/mnt/index

See beegfs index create –help for all available options.

Warning

Do not run multiple create index instances for the same filesystem in parallel to prevent data inconsistency or database corruption.

Creating an In-Tree index¶

With the appropriate IndexPaths already set in /etc/beegfs/index/config simply run:

$ beegfs index create -B

Instead of using a config file you could also directly specify the path where the in-tree Index should be created and the path where BeeGFS is is mounted. You can also control the number of threads used for scanning and impose a maximum limit on memory used by the scanner:

$ beegfs index create -F /mnt/beegfs -I /mnt/beegfs --mount /mnt/beegfs --num-workers 4 -X 8G -s -B

Creating an Out-Of-Tree Index¶

If the config file has both file system source directory path and index path mentioned, beegfs index create command will pick up the paths from config file.

$ beegfs index create -F /mnt/beegfs -I /work/index --mount /mnt/beegfs -B

create out-tree index with tree summary table and memory limit 8G and #4 threads

$ beegfs index create -F /mnt/beegfs -I /work/index --mount /mnt/beegfs -X 8GB --num-workers 4 -s -B

beegfs index rescan¶

The index can be updated manually using the rescan functionality, which refreshes the metadata for a specific subdirectory in a previously indexed filesystem.

If users know which directories have changed over time, they can rescan only those specific directories to update the index. Alternatively, a periodic rescan of the entire directory tree can be scheduled to keep the index up to date. Running rescans during low-traffic periods is recommended to minimize the impact on important filesystem operations.

The rescan command ensures that: - Newly created files and directories are added to the index. - Deleted files and directories are removed from the index.

Modes of Rescan:

Rescan (Non-recursive)

Updates metadata for the specified subdirectory.

Indexes newly created files within the subdirectory.

Detects and indexes newly created immediate child subdirectories.

Deletes stale immediate child subdirectories from the index.

Does not update existing child subdirectories that were already indexed.

Rescan with recursion

Recursively updates the index for the entire subdirectory tree.
Updates metadata for all files and subdirectories, including existing indexed subdirectories.
Detects and indexes newly created files and directories at all levels.
Removes stale directories and files from the index.

One or more sub directories of the filesystem can be rescanned using:

See beegfs index rescan –help for all available options.

$ beegfs index rescan sub-dir1/ sub-dir2/

To rescan a subdirectory recursively:

$ beegfs index rescan  --recurse /mnt/beegfs/sub-dir1/

Warning

Avoid running multiple rescan instances for the subdirectories in parallel to prevent data inconsistency or database corruption.

beegfs index stats¶

The stats subcommand allows users to obtain file system statistics.

like the total number of files, directories or links in the directory hierarchy, files, directories, or links per level, maximum and minimum file sizes..

Note

beegfs index stats should be run from Index directory or the Filesystem directory. We can also provide an absolute path of directory.

See beegfs index create –help for all available options.

$ beegfs index stats filecount /work/index/arch/alpha

$ beegfs index stats -c total-filecount /mnt/beegfs

Some of the important options are:

1. total-filecount: Get the total number of files under a directory. It reports the count per uid. But with –cumulative shows cumulative numbers rather than per uid.

2. total-linkcount: Similar to total-filecount, gets the total number of links under a directory.

$ beegfs index stats -c total-filecount /mnt/beegfs
6

$ beegfs index stats total-linkcount /mnt/beegfs/test2
root 1

3. dirs-per-level/files-per-level/links-per-level: Get the count of <type>s in each directory level per uid.

$ beegfs index stats files-per-level /mnt/beegfs
root 0 14
root 1 801
root 2 11182
root 3 17738

$ beegfs index stats dirs-per-level /mnt/beegfs
root 0 1
root 1 19
root 2 468
root 3 1104

For other options please check the beegfs index stats --help

beegfs index find¶

The find subcommand can be used to find the files in an index directory hierarchy. find has very similar options to GNU find and allows users to get results by running queries over the index directory. If BeeGFS specific metadata is extracted during create index, that metadata can also be queried using find command. Hive’s find is way faster than running actual find commands over the file system.

Following are some of the options from beegfs index find --help

See beegfs index find –help for all available options.

Following are some of the examples for beegfs index find:

To get the list of files which are greater than 1GB in size

$ beegfs index find /mnt/beegfs --size +1G
/mnt/beegfs/dataset0/user1/test1.txt
/mnt/beegfs/dataset0/application.txt
/mnt/beegfs/dataset0/user2/test.txt

To get the list of files which are created within 24 hours.

$ beegfs index find /mnt/beegfs --ctime=-1
/mnt/beegfs/dataset0/file.txt

To get the list of files which are not accessed from the last 10 days.

$ beegfs index find /mnt/beegfs --atime -10
/mnt/beegfs/dataset2/file1.txt

Some important options to query BeeGFS specific metadata¶

Reverse map the entryID to filename:

$ beegfs index find /mnt/beegfs --entryid "0-660D2950-1"
/mnt/beegfs/linux.tar.gz

List files whose data is stored on given storage target:

$ beegfs index find /mnt/beegfs --targetid "101"
/mnt/beegfs/newfile
/mnt/beegfs/file3.txt
/mnt/beegfs/linux.tar.gz

List files whose metadata is stored on give metadata node:

$ beegfs index find /mnt/beegfs --ownerid "1"
/mnt/beegfs/dataset0/stg1/test1.txt
/mnt/beegfs/dataset0/stg1/test2.txt
/mnt/beegfs/dataset0/nmv/data1

beegfs index ls¶

beegfs index ls lists the index directory contents. It has similar options as standard ls commands. beegfs index ls can work with absolute paths and relative paths both. Run from index directory or file system directory By adding the --beegfs flag, users can print BeeGFS specific metadata for a file. if running with relative paths.

Displays the contents of the index directory.

This command works similarly to the standard “ls” command, supporting both absolute and relative paths. You can use it from within the index directory or from a filesystem directory when specifying relative paths.

See beegfs index ls –help for all available options.

Example: List the contents of the index directory at /mnt/index. $ beegfs index ls /mnt/index

To print the inode number of the files in a directory:

$beegfs index ls -i user/data/
3224807882485797791 core.c
9553679804207926350 kfence.h
10232643267511233810 kfence_test.c
16487850910921490669 Makefile
348824457185005735 report.c

To sort the files in a directory in descending order and print the size:

$ beegfs index ls arch/alpha/boot/ -Ss
bootpz.c
stdio.c
bootp.c
misc.c

To list the files in a directory with sizes in human readable format:

$ beegfs index ls -hls
-rw-rw-r-- 1 root root 496.0 Feb 11 08:26 COPYING
-rw-rw-r-- 1 root root 98.6K Feb 11 08:26 CREDITS
drwxrwxr-x 4 root root 144.0 Feb 11 08:26 data
drwxrwxr-x 81 root root 97.0 Feb 11 08:26 Documentation

List the files with BeeGFS specific metadata¶

-b or –beegfs: List the BeeGFS specific metadata for the regular files. It prints EntryID, ParentID, Metadata OwnerID, Chunk Size and Number of targets respectively.

$ beegfs index ls -b

4-66854319-2 0-668542DD-2 1 524288 5 analytics
1-66854319-2 0-668542DD-2 1 524288 2 edge_solution.txt
2-66854319-2 0-668542DD-2 1 524288 4 data.log

beegfs index stat¶

beegfs index stat displays file and directory metadata information. It has similar options as GNU stat. With a special --beegfs option, user can print BeeGFS specific metadata for a file.

beegfs index stat can be run with the absolute path of the file system or index directory from anywhere. To use relative paths, beegfs index stat commands should be run from inside a file system source directory or index directory.

$ beegfs index stat file.txt

File: '/dataset/index/file.txt'
Size: 727 Blocks: 2 IO Block: 524288 regular file
Device: h/ d Inode: 8523084679231612220 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 0/ root) Gid: ( 0/ root)
Context:
Access: 2023-01-03 09:09:48 +0100
Modify: 2022-02-11 09:26:32 +0100
Change: 2023-01-03 09:09:48 +0100
Birth:

To include BeeGFS specific metadata, add the ``–beegfs`` flag.

$ beegfs index stat --beegfs system.log
path: system.log
Entry type: file
EntryID: 0-66854319-2
ParentID: 0-668542DD-2
Metadata Owner ID: 2
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 524288
+ Number of storage targets: 4
 + Target ID 501
 + Target ID 601
 + Target ID 701
 + Target ID 801

beegfs index query¶

beegfs index query subcommand allows users to execute SQL queries directly over the index database. This command can be useful if users want to get specific metadata information which is not accessible using the Hive Index commands.

See beegfs index query –help for all available options.

List the database entries for files inside net directory

$ beegfs index query -I . -s "select * from entries"
|id|name|type|inode|mode|nlink|uid|gid|size|blksize|blocks|atime|mtime|ctime|linkname|xattrs|crtime|
ossint1|ossint2|ossint3|ossint4|osstext1|osstext2|pinode|ownerID|entryID|parentID|entryType|featureFlag|
stripe_pattern_type|chunk_size|num_targets|target_info|

1|devres|f|11002537685141370162|33188|1|0|0|25560551|524288|49923|1720521953|1720521953|1720521953|||0|0|0|0|0|||
140174793085336|1|7D8-668D14D1-2|0-668D1457-1|2|3|1|524288|2|1:2:|
2|README|f|16374403434744466135|33188|1|0|0|21258827|524288|41522|1720521953|1720521953|1720521953|||0|0|0|0|0|||
140174793085336|1|7D9-668D14D1-2|0-668D1457-1|2|3|1|524288|2|1:2:|
3|spec|f|17175560844328665268|33188|1|0|0|23669|524288|47|1720521954|1720521954|1720521954|||0|0|0|0|0|||
140174793085336|1|90E-668D14D1-2|0-668D1457-1|2|3|1|524288|2|2:1:|

Get the summary information for net directory

$ beegfs index query -I net/ -s "select * from summary"
|id|name|type|inode|mode|nlink|uid|gid|size|blksize|blocks|atime|mtime|ctime|linkname|xattrs|totfiles|totlinks|minuid|maxuid|
|mingid|maxgid|minsize|maxsize|totltk|totmtk|totltm|totmtm|totmtg|totmtt|totsize|minctime|maxctime|minmtime|maxmtime|minatime|
|maxatime|minblocks|maxblocks|totxattr|depth|mincrtime|maxcrtime|minossint1|maxossint1|totossint1|minossint2|maxossint2|
|totossint2|minossint3|maxossint3|totossint3|minossint4|maxossint4|totossint4|rectype|pinode|

|1|net|d|5920216113766933286|16893|72|0|0|76|524288|1|1672733464|1644567992|1672733467|||6|0|0|0|0|0|2250|89444|0|6|6|0|0|0|126954|1672733464|1672733467|
|1644567992|1644567992|1672733464|167273|3467|5|175|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|10872060183036577154|

Limitations and Known Issues¶

Currently, BeeGFS Hive Index supports only manual index creation and rescan of subdirectory or while directory tree on previously created index. It does not automatically keep the index in sync with ongoing filesystem operations. To update the index users needs to manually scan the filesystem to update the index with the latest information.

Potential Future Enhancements¶

Future versions of BeeGFS Hive Index may integrate with BeeGFS Watch, which provides a mechanism to deliver modification events from BeeGFS metadata servers. These events will be used to efficiently update the index, eliminating the need for full rescans and ensuring the index remains up-to-date with ongoing filesystem changes.

BeeGFS Hive Index¶

Description¶

In-Tree index¶

Out-Of-Tree Index¶

In-Tree vs Out-Of-Tree Index¶

BeeGFS Hive Index Database¶

Step-By-Step Guide for BeeGFS Hive Index¶

Prerequisites¶

Installation¶

Initial configuration¶

BeeGFS Hive Index Utilities¶

BeeGFS Hive Index Creation¶

Creating an In-Tree index¶

Creating an Out-Of-Tree Index¶

beegfs index rescan¶

beegfs index stats¶

beegfs index find¶

Some important options to query BeeGFS specific metadata¶

beegfs index ls¶

List the files with BeeGFS specific metadata¶

beegfs index stat¶

beegfs index query¶

Limitations and Known Issues¶

Potential Future Enhancements¶

Documentation

Table of Contents

Previous topic

Next topic