Hello guys!
I decided to conduct some research to choose the optimal file system in terms of the performance/cost ratio per 1 TB. This series of articles summarizes my personal experience operating Storj nodes; it does not claim to be the ultimate truth and represents only my personal opinion.
The storj node itself is a lightweight application and, under normal circumstances, does not consume much CPU or RAM. However, even in normal situations the node can generate an enormous number of I/O operations. Unless you are using SSDs (which I assume you are not, as that can be quite expensive), the hard disk will always be the weakest link. The most challenging workload in this regard is the file indexing process (the filewalker), which starts every time the node is restarted, along with the periodically running garbage-collection and expired-data-removal processes.
The problem
In general, the problem with the file traversal can be solved quite radically by simply disabling it completely, but such a solution may lead to the accumulation of unaccounted files, which will eventually force you to store unpaid data. I don't have information on how significant this effect is in practice, so let's assume the developers' recommendation to leave the traversal enabled stands.
The file traversal has two modes of operation: normal and lazy. The lazy option appeared relatively recently and performs file traversal with low priority, which theoretically should minimize the impact of the traversal on other processes, but not eliminate its influence completely. I would like to express my special thanks to the people who:
added the ability to disable the traversal completely
added the ability to perform lazy traversal
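For reference, both behaviors are controlled through storagenode configuration. Below is a minimal sketch, assuming the option names I recall at the time of writing; verify them against your storagenode version (for example with `storagenode setup --help`):

```shell
# Sketch: storagenode filewalker options (names may differ between versions).

# Option 1: disable the startup piece scan entirely
# (risk: unaccounted garbage may accumulate, as described above)
storagenode run --storage2.piece-scan-on-startup=false

# Option 2: keep the scan, but run it as a low-priority subprocess
storagenode run --pieces.enable-lazy-filewalker=true
```

The same keys can be set in config.yaml instead of being passed as command-line flags.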
Since the number of files on large nodes runs into the millions, a file traversal competing with other disk activity can take up to several days, and each time the node is restarted the traversal starts over from the beginning. Not a pleasant situation, especially if the node runs on your NAS and you suddenly want to watch a movie from it. In normal mode, the file traversal can slow the disk to a crawl.
So, the goal of my testing is to choose the file system for which the file traversal creates the fewest problems. File systems that allow moving the metadata to a separate storage device are considered as a separate category in this testing. This possibility is worth explaining in more detail: metadata includes the file name, its size, information about the physical location of the file on the hard disk, and other attributes. When an external metadata device is used, the file traversal (as well as all delete and lookup operations) is performed on that device, without touching the actual files on the hard disk. It is important to understand what happens when the storage device with the metadata fails…
Comparison of Meta Organization in ZFS and Bcachefs
In ZFS, metadata can be stored on a separate device, called a special device in ZFS terminology. If such a device is present in the pool, all metadata is written to it, and only to it, until its available space is exhausted; after that, metadata writes fall back to the hard disk. In RAIDZ configurations it is not possible to remove the special device from the pool, so be careful: once you add it, you won't be able to remove it without completely destroying the data. The special device can hold not only metadata but also all files smaller than a certain size, which is controlled by the special_small_blocks parameter.
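As a sketch of how this looks in practice (device names are placeholders; the special vdev should be mirrored, since losing it means losing the pool):

```shell
# Create a RAIDZ pool with a mirrored special vdev for metadata.
# Remember: in RAIDZ pools the special vdev cannot be removed later.
zpool create tank raidz /dev/sda /dev/sdb /dev/sdc \
    special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally also store all blocks up to 16K on the special vdev:
zfs set special_small_blocks=16K tank
```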
In bcachefs, there are several types of information, and among them, we are interested in the special metadata_target setting, which determines where the metadata will be written. It is possible to simultaneously write metadata to both the main device and the metadata_target. There is a separate option called durability=0, which allows you to specify that the device will be used exclusively as a cache. Additionally, there is a promote_target (similar to ARC) that operates independently. Thus, the ways of configuring and storing metadata in bcachefs are somewhat more flexible compared to ZFS.
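A sketch of an equivalent bcachefs setup, assuming placeholder device names and the target options as documented for `bcachefs format` (check your bcachefs version for exact option names):

```shell
# Two-device bcachefs: data foregrounded to the HDD, metadata pinned to
# the SSD, with the SSD also acting as a read cache (promote target).
bcachefs format \
    --label=hdd.hdd1 /dev/sdb \
    --label=ssd.ssd1 /dev/nvme0n1 \
    --foreground_target=hdd \
    --metadata_target=ssd \
    --promote_target=ssd

mount -t bcachefs /dev/sdb:/dev/nvme0n1 /mnt
```

Setting durability=0 on a device instead would mark it as cache-only, i.e. data on it is never the only copy.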
Testing methodology
I prepared a test bed in which I tried to reproduce identical conditions for all file systems. Unfortunately, at the time of testing it was not possible to run the experiments on the same kernel version, since bcachefs is built against kernel 6.7, while ZFS only compiles on kernels up to 6.6. The tests were conducted on the following platform:
- Operating System: Ubuntu 22.04.3 LTS
- Kernel: Linux 6.2.0-39-generic (ZFS) and Linux 6.7.0-rc4+ (bcachefs)
- Motherboard: Supermicro X9DRD-EF/A-UC014
- CPU: Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
- RAM: 64GB
- HDD: Seagate Exos X16 ST16000NM001G-2KK103
- SSD: QUMO Q3DT-128GMCY (meta)
- SSD: Silicon Motion ADATA SU800 (system)
In my testing, I used thin LVM partitions. Before each file system was tested, all existing partitions on the disk were destroyed and the disk was freshly partitioned. The partition size was set to 2400GB, and the partition was formatted with the file system under test (all file systems were created with default parameters). After that, the following operations were performed for each file system:
- A test node of approximately 1.9TB, containing around 8 million files, was copied from an external SSD drive using rsync.
- Testing was conducted on the remaining free space using FIO. The testing parameters were as follows:
fio --name=fiotest --filename=${DESTINATION_DIR}/fio.img --size=128GB --rw=randrw --bs=128K --direct=1 --rwmixread=25 --ioengine=libaio --iodepth=32 --group_reporting
- Measurement of the time it takes to calculate the size of a directory using "du".
- Measurement of the time it takes to output a list of all files using "find".
The last two steps were performed to imitate the filewalker as closely as possible.
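The last two measurements can be sketched as follows (NODE_DIR is a placeholder for the node's storage directory; the page cache is dropped before each run so the walk starts cold, which requires root):

```shell
NODE_DIR=/mnt/storagenode/storage   # placeholder path

# Drop the page cache so the metadata walk is not served from RAM:
sync && echo 3 > /proc/sys/vm/drop_caches
time du -sh "$NODE_DIR"                  # sums sizes by walking metadata

sync && echo 3 > /proc/sys/vm/drop_caches
time find "$NODE_DIR" -type f | wc -l    # lists every file by walking metadata
```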
Results
I present to you a comparison of the test results:
- file listing using du and find, result in seconds (lower is better)
- random read and write using fio, result in MiB/s (higher is better)
Conclusions
At the moment, the results indicate that bcachefs looks the most promising.
In the near future, I plan to experiment more with bcachefs and find the configuration that is optimal from my point of view.
Stay tuned.
I run a Telegram blog about Storj. Subscribe if you're interested.
Thank you for your time!