New comparison of file systems

first part
second part

Hello guys!
First of all, I would like to add a comment regarding the previous tests (thanks to the users on the forum who pointed this out): synthetic tests indeed do not show the whole picture. I would especially like to note that the results I obtained for ZFS are not representative (I believe they are not correct at all, since fio was run with forced data synchronization).

My Proxmox has been updated to kernel 6.8, which now includes a bcachefs module that even works out of the box, which means it’s time to test bcachefs without disrupting production and compare it with other solutions. It is known that a file system starts to slow down as it fills up, so I tried to run the tests with the disk almost full to take this aspect into account. For testing, I took a not very large old Seagate Barracuda 500GB hard drive and wrote approximately 450 gigabytes of data to it. I decided to simulate a typical profile of files generated by the storage node and to perform actions similar to file listing and file deletion on the different file systems. For this, I wrote a couple of scripts that generate junk files and simulate storage node behavior in terms of writing blobs.

Testing Description
The script generated a dataset description with a clean data volume of 449998717016 bytes (approximately 450 gigabytes, base 10), containing 896541 files.
The dataset description file in text form takes up ~90MB and roughly looks like this:
[{"folder": "a9", "filename": "dc7f4487dace400cc6afff1d720c3fc2ccca90318aebd74ae3a.sj1", "filesize": 41531}, {"folder": "ec", "filename": "bf51d38e5b64f31cd77a...
Each line of the file is an array of dictionaries describing files. The number of dictionaries in a line is the number of simultaneous write streams, i.e., the files from one line are created asynchronously, all at once. This simulates how the storage node fills up with data (yes, of course, deletions could have been simulated here as well, but I thought that would be unnecessary at this stage). The number of simultaneous files varies from 1 to 16, and the file sizes range from 4KB to 1MB.
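For illustration, a generator along these lines could look roughly like the sketch below. This is not the author’s actual script: the file name dataset.jsonl is an assumption, the two-hex-digit folder names and the .sj1 extension are taken from the example line above, and only the size and parallelism ranges come from the description.

# generate_dataset.py: hedged sketch of a dataset description generator
# (not the original script; "dataset.jsonl" is an assumed file name)
import json
import os
import random

TOTAL_BYTES = 449_998_717_016                 # target clean volume from the post
MIN_SIZE, MAX_SIZE = 4 * 1024, 1024 * 1024    # 4KB .. 1MB
MAX_STREAMS = 16                              # up to 16 simultaneous files per line

with open("dataset.jsonl", "w") as out:
    written = 0
    while written < TOTAL_BYTES:
        line = []
        for _ in range(random.randint(1, MAX_STREAMS)):
            size = random.randint(MIN_SIZE, MAX_SIZE)
            line.append({
                "folder": "%02x" % random.randrange(256),   # two-hex-digit folder, as in the example
                "filename": os.urandom(26).hex() + ".sj1",   # random hex name with the .sj1 extension
                "filesize": size,
            })
            written += size
        out.write(json.dumps(line) + "\n")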
A separate script reads the dataset file line by line and, for each array line, creates files of the specified sizes asynchronously. The code is written so that it waits until the files are “written” (or the FS reports that the files are written) to disk, after which the next line is read.
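A minimal sketch of such a writer, assuming the dataset.jsonl format from the previous snippet and a made-up target path; fsync() stands in here for “wait until the FS reports that the files are written”:

# write_dataset.py: hedged sketch of the line-by-line asynchronous writer
import asyncio
import json
import os

ROOT = "/mnt/filesystem_test/subvol-999-disk-0"    # assumed mount point of the FS under test

def write_one(entry):
    folder = os.path.join(ROOT, entry["folder"])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, entry["filename"]), "wb") as f:
        f.write(os.urandom(entry["filesize"]))
        f.flush()
        os.fsync(f.fileno())           # block until the FS reports the data as written

async def main():
    with open("dataset.jsonl") as dataset:
        for line in dataset:
            entries = json.loads(line)
            # all files of one line are created concurrently; only then is the next line read
            await asyncio.gather(*(asyncio.to_thread(write_one, e) for e in entries))

asyncio.run(main())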
So, we get the same set of blobs on every tested file system and more or less the same sequence of their creation; thus rsync definitely won’t “read the files alphabetically, smoothly moving the head from the beginning of the disk to the end” and will have to load the hard disk with random reads.

The following actions will be measured (a minimal timing sketch follows the list):

  • simple dataset file listing using find
  • dataset size calculation using du
  • dataset copying using rsync
  • dataset data deletion using rm -rf
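
For reference, a minimal timing harness for these four operations could look like the sketch below. The dataset path and the SSD target directory are assumptions, and in the actual runs a full reboot was done before each pass, which the sketch does not show.

# measure.py: hedged sketch of the measured operations, not the exact test harness
import subprocess
import time

DATASET = "/mnt/filesystem_test/subvol-999-disk-0"   # assumed dataset path
SSD_TARGET = "/mnt/ssd_copy_target"                  # assumed rsync destination on the fast SSD

COMMANDS = {
    "find":  ["find", DATASET, "-type", "f"],
    "du":    ["du", "-s", DATASET],
    "rsync": ["rsync", "-a", DATASET + "/", SSD_TARGET + "/"],
    "rm":    ["rm", "-rf", DATASET],
}

for name, cmd in COMMANDS.items():       # rm runs last, after the copy
    start = time.monotonic()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    print(f"{name}: {time.monotonic() - start:.1f} s")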

Tested file system configurations:

  • ZFS with default settings on a single disk with a special device (ashift=12)
  • ZFS with default settings on a single disk (ashift=12)
  • Bcachefs with default settings and two metadata replicas, one of them on an SSD (--metadata_replicas=2 and --data_allowed=btree)
  • Bcachefs with default settings on a single disk (FS imported without SSD)
  • Ext4 with formatting parameters -E lazy_itable_init=0,lazy_journal_init=0

Test bench:

  • System drive QUMO Novation Q3DT-128GMCY 128 GB
  • SSD disk for metadata QUMO Novation Q3DT-128GMCY 128GB
  • Main hard drive for testing Seagate Barracuda 7200.14 (AF) ST500DM002-1BD142 500 GB
  • SSD disk for copying data to (obviously faster than HDD) Micron 5200 MTFDDAK3T8TDC 3.84 TB
  • Linux kernel 6.8.4-3-pve (2024-05-02T11:55Z)
  • Operating system pve-manager/8.2.2/9355359cd7afbae4
  • Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
  • DDR4 memory 4x32GB (M386A4G40DM0-CPB)
  • Motherboard Machinist X99 MR9A
  • LSI 9216i (IT) controller

The following command was used for formatting in bcachefs:

bcachefs format \
  --data_allowed=journal,btree,user --label=hdd.drive1 /dev/sda \
  --data_allowed=btree --label=special.drive1 /dev/sdb \
  --metadata_replicas=2

Subsequently, to test the same dataset without an SSD, I mounted this device with the degraded option:
mount -o degraded -t bcachefs /dev/sda /mnt/filesystem_test

A regular dataset on ZFS was created with the command: zfs create filesystem_test/subvol-999-disk-0
For bcachefs, an analogous subvolume was created for the dataset: bcachefs subvolume create /mnt/filesystem_test/subvol-999-disk-0
For ext4, just a folder was used ¯\_(ツ)_/¯

Results of the testing


[results table: values in seconds (less is better)]

Conclusions
No matter how you construct synthetic tests, in reality everything will be completely different. In particular, I am very surprised by ext4’s performance when deleting data; it seems to me that the files are not fragmented enough, so now I will also need to simulate data fragmentation…

Bcachefs looks very promising, and I will definitely move several nodes to it. I am confident that I will now finally be able to turn on the filewalker without any issues, as file traversal takes practically no time.

Subscribe to my Telegram: @temporary_name_here
Stay tuned…

10 Likes

Very good attempt! There is one more thing I would consider in order to simulate garbage collection, but the numbers so far pretty much match my expectations: list the files with a stat call on every 10th file, and move every second stat()ed file to the trash. You could probably just reuse your list of files in the JSON format, sort it by directory/file name, and run the necessary operations sequentially.
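A hedged sketch of that idea, reusing the dataset description in JSON lines format (the file name dataset.jsonl, the dataset root and the trash location are assumptions):

# gc_simulation.py: sketch of the proposed garbage-collection-like pass:
# stat() every 10th file and move every second stat()ed file to a trash folder
import json
import os
import shutil

ROOT = "/mnt/filesystem_test/subvol-999-disk-0"   # assumed dataset root
TRASH = "/mnt/filesystem_test/trash"              # assumed trash location
os.makedirs(TRASH, exist_ok=True)

entries = []
with open("dataset.jsonl") as dataset:
    for line in dataset:
        entries.extend(json.loads(line))

# walk the files sequentially in directory/file name order
entries.sort(key=lambda e: (e["folder"], e["filename"]))

statted = 0
for i, entry in enumerate(entries):
    if i % 10 != 0:
        continue                                  # stat only every 10th file
    path = os.path.join(ROOT, entry["folder"], entry["filename"])
    os.stat(path)
    statted += 1
    if statted % 2 == 0:                          # every second stat()ed file goes to trash
        shutil.move(path, os.path.join(TRASH, entry["filename"]))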

Was this ext4 on top of a zfs dataset?

Please also try two more scenarios:

  1. 128-byte inodes, i.e. mkfs.ext4 -I 128.
  2. Regular ext4 with LVMcache, which would be the closest equivalent to the SSD special vdev.

It’s not that they aren’t fragmented. ext[234] put a lot of effort into placing the inodes of a single directory close together, which then makes any operation on a whole directory faster. That is not easily possible for any CoW file system.

1 Like

yes please. :+1: :+1:

1 Like

I would like to recommend using our benchmarking tool, because it emulates a storagenode much more precisely:

1 Like

it was directly on the hard drive

I will try

I’m not sure that this will give anything, because for that the cache needs to be warmed up. This is not provided for in my test; before each pass I did a full system reboot. In any case, I don’t know how to warm up the cache “approximately the same for all file systems”, because cached data gets evicted, and this happens in different ways: MFU, LRU, and just eviction for no apparent reason in the case of L2ARC.

hmm interesting, I guess I’ll try this tool. However, in my opinion, writing data is not a bottleneck for any of the file systems in the case of Storj (at least for me, since I turned off sync). It’s much more interesting to simulate reads (since those are expensive synchronous operations) and the filewalker + deletions, because these operations are also expensive: they happen en masse and over a long period of time.

1 Like

It’s doing so too, just try it :slight_smile:

1 Like

Before we dive too deep into reading your results, I have some questions about the testing methodology.

What exotic SSD is that? Does it perform similarly to a normal drive? Is it QLC?
Can’t you use a normal drive?

Why is that needed, when you only want to test

Odd OS choice for a benchmark, since you have to reinstall the OS for these tests. Debian/Ubuntu would be easier, but to each their own. Just to be clear, we are not testing VMs, right? You run these commands directly in the Proxmox shell?

Is that ok with the forum rules @Alexey? To me, this looks like some very strange promotion that does not belong in this forum.

LVMcache is a permanent cache; reboots won’t affect it. Though, indeed, you point out an important thing: prewarming should be part of your testing procedure as well, in all cases.

When I was doing my benchmarks, I repeated each test sequence 11 times and dropped the first result. If we can assume that the test sequence reflects regular node operations, then after one sequence the cache should reach the state it would be in during regular node operations.

On the Storj benchmarking tool:

1 Like

it is a very cheap drive. It looks like it’s some kind of Russian brand. I can’t use normal drives because my primary goal is the cheapest setup ever

I tested copying from the hard drive (rsync), so for the purity of the experiment I need to copy the files to some device that is obviously faster than the hard drive can read

it’s Debian. All my nodes run on Proxmox, and I do not plan to change the distribution. For the tests I used a clean installation from scratch

right

yes

ok I see. Is it possible to cache only metadata?
I clearly understand that I don’t want to cache blobs; it just wears out the disks for nothing

No, it’s a full block device cache.

1 Like

then this is not my case, sorry

1 Like

The cheapest setup is using only unused hardware and not buying anything at all :slight_smile:
If it is QLC, or has a lower TBW or warranty, it is not even really cheaper. TCO is more than the initial price.

Why not just test reads, when you wanna test reads? I fail to see the need for another drive to write to.

Great.

How do you come to this conclusion?

Looking at your table, to me it seems like zfs + special vdev was the fastest.

It was unused hardware :slight_smile:

I just know it

It is too expensive for me because it needs at least one mirrored drive for the special vdev. Otherwise, I may lose all the data on the server

1 Like

Proxmox is Debian with a modified kernel and a few applications added; they say this pretty explicitly.

2 Likes

You may report such posts. However, I’m not sure that a social alias counts as an ad. Still, perhaps it’s not needed in the post.
@syncamide could you please remove this mention? You may place it in your profile instead, and anyone who wants to follow you would be able to do so.

table updated

no, I can’t
in fact, the article was written precisely because of this link.
If I’m still breaking the rules, please delete the entire thread

1 Like

Yeah, I know, I just wanted to make sure he is not testing in VMs, because his last two tests were pretty wonky in my opinion.

You wrote it only as an advert? Urghhh…

4 Likes

General Notice:

The Telegram group listed in this thread is not affiliated with or endorsed by Storj

3 Likes

I’ll be like a триван

zfsatemyram(.)ru

I think you can find a metadata_only option