New comparison of file systems

first part
second part

Hello guys!
First of all, I would like to add a comment regarding the previous tests (thanks to the users on the forum who pointed this out): synthetic tests indeed do not show the whole picture. I would especially like to note that the results I obtained for ZFS are not representative (I believe they are not correct at all, since fio was run with forced data synchronization).

My Proxmox has been updated to kernel 6.8, which now includes a bcachefs module that even works out of the box, which means it’s time to test bcachefs without disrupting production and compare it with other solutions. It is known that a file system starts to slow down as it fills up, so I tried to run the tests with the disk almost full to take this aspect into account. For testing, I took a not very large old Seagate Barracuda 500GB hard drive and wrote approximately 450 gigabytes of data to it. I decided to simulate a typical profile of files generated by the storage node and to perform actions similar to file listing and file deletion on the different file systems. For this, I wrote a couple of scripts that generate junk files and simulate storage node behavior in terms of writing blobs.

Testing Description
The script generated a dataset description with a clean data volume of 449998717016 bytes (approximately 450 gigabytes, base 10), containing 896541 files.
The dataset description file in text form takes up ~90MB and roughly looks like this:
[{"folder": "a9", "filename": "dc7f4487dace400cc6afff1d720c3fc2ccca90318aebd74ae3a.sj1", "filesize": 41531}, {"folder": "ec", "filename": "bf51d38e5b64f31cd77a...
Each line of the file is an array of dictionaries describing files. The number of dictionaries in a line is the number of simultaneous write streams, i.e., the files from one line are created asynchronously, all at once. This simulates how the storage node fills up with data (yes, of course, deletions could have been simulated here as well, but I thought that would be unnecessary at this stage). The number of simultaneous files varies from 1 to 16, and the file sizes range from 4KB to 1MB.
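For illustration, a generator along these lines could look roughly like the sketch below. This is not the author’s actual script: the file name dataset.jsonl is an assumption, the two-hex-digit folder names and the .sj1 extension are taken from the example line above, and only the size and parallelism ranges come from the description.

# generate_dataset.py: hedged sketch of a dataset description generator
# (not the original script; "dataset.jsonl" is an assumed file name)
import json
import os
import random

TOTAL_BYTES = 449_998_717_016                 # target clean volume from the post
MIN_SIZE, MAX_SIZE = 4 * 1024, 1024 * 1024    # 4KB .. 1MB
MAX_STREAMS = 16                              # up to 16 simultaneous files per line

with open("dataset.jsonl", "w") as out:
    written = 0
    while written < TOTAL_BYTES:
        line = []
        for _ in range(random.randint(1, MAX_STREAMS)):
            size = random.randint(MIN_SIZE, MAX_SIZE)
            line.append({
                "folder": "%02x" % random.randrange(256),   # two-hex-digit folder, as in the example
                "filename": os.urandom(26).hex() + ".sj1",   # random hex name with the .sj1 extension
                "filesize": size,
            })
            written += size
        out.write(json.dumps(line) + "\n")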
A separate script reads the dataset file line by line and, for each array line, creates files of the specified sizes asynchronously. The code is written so that it waits until the files are “written” (or the FS reports that the files are written) to disk, after which the next line is read.
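A minimal sketch of such a writer, assuming the dataset.jsonl format from the previous snippet and a made-up target path; fsync() stands in here for “wait until the FS reports that the files are written”:

# write_dataset.py: hedged sketch of the line-by-line asynchronous writer
import asyncio
import json
import os

ROOT = "/mnt/filesystem_test/subvol-999-disk-0"    # assumed mount point of the FS under test

def write_one(entry):
    folder = os.path.join(ROOT, entry["folder"])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, entry["filename"]), "wb") as f:
        f.write(os.urandom(entry["filesize"]))
        f.flush()
        os.fsync(f.fileno())           # block until the FS reports the data as written

async def main():
    with open("dataset.jsonl") as dataset:
        for line in dataset:
            entries = json.loads(line)
            # all files of one line are created concurrently; only then is the next line read
            await asyncio.gather(*(asyncio.to_thread(write_one, e) for e in entries))

asyncio.run(main())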
So, we get the same set of blobs on every tested file system and more or less the same sequence of their creation; thus rsync definitely won’t “read the files alphabetically, smoothly moving the head from the beginning of the disk to the end” and will have to load the hard disk with random reads.

The following actions will be measured (a minimal timing sketch follows the list):

  • simple dataset file listing using find
  • dataset size calculation using du
  • dataset copying using rsync
  • dataset data deletion using rm -rf
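
For reference, a minimal timing harness for these four operations could look like the sketch below. The dataset path and the SSD target directory are assumptions, and in the actual runs a full reboot was done before each pass, which the sketch does not show.

# measure.py: hedged sketch of the measured operations, not the exact test harness
import subprocess
import time

DATASET = "/mnt/filesystem_test/subvol-999-disk-0"   # assumed dataset path
SSD_TARGET = "/mnt/ssd_copy_target"                  # assumed rsync destination on the fast SSD

COMMANDS = {
    "find":  ["find", DATASET, "-type", "f"],
    "du":    ["du", "-s", DATASET],
    "rsync": ["rsync", "-a", DATASET + "/", SSD_TARGET + "/"],
    "rm":    ["rm", "-rf", DATASET],
}

for name, cmd in COMMANDS.items():       # rm runs last, after the copy
    start = time.monotonic()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    print(f"{name}: {time.monotonic() - start:.1f} s")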

Tested file system configurations:

  • ZFS with default settings on a single disk with a special device (ashift=12)
  • ZFS with default settings on a single disk (ashift=12)
  • Bcachefs with default settings and two metadata replicas, one of them on an SSD (--metadata_replicas=2 and --data_allowed=btree)
  • Bcachefs with default settings on a single disk (FS imported without SSD)
  • Ext4 with formatting parameters -E lazy_itable_init=0,lazy_journal_init=0

Test bench:

  • System drive QUMO Novation Q3DT-128GMCY 128 GB
  • SSD disk for metadata QUMO Novation Q3DT-128GMCY 128GB
  • Main hard drive for testing Seagate Barracuda 7200.14 (AF) ST500DM002-1BD142 500 GB
  • SSD disk for copying data to (obviously faster than HDD) Micron 5200 MTFDDAK3T8TDC 3.84 TB
  • Linux kernel 6.8.4-3-pve (2024-05-02T11:55Z)
  • Operating system pve-manager/8.2.2/9355359cd7afbae4
  • Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
  • DDR4 memory 4x32GB (M386A4G40DM0-CPB)
  • Motherboard Machinist X99 MR9A
  • LSI 9216i (IT) controller

The following command was used for formatting in bcachefs:

bcachefs format \
  --data_allowed=journal,btree,user --label=hdd.drive1 /dev/sda \
  --data_allowed=btree --label=special.drive1 /dev/sdb \
  --metadata_replicas=2

Subsequently, to test the same dataset without an SSD, I mounted this device with the degraded option:
mount -o degraded -t bcachefs /dev/sda /mnt/filesystem_test

A regular dataset on ZFS was created with the command: zfs create filesystem_test/subvol-999-disk-0
For bcachefs, an analogous subvolume was created for the dataset: bcachefs subvolume create /mnt/filesystem_test/subvol-999-disk-0
For ext4, just a folder was used ¯\_(ツ)_/¯

Results of the testing


[results table: values in seconds (less is better)]

Conclusions
No matter how you construct synthetic tests, in reality everything will be completely different. In particular, I am very surprised by ext4’s performance when deleting data; it seems to me that the files are not fragmented enough, so now I will also need to simulate data fragmentation…

Bcachefs looks very promising, and I will definitely move several nodes to it. I am confident that I will now finally be able to turn on the filewalker without any issues, as file traversal takes practically no time.

Subscribe to my Telegram: @temporary_name_here
Stay tuned…

10 Likes

Very good attempt! There is one more thing I would consider in order to simulate garbage collection, but the numbers so far pretty much match my expectations: list the files with a stat call on every 10th file, and move every second stat()ed file to the trash. You could probably just reuse your list of files in the JSON format, sort it by directory/file name, and run the necessary operations sequentially.
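A hedged sketch of that idea, reusing the dataset description in JSON lines format (the file name dataset.jsonl, the dataset root and the trash location are assumptions):

# gc_simulation.py: sketch of the proposed garbage-collection-like pass:
# stat() every 10th file and move every second stat()ed file to a trash folder
import json
import os
import shutil

ROOT = "/mnt/filesystem_test/subvol-999-disk-0"   # assumed dataset root
TRASH = "/mnt/filesystem_test/trash"              # assumed trash location
os.makedirs(TRASH, exist_ok=True)

entries = []
with open("dataset.jsonl") as dataset:
    for line in dataset:
        entries.extend(json.loads(line))

# walk the files sequentially in directory/file name order
entries.sort(key=lambda e: (e["folder"], e["filename"]))

statted = 0
for i, entry in enumerate(entries):
    if i % 10 != 0:
        continue                                  # stat only every 10th file
    path = os.path.join(ROOT, entry["folder"], entry["filename"])
    os.stat(path)
    statted += 1
    if statted % 2 == 0:                          # every second stat()ed file goes to trash
        shutil.move(path, os.path.join(TRASH, entry["filename"]))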

Was this ext4 on top of a zfs dataset?

Please also try two more scenarios:

  1. 128-byte inodes, i.e. mkfs.ext4 -I 128.
  2. Regular ext4 with LVMcache, which would be the closest equivalent to the SSD special vdev.

It’s not that they aren’t fragmented. ext[234] put a lot of effort into placing the inodes of a single directory close together, which then makes any operation on a whole directory faster. That is not easily possible for any CoW file system.

1 Like

yes please. :+1: :+1:

1 Like

I would like to recommend using our benchmarking tool, because it emulates a storagenode much more precisely:

1 Like

it was directly on the hard drive

I will try

I’m not sure that this will give anything, because for that the cache needs to be warmed up. This is not provided for in my test; before each pass I did a full system reboot. In any case, I don’t know how to warm up the cache “approximately the same for all file systems”, because cached data gets evicted, and this happens in different ways: MFU, LRU, and just eviction for no apparent reason in the case of L2ARC.

hmm interesting, I guess I’ll try this tool. However, in my opinion, writing data is not a bottleneck for any of the file systems in the case of Storj (at least for me, since I turned off sync). It’s much more interesting to simulate reads (since those are expensive synchronous operations) and the filewalker + deletions, because these operations are also expensive: they happen en masse and over a long period of time.

1 Like

It’s doing so too, just try it :slight_smile:

1 Like

Before we dive too deep into reading your results, I have some questions about the testing methodology.

What exotic SSD is that? Does it perform similarly to a normal drive? Is it QLC?
Can’t you use a normal drive?

Why is that needed, when you only want to test

Odd OS choice for a benchmark, since you have to reinstall the OS for these tests. Debian/Ubuntu would be easier, but to each their own. Just to be clear, we are not testing VMs, right? You run these commands directly in the Proxmox shell?

Is that ok with the forum rules @Alexey? To me, this looks like some very strange promotion that does not belong in this forum.

LVMcache is a permanent cache; reboots won’t affect it. Though, indeed, you point out an important thing: prewarming should be part of your testing procedure as well, in all cases.

When I was doing my benchmarks, I repeated each test sequence 11 times and dropped the first result. If we can assume that the test sequence reflects regular node operations, then after one sequence the cache should reach the state it would be in during regular node operations.

On the Storj benchmarking tool:

1 Like

it is a very cheap drive. It looks like it’s some kind of Russian brand. I can’t use normal drives because my primary goal is the cheapest setup ever

I tested copying from the hard drive (rsync), so for the purity of the experiment I need to copy the files to some device that is obviously faster than the hard drive can read

it’s Debian. All my nodes run on Proxmox, and I do not plan to change the distribution. For the tests I used a clean installation from scratch

right

yes

ok I see. Is it possible to cache only metadata?
I clearly understand that I don’t want to cache blobs; it just wears out the disks for nothing

No, it’s a full block device cache.

1 Like

then this is not my case, sorry

1 Like

The cheapest setup is using only unused hardware and not buying anything at all :slight_smile:
If it is QLC, or has a lower TBW or warranty, it is not even really cheaper. TCO is more than the initial price.

Why not just test reads, when you wanna test reads? I fail to see the need for another drive to write to.

Great.

How do you come to this conclusion?

Looking at your table, to me it seems like zfs + special vdev was the fastest.

It was unused hardware :slight_smile:

I just know it

It is too expensive for me because it needs at least one mirrored drive for the special vdev. Otherwise, I may lose all the data on the server

1 Like

Proxmox is Debian with a modified kernel and a few applications added; they say this pretty explicitly.

2 Likes

You may report such posts. However, I’m not sure that a social alias counts as an ad. Still, perhaps it’s not needed in the post.
@syncamide could you please remove this mention? You may place it in your profile instead, and anyone who wants to follow you would be able to do so.

table updated

no, I can’t
in fact, the article was written precisely because of this link.
If I’m still breaking the rules, please delete the entire thread

1 Like

Yeah, I know, I just wanted to make sure he is not testing in VMs, because his last two tests were pretty wonky in my opinion.

You wrote it only as an advert? Urghhh…

4 Likes

General Notice:

The Telegram group listed in this thread is not affiliated with or endorsed by Storj

3 Likes

I’ll be like a триван

zfsatemyram(.)ru

I think you can find a metadata_only option