Hello guys!
First of all, a comment on the previous tests (thanks to the forum users who pointed this out): synthetic benchmarks really do not show the whole picture. In particular, the ZFS results I obtained earlier are not very representative (I now believe they are simply wrong, since fio was run with forced data synchronization enabled).
My Proxmox has been updated to kernel 6.8, which ships a bcachefs module that works out of the box, so it's time to test bcachefs without disrupting production and compare it with other solutions. It is well known that a file system starts to slow down as it fills up, so I tried to run the tests with the disk almost full to take this into account. For testing I took an old, not very large Seagate Barracuda 500GB hard drive and wrote approximately 450 gigabytes of data onto it. I decided to simulate a typical profile of files produced by the storage and to perform operations similar to file listing and file deletion on the different file systems. For this I wrote a couple of scripts that generate junk files and simulate the storage's behavior when writing blobs.
Testing Description
The script generated a dataset description with a net data volume of 449998717016 bytes (approximately 450 decimal gigabytes), containing 896541 files.
The dataset description file is about 90 MB of text and looks roughly like this:
[{"folder": "a9", "filename": "dc7f4487dace400cc6afff1d720c3fc2ccca90318aebd74ae3a.sj1", "filesize": 41531}, {"folder": "ec", "filename": "bf51d38e5b64f31cd77a...
Each line of the file is a JSON array of dictionaries describing files. The number of dictionaries on a line is the number of simultaneous write streams, i.e. the files from one line are created asynchronously, all at once; this simulates how the storage fills up with data (yes, I could have added deletion to the simulation as well, but at this stage it seemed unnecessary). The number of simultaneous files varies from 1 to 16, and the file sizes range from 4KB to 1MB. A sketch of such a generator is shown below.
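I am not publishing my exact script, but a minimal generator along these lines could produce such a description file. The .sj1 extension follows the example line above; the name pattern, the helper names, and the uniform size/stream distributions are my assumptions for illustration:

import json
import random
import uuid

TARGET_BYTES = 449_998_717_016          # ~450 GB of payload
MIN_SIZE, MAX_SIZE = 4 * 1024, 1024 * 1024   # 4KB .. 1MB
MAX_STREAMS = 16

def make_entry():
    # a two-hex-character folder plus a random 51-character name, as in the example line
    return {
        "folder": uuid.uuid4().hex[:2],
        "filename": (uuid.uuid4().hex + uuid.uuid4().hex)[:51] + ".sj1",
        "filesize": random.randint(MIN_SIZE, MAX_SIZE),
    }

def generate(path="dataset.json", target=TARGET_BYTES):
    written = 0
    with open(path, "w") as out:
        while written < target:
            # one line = one batch of files that will later be written concurrently
            batch = [make_entry() for _ in range(random.randint(1, MAX_STREAMS))]
            written += sum(e["filesize"] for e in batch)
            out.write(json.dumps(batch) + "\n")

if __name__ == "__main__":
    generate()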
A separate script reads the dataset file line by line and creates the files of the specified sizes from each array in asynchronous mode. The code is written so that it waits until the files are "written" (or rather, until the FS reports that they are written) to disk, and only then reads the next line. A sketch of such a writer follows.
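Again, this is only a sketch of how such a writer could look, assuming fsync() is what counts as "the FS reports the file is written"; my actual script may differ in details:

import asyncio
import json
import os
import sys

async def write_file(root, entry):
    # one write stream: create the folder, write random bytes of the given size, fsync
    path = os.path.join(root, entry["folder"], entry["filename"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    def _write():
        with open(path, "wb") as f:
            f.write(os.urandom(entry["filesize"]))
            f.flush()
            os.fsync(f.fileno())
    await asyncio.to_thread(_write)

async def replay(dataset, root):
    with open(dataset) as f:
        for line in f:
            batch = json.loads(line)
            # all files of one line are written concurrently; the next line is not
            # read until every file of the current batch is reported as on disk
            await asyncio.gather(*(write_file(root, e) for e in batch))

if __name__ == "__main__":
    asyncio.run(replay(sys.argv[1], sys.argv[2]))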
So we have the same set of blobs on each tested file system and more or less the same order of their creation; rsync therefore definitely won't be able to "read the files alphabetically, smoothly moving the head from the beginning of the disk to the end" and will have to load the hard disk with random reads.
The following actions will be measured (a simple timing harness is sketched after the list):
- simple dataset file listing using find
- dataset size calculation using du
- dataset copying using rsync
- dataset data deletion using rm -rf
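The article does not list the exact invocations, so the harness below is only an assumption of how the four operations can be timed; the paths, the rsync flags, and the cache-dropping step (which needs root) are illustrative:

import subprocess
import time

SRC = "/mnt/filesystem_test/subvol-999-disk-0"   # dataset under test (example path)
DST = "/mnt/ssd_target/"                         # copy target on the fast SSD (example path)

def timed(cmd):
    # drop the page cache, run the command through the shell, return wall-clock seconds
    subprocess.run("sync; echo 3 > /proc/sys/vm/drop_caches", shell=True, check=True)
    start = time.monotonic()
    subprocess.run(cmd, shell=True, check=True)
    return time.monotonic() - start

for name, cmd in [
    ("find",  f"find {SRC} -type f > /dev/null"),
    ("du",    f"du -sh {SRC}"),
    ("rsync", f"rsync -a {SRC}/ {DST}"),
    ("rm",    f"rm -rf {SRC}"),
]:
    print(f"{name}: {timed(cmd):.1f} s")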
Types of tested file system sets:
- ZFS with default settings on a single disk plus a special device on the SSD (ashift=12)
- ZFS with default settings on a single disk (ashift=12)
- Bcachefs with default settings and two metadata replicas, one of them on an SSD (--metadata_replicas=2 and --data_allowed=btree for the SSD)
- Bcachefs with default settings on a single disk (the same FS mounted degraded, without the SSD)
- Ext4 with formatting parameters -E lazy_itable_init=0,lazy_journal_init=0
Test bench:
- System drive QUMO Novation Q3DT-128GMCY 128 GB
- SSD disk for metadata QUMO Novation Q3DT-128GMCY 128GB
- Main hard drive for testing Seagate Barracuda 7200.14 (AF) ST500DM002-1BD142 500 GB
- SSD disk for copying data to (obviously faster than HDD) Micron 5200 MTFDDAK3T8TDC 3.84 TB
- Linux kernel 6.8.4-3-pve (2024-05-02T11:55Z)
- Operating system pve-manager/8.2.2/9355359cd7afbae4
- Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
- DDR4 memory 4x32GB (M386A4G40DM0-CPB)
- Motherboard Machinist X99 MR9A
- LSI 9216i (IT) controller
The following command was used for formatting in bcachefs:
bcachefs format \
  --data_allowed=journal,btree,user --label=hdd.drive1 /dev/sda \
  --data_allowed=btree --label=special.drive1 /dev/sdb \
  --metadata_replicas=2
Subsequently, to test the same dataset without an SSD, I mounted this device with the degraded option:
mount -o degraded -t bcachefs /dev/sda /mnt/filesystem_test
A regular dataset on ZFS was created with the command: zfs create filesystem_test/subvol-999-disk-0
For bcachefs, an analogous subvolume was created for the dataset: bcachefs subvolume create /mnt/filesystem_test/subvol-999-disk-0
For ext4, just a plain directory was used ¯\_(ツ)_/¯
Results of the testing
Values in seconds (lower is better).
Conclusions
No matter how you construct synthetic tests, in reality everything will turn out differently anyway. In particular, I am very surprised by ext4's performance when deleting data; it seems to me the files are simply not fragmented enough, so now I will also need to simulate data fragmentation…
Bcachefs looks very promising, and I will definitely move several nodes over to it. I am confident that I will finally be able to enable the file walker without any issues, since file traversal takes practically no time.
Subscribe to my Telegram: @temporary_name_here
Stay tuned…