Best filesystem for storj

syncamide · December 28, 2023, 3:06pm

Hello guys!
I decided to conduct some research to choose the optimal file system in terms of the performance/cost ratio per 1 TB. The series of articles is a summary of my personal experience operating storj nodes and does not claim to be the ultimate truth, but rather represents my personal opinion.
The storj node itself is a lightweight application and under normal/typical circumstances, it does not consume a large amount of resources in terms of CPU/RAM. However, even in normal situations, the node can perform an incredibly large number of input-output operations. As long as you are not using SSD disks (which I assume you are not, as it can be quite expensive), the hard disk will always be the weakest link. The most challenging aspect in this regard is the file indexing process (filewalker), which starts every time the node is restarted, as well as periodically running processes for garbage collection and removal of expired data.

The problem
In general, the problem with the file traversal can be solved quite radically by simply disabling it completely, but such a solution may lead to the accumulation of unaccounted files, which will eventually force you to store unpaid data. I don’t have information on how relevant this statement is, so let’s assume that the developers’ recommendations suggest leaving the traversal enabled.
The file traversal has two modes of operation: normal and lazy. The lazy option appeared relatively recently and performs file traversal with low priority, which theoretically should minimize the impact of the traversal on other processes, but not eliminate its influence completely. I would like to express my special thanks to the people who:
added the ability to disable the traversal completely
added the ability to perform lazy traversal
Since on large nodes the number of files is counted in millions, file traversal in the background of other disk activities can take up to several days, and each time the node is restarted, the traversal process will start from the beginning. Not a pleasant situation, especially if the node is running on your NAS and you suddenly want to watch a movie from it. The file traversal in normal mode can completely slow down the disk.
So, the goal of my testing is to choose a file system for which the file traversal creates the fewest problems. In this testing, file systems that allow you to move the metadata to a separate storage are considered as a separate position. It is worth explaining this possibility in more detail: metadata includes the file name, its size, information about the physical location of the file on the hard disk, and other attributes. In case of using an external device with metadata, the file traversal (as well as all operations for deleting and searching files) will be performed on this device, without affecting the actual files on the hard disk. It is important to understand that when the storage device with the metadata fails…

Comparison of Meta Organization in ZFS and Bcachefs
In ZFS, metadata is stored on a separate device, which is called a special device in ZFS terminology. If this device is present in the pool, all metadata is written to it and only to it, until the available space is fully exhausted. After that, the metadata writes switch back to the hard disk. In raidZ configurations, it is not possible to remove the special device from the pool, so be careful - once you add it, you won’t be able to remove it without completely destroying the information. On the special device, you can write not only metadata, but also all files smaller than a certain size, which is determined by the special_small_blocks parameter.
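The special-device setup described above can be sketched with two standard OpenZFS commands. This is only a sketch under assumptions: the function name `add_special_mirror`, the pool name, the device paths and the use of a two-way mirror are all hypothetical and are not taken from the test bed in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add_special_mirror POOL SSD_A SSD_B SMALL_BLOCKS
# Pool name, device paths and the mirror layout are illustrative assumptions.
add_special_mirror() {
    local pool=$1 ssd_a=$2 ssd_b=$3 small_blocks=$4

    # Add a mirrored special vdev; new metadata is then allocated on it.
    zpool add "$pool" special mirror "$ssd_a" "$ssd_b" || return 1

    # Also send data blocks up to this size to the special vdev.
    # This only affects blocks written after the property is set.
    zfs set special_small_blocks="$small_blocks" "$pool"
}
```

A call could look like `add_special_mirror tank /dev/sdx /dev/sdy 64K`. Since the special vdev cannot be removed from a raidZ pool afterwards, it is worth trying this on a throwaway pool first.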

In bcachefs, there are several types of information, and among them, we are interested in the special metadata_target setting, which determines where the metadata will be written. It is possible to simultaneously write metadata to both the main device and the metadata_target. There is a separate option called durability=0, which allows you to specify that the device will be used exclusively as a cache. Additionally, there is a promote_target (similar to ARC) that operates independently. Thus, the ways of configuring and storing metadata in bcachefs are somewhat more flexible compared to ZFS.

Methodology of testing.
I have prepared a test bed in which I tried to reproduce identical conditions for all file systems. Unfortunately, at the time of testing, it was not possible to conduct experiments on the same kernel version, since bcachefs is being built on kernel 6.7, while ZFS is only compiled on kernel 6.6. The tests were conducted on the following platform:

Operating System: Ubuntu 22.04.3 LTS
Kernel: Linux 6.2.0-39-generic (ZFS) and Linux 6.7.0-rc4+ (bcachefs)
Motherboard: Supermicro X9DRD-EF/A-UC014
CPU: Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
RAM: 64GB
HDD: Seagate Exos X16 ST16000NM001G-2KK103
SSD: QUMO Q3DT-128GMCY (meta)
SSD: Silicon Motion ADATA SU800 (system)

In my testing, I used thin LVM partitions. For each file system on the disk, all existing partitions were destroyed beforehand, and the disk was then freshly partitioned. The partition size was set to 2400GB, and the partition was formatted with the desired file system (all file systems were created with default parameters). After that, the following operations were performed for each file system:

A test node of approximately 1.9TB, containing around 8 million files, was copied from an external SSD drive using rsync.
Testing was conducted on the remaining free space using FIO. The testing parameters were as follows:
fio --name=fiotest --filename=${DESTINATION_DIR}/fio.img --size=128GB --rw=randrw --bs=128K --direct=1 --rwmixread=25 --ioengine=libaio --iodepth=32 --group_reporting
Measurement of the time it takes to calculate the size of a directory using ‘du’.
Measurement of the time it takes to output a list of all files using ‘find’.

The last two steps were performed to imitate file walking as much as possible.
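The du and find steps can be sketched as a small bash helper. This is not the script used for the results in this post: the function name `walk_timings` is hypothetical, and the exact du/find flags used in the test are not stated, so the ones below are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: walk_timings DIR
# Times a metadata-heavy walk of DIR with du and find (whole seconds).
walk_timings() {
    local dir=$1 start end count

    start=$(date +%s)
    du -s "$dir" > /dev/null               # stats every file to sum up sizes
    end=$(date +%s)
    echo "du_seconds=$((end - start))"

    start=$(date +%s)
    count=$(find "$dir" -type f | wc -l)   # lists every regular file
    end=$(date +%s)
    echo "find_seconds=$((end - start))"
    echo "files=$((count))"
}
```

For example, `walk_timings /mnt/node/storage/blobs` (a hypothetical path) prints the two durations and the number of files found. Neither command reads file contents, which is why they are a reasonable stand-in for the filewalker.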

Results
I present to you a comparison of the test results:

file listing using du and find, result in seconds (lower is better)

random read and write using fio, result in MiB/sec (higher is better)

Conclusions.
At the moment, the results indicate that bcachefs looks the most promising.
In the near future, I plan to experiment more with bcachefs and find the configuration that is optimal from my point of view.
Stay tuned.

I run a Telegram blog about Storj. Subscribe if you find it necessary.
Thank you for your time!

Roxor · December 28, 2023, 5:15pm

ZFS metadata certainly can be on a separate device. But it’s not required, nor the default config.

Doesn’t the special device start storing on the main HDDs once it’s 75% full?

You can still remove/swap the disks the special device is composed of, I believe. Like if you had a 3-way mirror for the special device you could pull a SSD out for maintenance or something?

This is an interesting comparison! I’m more familiar with ZFS, but always like reading about other options.

Since ZFS will cache metadata in ARC… is it possible that spending money on RAM (instead of a dedicated-and-probably-mirrored SSD metadata device) would be a better use of funds? For example, let individual HDDs be discrete ZFS zpools (no mirroring, parity, L2ARC or ZIL involved), install 128GB of RAM (the max in many consumer systems, but cheap), and let ARC warm up. Nodes get restarted way more often than systems do, so ARC going cold on reboot shouldn’t be a problem.

I think testing this would mean restarting a node many times, to let filewalker run, and see if the ARC-only setup starts to approach the SSD-cached-metadata speeds?

Cool post!

zip · December 28, 2023, 6:16pm

Very nice work.
Would it also be possible to measure how long it will take to delete all the files as deletes are a significant part of the node I/O?
And do you know if bcachefs can be used as single drives (not a pool of drives as ZFS for example) and ideally with one cache drive for multiple of these single drives without a need to do partitions on this cache drive?
I’m thinking about doing some sort of caching, probably LVM cache in the near future, but was also thinking about doing bcachefs on top of LVM as with LVM cache you apparently have to create at least one cache LV for each slow LV, which might be inconvenient if you have many slow LVs.

Edit: I’m also subscribed to your Telegram but was only able to understand a couple of things, for example иностранный агент ("foreign agent") and a couple of curse words.

daki82 · December 28, 2023, 6:30pm

what about fragmentation? do you just assume the optimum?

syncamide · December 28, 2023, 8:36pm

Absolutely correct observation.

No, metadata will be written to the special device until it is completely full.

You can replace any of the drives inside the special device, increase/decrease count of drives inside the mirror, increase the number of mirrors, but you cannot remove the special device itself or decrease the number of mirrors in the RAIDZ configuration.

The main goal pursued by this article is to find the cheapest and most effective way to pass the filewalker. Storing all the metadata in RAM is a costly endeavor. Additionally, cache warming is something that I want to avoid.

Yes, I think it is definitely worth measuring. I will take care of it.

From what I know about bcachefs for now, I suspect that separate partitions on SSD or separate SSDs are required for metadata relocation from individual disks. That seems to be the case, but at the moment, I can’t say for certain. This is a question for further investigation, which I plan to undertake.

In this research, the aspect of fragmentation is not addressed.
To be honest, I don’t even know how to simulate fragmentation.

daki82 · December 28, 2023, 8:40pm

I read about something like storj-simulator here, but not sure what it does. nor if it simulates nodes being full and online.

syncamide · December 28, 2023, 8:46pm

I need a fragmentation simulator
To be honest, I don’t plan to consider the aspect of fragmentation at all in this series of articles.

daki82 · December 28, 2023, 8:50pm

That’s just fair for every FS

Toyoo · December 29, 2023, 2:00am

Nice work!

Some notes/questions:

On lvm thin:

Lvmthin may impact the measurements in a pretty significant way. One thing is the need to allocate space, which increases I/O for writes in your fio test. As such, fio will exaggerate the cost of performing initial writes.
Lvmthin will also impact the placement of writes. With the default chunk size of 32kB, for fio this means that when it tries to read a block of 128kB, it will have to read it from four different 32kB chunks. For file walkers this may mean that metadata ends up interspersed with data by your file copy procedure, as chunk allocations will likely be in the order of performed writes, whereas some file systems would normally try to optimize placement of metadata.
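The chunk arithmetic in the note above can be checked with a few lines of bash. A minimal sketch: the function name `chunks_touched` is hypothetical, and all sizes are in kB as in the note.

```shell
#!/usr/bin/env bash
# Hypothetical helper: chunks_touched OFFSET_KB SIZE_KB CHUNK_KB
# Prints how many thin-pool chunks a single request overlaps.
chunks_touched() {
    local offset=$1 size=$2 chunk=$3
    local first=$(( offset / chunk ))              # chunk holding the first kB
    local last=$(( (offset + size - 1) / chunk ))  # chunk holding the last kB
    echo $(( last - first + 1 ))
}

chunks_touched 0 128 32     # prints 4: an aligned 128kB block over 32kB chunks
chunks_touched 16 128 32    # prints 5: the same block, misaligned by 16kB
chunks_touched 0 128 1024   # prints 1: a larger chunk size keeps the block together
```

Whether those chunks are physically adjacent on disk depends on the order in which the thin pool allocated them, which is the actual concern here.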

As such: What was the chunk size? Can you try the same with plain lvm volumes or partitions? If you cannot, increasing chunk size as much as possible might be a decent alternative. Or, instead of measuring wallclock time you may try to measure the number of I/O operations done to the logical volume.
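Counting I/O operations instead of wall-clock time could be sketched as follows. It relies on the Linux block-layer statistics file `/sys/block/<dev>/stat`, where the first field is completed reads and the fifth is completed writes. The function names and the device name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: io_ops STAT_FILE
# Prints completed reads + completed writes from a block device stat file.
io_ops() {
    local reads _merges _sectors _ticks writes _rest
    read -r reads _merges _sectors _ticks writes _rest < "$1"
    echo $(( reads + writes ))
}

# Hypothetical helper: measure_io STAT_FILE COMMAND [ARGS...]
# Runs COMMAND and prints how many I/Os the device completed meanwhile.
measure_io() {
    local stat_file=$1; shift
    local before after
    before=$(io_ops "$stat_file")
    "$@" > /dev/null
    after=$(io_ops "$stat_file")
    echo $(( after - before ))
}
```

A run could look like `measure_io /sys/block/dm-3/stat du -s /mnt/node`, where `dm-3` stands for whichever device backs the logical volume. The counter covers everything hitting that device, so the volume should be otherwise idle during the measurement.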

You operate in the regime of high amount of RAM. How do you control the file walker imitations for caching?

At least for ext4, inode placement has a heavy impact on performance of file walkers. Plain file copy will not reproduce what you would get by operating a node: ext4 prefers to put inodes close to the inode of the directory the file is placed in (for nodes: temp), as opposed to the nice case where inodes for all files of a two-letter directory are close together (the result of a regular file copy). To reproduce this though you would have to reproduce the I/O operations done by a node. As such, your numbers for ext4 file walkers might be… unrealistically optimistic. I do not know whether the same happens to other file systems.

Alexey · December 29, 2023, 2:38am

For the node this will happen automatically during a filewalker on start.

syncamide · December 29, 2023, 12:01pm

I will check this on a “raw” device

not sure what you mean, but I can say that I do a remount operation and also flush caches before each test round
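The cache flush between test rounds could look like the sketch below. It uses the standard Linux `/proc/sys/vm/drop_caches` interface and needs root. The function name and the `DROP_CACHES_FILE` override are hypothetical, added only so the helper can be tried against an ordinary file. The remount step is not shown because it depends on the file system and mount point.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop_caches
# Flushes dirty pages, then asks the kernel to drop the page cache,
# dentries and inodes. DROP_CACHES_FILE is an assumption for dry runs.
drop_caches() {
    local target=${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}
    sync                   # write dirty data out first, so clean pages can be dropped
    echo 3 > "$target"     # 3 = page cache + dentries + inodes
}
```

Calling `drop_caches` as root before each du/find round means the walk has to read its metadata from disk again.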

Any ideas on how to simulate or reproduce a node’s I/O operations, or get a similar result, in the right way and with repeatable results?

Toyoo · December 29, 2023, 5:56pm

Sounds good.

The only idea I have is replicating the order in which files are created, moved and deleted. I’ve got a piece of Python code that did that based on node logs for my own benchmarks. I promised to share it a long time ago, but never had time… I’ll try to find some time next week.

arrogantrabbit · December 29, 2023, 6:33pm

It does not. This is not a realistic scenario. As @Alexey said above, the server is rarely restarted, and the vast majority of its operation happens with warm caches. Filewalker at start is a great way to prewarm them following a system reset.

Flushing caches makes this comparison meaningless: caches exist to improve performance. Comparing performance while disabling performance features of the file system makes no sense.

Separately, more details are required on the measurement methodology because some results don’t make any sense.

Coming up with a good benchmark is not trivial. For example, ZFS batches writes, if your test is smaller than the transaction group size you will get meaningless results. Or maybe that’s what you want given small amount of storj load.

Performance at max load vs 20% will also be drastically different.

odarriba · December 29, 2023, 6:39pm

In my experience (several years running nodes of 16TB) disk fragmentation is the biggest issue for performance. Once the disk is close to being full and deletes/new puts occur, performance degrades over time and seek times of the filesystem start to increase a lot.

When that happens, IO Wait of the system starts to grow too (as expected).

Also, have you considered the performance degradation on ZFS when it starts to be near to fully occupied? For me that made Ext4 a clear winner on the mid-long term.

aad · December 29, 2023, 7:30pm

From an engineering perspective, file listing speed is definitely interesting, but does it matter? btrfs is almost 3x slower than the fastest solution, taking a little over 8 minutes to complete. Sounds impressive, but given how infrequently files are walked (compared to overall uptime), there isn’t much to indicate which filesystem is better to maximize revenue.

I came to a similar conclusion, there is no advantage to COW systems, as user files are only ever created or deleted, never modified. Most of the features that make ZFS appealing aren’t needed in a storj node (i.e. no need to encrypt, no need to RAID, no point in snapshots, data is encrypted so not possible to compress).

Read and write caching systems can be separate from the filesystem on Linux (see bcache, dm-cache). I am partial to bcache on top of ext4 (different from bcachefs), and though I suspect it would perform as well as bcachefs, I don’t have any hard proof and am too lazy to create it ;).

odarriba · December 29, 2023, 7:45pm

It matters, there are at least two operations (filewalker and piece deleter) that need to list files massively.

My first node (a 10TB one, ~4 years old) takes about 30-45 minutes of full IO to perform a filewalker seek.

That means 30-45 minutes of failing uploads, downloads and potentially audits (but those are retried AFAIK). Now with lazy operation it behaves better, but nowhere near as well as when I copy an old node onto a new FS (which means days of copying, but then the filewalker runs in < 5 minutes).

zip · December 29, 2023, 8:03pm

This only is the case if you have enough memory to keep the data cached, which not always is an option. I even dare to say it mostly isn’t an option if we are to be following the rule of use only what you have, especially as the node grows.
Also, the goal is to determine how the file system performs, not how successfully the kernel caches the data, or how fast the memory subsystem is.

arrogantrabbit · December 29, 2023, 10:21pm

I strongly disagree.

Then that system is not suitable for anything, including running a storage node: You don’t have excess resources to run it, you don’t even have enough to run your primary storage, apparently. Because if you had — storj usecase is very lightweight.

Exactly, nodes don’t exist in a vacuum, they run on already existing and configured system. Adding ram is a no-brainer for half decent system performance for whatever the main reason the system exists, and storj gets to benefit from it too.

All those people here who complain that the storage node consumes all of the resources of their server just have no idea what they are doing, sorry. Storagenode performance requirements are minuscule. They don’t matter. The 200-300 IOPS they create under normal work are negligible noise. Filewalker takes 5 minutes reading metadata from cache/SSD, which is not an issue either.

Caching policy is part of filesystem design and implementation. It makes no sense to castrate a filesystem and see how it survives that. In addition, ZFS ARC behaves much differently than your standard Linux cache solution. It provides massive advantage and is pretty much required to get any half-decent performance — disabling it is silly. Running ZFS without enough ram is simply dumb — it wasn’t designed for that. Benchmarking it — moreso.

A proper benchmark will define requirements, build the system to get the best outcome from each filesystem according to the requirements, configure it properly, and test that specific benchmark.

Otherwise the value of such tests is zero.

zip · December 29, 2023, 11:17pm

A bold statement.

I believe there are users running this on various small computers, small NAS boxes, small VMs etc. and these by default don’t have much RAM, let alone an NVMe cache.
And these boxes are perfectly capable of running SMB, backups etc., but once you put a reasonably sized node on one and it starts doing GC filewalker, used space filewalker and trash dumps at the same time, combined with a not so great file system, you have a problem.
The file system choice is also sometimes limited. Not everyone can run ZFS, some people are running configurations where ZFS would be redundant etc.

But this doesn’t scale very well if for each node you need 16GB of RAM and an NVMe cache. What if someone were running 100 nodes, would that require 2TB of RAM and 16TB of NVMe cache? You also have to RAID the NVMe cache for redundancy, and it wears out.

Then those many hours spent on lazy filewalker implementation were wasted as will be the work on improving the trash handling.

We should be considering the very worst scenario with not enough RAM and no NVMe cache, because even 1% improvement by changing the file system for a more suitable one times 23000 nodes will overall improve the performance of the whole network.
So these tests do have value even if they are not exactly scientific.

arrogantrabbit · December 30, 2023, 2:05am

It would be interesting to see storj telemetry to assess how many operators are running on a potato. I feel those are negligible minority, and are doomed anyway: Hard drives IOps are capped, and any filesystem tweaking only gives you marginal improvement, moving the ceiling by a handful of IOPs: without drastic measures (offloading small files and metadata onto SSD) sooner or later those nodes will choke. This is a certain inevitability.

For example, certain users here are fixated on defragmenting the disk. This helps gain a few IOPS and buys a few months of growth, but does not change anything in the long run.

This is the same issue here: any tweaks will provide marginal improvement, while proper storage design for handling many small files would both improve storj usecase and the backup/filestorage/etc.

I absolutely believe this was a bewildering waste of engineering time.

Again, how many nodes are running on a potato? I guarantee you not 23000.

And filesystem tweaks only marginally change time to first byte. Putting stuff onto SSD — decimates it. It gives qualitative difference.

We should not optimize for the worst scenario. Folks running nodes on a 2GB raspberry pi’s are doomed anyway once network usage increases — they are already pushing hdd iops limit.

Adding an SSD does not merely improve performance slightly; it unlocks years of growth.

To condense what I’m trying to say into one thought: there is no reason to waste time on marginal improvements when a cheap and drastically better solution is available, even for raspberry pi users, who are by no means representative of the whole node population, because if they were, we would be in trouble.