Best filesystem for storj

:popcorn:

2 Likes

Nodes can still be run on potatoes (though I always thought a Raspberry Pi was a berry?), but one node per potato (Raspberry Pi with 1 GB), please.

2 Likes

The original benchmarks are very interesting. And I appreciate that they probably took a long time to set up.

But there is also the question of how much performance you really need.

For instance, in my anecdotal experience of running nodes on single 8 TB drives, ext4 and zfs both perform fine. Btrfs also performs acceptably, but is worryingly slower, as its disk utilization percentage was significantly higher.

I’m storing the databases on an SSD in all cases though.

So if I have a setup where performance is fine, then adding any sort of metadata caching layer would just add cost and complexity without generating any tangible improvement?

Of course, having larger single nodes, or very old nodes with a lot of fragmentation, may make it worse.
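
As for keeping the databases on an SSD, this is roughly how it's done (a sketch with hypothetical paths; the option name should be checked against your config.yaml, and the node must be stopped first):

```bash
mkdir -p /mnt/ssd/storagenode-db
# point the node at the new database directory
echo 'storage2.database-dir: /mnt/ssd/storagenode-db' >> /mnt/hdd/storagenode/config.yaml
# move the existing databases over, then start the node again
mv /mnt/hdd/storagenode/storage/*.db /mnt/ssd/storagenode-db/
```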

1 Like

I experienced much slower processing on BTRFS than on ext4 or even zfs (all on a single drive without a cache device), roughly 3-4x slower than plain ext4.
If you also keep the databases there (on BTRFS), it will crawl during regular operations until it stalls.

Plus, when I tried to migrate nodes to a new disk using rsync, moving off a btrfs volume was sooo slow, over a week, compared with other file systems which took merely days.
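
For what it's worth, the pattern I mean is the usual one (a sketch; paths and the container name are hypothetical): a few passes while the node keeps running, then a final pass with the node stopped:

```bash
# repeat this pass until the remaining delta is small; the node stays online
rsync -a --info=progress2 /mnt/old-disk/storagenode/ /mnt/new-disk/storagenode/
# final pass with the node stopped, so nothing changes underneath
docker stop -t 300 storagenode
rsync -a --delete /mnt/old-disk/storagenode/ /mnt/new-disk/storagenode/
```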

1 Like

I wanted to solve this issue using LVM snapshots (I just wanted to dd the snapshots at the block level; that's why I started testing on LVM in the first place), but the way snapshots work in LVM is just terrible: once a snapshot exists, any change also modifies the lower layers at the block level. I really don't understand what kind of nonsense that is.
Sending ZFS snapshots is still faster than rsync, but it still doesn't satisfy me.

I agree that the node is a very lightweight application, but solving the file walker issue with increased memory usage is not part of my plans, nor of the "don't buy anything specifically for running nodes" concept. I just want to make the file walker as painless as possible. I believe that the performance of any file system is sufficient for basic node operations such as reading and writing pieces.

I should have formulated my thought more clearly: I want to avoid waiting for cache warm-up

Yes, specifically for ZFS I did two passes: the first one before filling the disk with files and the second one after. The results differ only within the margin of error.

This works terribly: I haven't found any way to make ZFS write metadata to L2ARC in any predictable manner. Moreover, even when the metadata would occupy only 2-3% of the capacity allocated to L2ARC, it's impossible to make ZFS keep it there, as it is still prone to eviction. It's just impossible, no matter what you try.
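
For reference, this is the kind of tuning usually suggested for it (a sketch, assuming OpenZFS 2.x; the pool and dataset names are made up), and in my case none of it made the eviction behaviour predictable:

```bash
# cache only metadata in L2ARC for this dataset (keep data blocks out of L2)
zfs set secondarycache=metadata tank/storagenode
# persistent L2ARC and MFU-only feeding (OpenZFS 2.x module tunables)
echo 1 > /sys/module/zfs/parameters/l2arc_rebuild_enabled
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly
# allow L2ARC to fill faster than the conservative default
echo $((64*1024*1024)) > /sys/module/zfs/parameters/l2arc_write_max
```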

1 Like

You’re doing it wrong. You should have used vgextend, pvmove and vgreduce to move a volume group to another disk. This can be done online without unmounting the filesystem.
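
Roughly like this (a sketch; the device names and VG name are made up):

```bash
pvcreate /dev/sdc1              # prepare the new disk as a physical volume
vgextend vg_node /dev/sdc1      # add it to the existing volume group
pvmove /dev/sdb1 /dev/sdc1      # migrate all extents off the old disk, online
vgreduce vg_node /dev/sdb1      # drop the old disk from the volume group
pvremove /dev/sdb1              # clear the LVM label from it
```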

1 Like

I need to transfer the LV to another system over the network.

LVM is not capable of doing so… You may use a thin-provisioned LV; then the snapshot would be static. However, LVM cannot import such a snapshot out of the box, though there are workarounds…

I would not recommend using them, though.
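
If you do go that route, it would look roughly like this (a sketch; the VG, LV and host names are hypothetical):

```bash
# create a thin snapshot of the node's LV (no size argument needed for thin snapshots)
lvcreate -s -n node_snap vg/node_lv
# thin snapshots carry the activation-skip flag, so -K is required to activate them
lvchange -ay -K vg/node_snap
# stream the frozen snapshot to the other system over the network
dd if=/dev/vg/node_snap bs=1M status=progress | ssh newhost 'dd of=/dev/vg2/node_lv bs=1M'
lvremove -y vg/node_snap
```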

The bottom layer under the snapshot will be corrupted after dd.
Snapshots in LVM work in a very weird way.
Of course I want to dd while the node is operating. Otherwise there would be no need for LVM or snapshots, since I could just dd it in an unmounted state.

Really an interesting discussion here. Thanks to all of you.
Sure, I am totally lost in this… but

My 15-year-old Windows laptop is working as a server for me :wink:
It does a lot of good work, including Storj.

So far I have 32 nodes running, each around 700 GB.
Most of my HDDs are slow external ones with little capacity.

A pretty small partition is cool… and…
RoboCopy.exe is my friend for backing up a partition if I have to clean it up with chkdsk.
I have written a small batch script to do this work overnight sometimes.

I am prepared to lose a node by accident. It doesn't really matter to me.
This way I prepared my nodes over many months (at least 15) to become profitable…
After the last payout change this has become even more important for us.
My total payout was reduced by around 40%.

This is my way of handling Storj space with 32 nodes.
I hope this gives you a completely different way to look at it.

Be kind

1 Like

Have you maybe tried any of those tools? link

Best filesystem for storj part 2
Guys, hello everyone.

Barely a year has passed since New Year's, and I've already nearly worn out my hard drive with tests and prepared another note comparing file system performance.

Today I looked at the impact that an LVM layer has on the different file systems running on top of it. For testing, I used fio with the same parameters as in the first part; two identical tests were conducted for each file system, on LVM and without it. The partition is still 2400G, no data had previously been written to it, and the test was run on an empty file system. In the case of BTRFS the result was slightly unexpected and I have no explanation for why it turned out this way; at first I even thought I had accidentally run the test on an SSD, but after running it again the picture was exactly the same. Take a look at the graphs.


Comparison of read speed with and without LVM, in MiB/s (higher is better)


Comparison of write speed with and without LVM, in MiB/s (higher is better)
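
For those who want to repeat it, the commands had roughly this shape (illustrative only; the real parameters are the ones listed in part 1):

```bash
# sequential read and write against the mounted test file system
fio --name=seqread  --directory=/mnt/testfs --size=10G --rw=read  --bs=1M --direct=1 --ioengine=libaio
fio --name=seqwrite --directory=/mnt/testfs --size=10G --rw=write --bs=1M --direct=1 --ioengine=libaio
```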

I also supplemented the main test with two more file systems: ntfs and btrfs+meta. In btrfs it is not possible to completely move the metadata to a separate disk, but I managed to create a semi-hybrid ssd + hdd volume in which the metadata is in raid1 mode (that is, on both disks) and the data is in single mode, effectively only on the hard disk. I reproduced the file listing on Windows with the command dir /s /b, and initially planned to calculate folder sizes with Mark Russinovich's du utility; however, its calculation time seemed suspiciously long (the code appears to be suboptimal), so instead I simply opened the folder properties and used a timer to measure how long it took to determine the occupied space.
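
The hybrid volume was created with something along these lines (a sketch, not my exact commands; device names are made up):

```bash
# metadata mirrored across SSD and HDD, data chunks "single" (they end up on the much larger HDD)
mkfs.btrfs -m raid1 -d single /dev/sdb1 /dev/sdc1
mount /dev/sdc1 /mnt/btrfs-meta
btrfs filesystem usage /mnt/btrfs-meta   # check where the metadata and data chunks actually landed
```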


fio read and write speeds in MiB/s (higher is better)


File listing and occupied-space calculation time in seconds (lower is better)

In the next note I plan to use a script to reproduce the operation of a storj node (in terms of writing and deleting files) in such a way that the disk becomes thoroughly fragmented, and then to measure the time it takes to copy the data (à la rsync and robocopy) as well as the time it takes to delete all the files, roughly along the lines of the sketch below.
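
Something like this (not the final script; paths and numbers are placeholders):

```bash
#!/bin/bash
# write many randomly sized files and periodically delete a random subset,
# to fragment the disk roughly the way a long-running node would
TARGET=/mnt/testfs/blobs
mkdir -p "$TARGET"
for round in $(seq 1 100); do
    for i in $(seq 1 10000); do
        size=$(( (RANDOM % 2048 + 4) * 1024 ))          # 4 KiB .. ~2 MiB, like typical pieces
        head -c "$size" /dev/urandom > "$TARGET/${round}_${i}.sj1"
    done
    # delete about a third of what was just written, imitating GC / trash cleanup
    find "$TARGET" -type f | shuf | head -n 3000 | xargs -r rm --
done
```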

Subscribe to my Telegram channel @temporary_name_here and
Stay tuned…

1 Like

I don't believe these tests are representative of big nodes with not enough RAM.
Do a test with a Storj read/write pattern on a filled 20 TB node with the same number of files and sizes as a real storagenode (if you want, I can provide statistics from US1 for 4 TB of data: the number of files for each size), 16 GB of RAM and no SSD cache. And even then you are not close to real usage. You should mimic the maintenance services as well, like the filewalker, GC, etc.

1 Like

The series of articles is a summary of my personal experience operating storj nodes and does not claim to be the ultimate truth, but rather represents my personal opinion.
Seriously, one of the main things I want to do is to get the cheapest setup possible.

I am trying! :smiley:

2 Likes

It seems to be a well-known fact that ZFS is not about speed in general.

Probably storj is the only case where it’s better to use single disks (which I am doing and intend to continue doing).

It's strange to me; everything I know about Germany doesn't even let me imagine that such sayings could be true. In my perception, everything related to Germany is always about practicality, precision, meticulousness, and so on.
It breaks my mental model :)

To me the BTRFS results look too good to be true for a storagenode, since we have multiple confirmations that BTRFS is almost the worst FS for a storagenode: Topics tagged btrfs

1 Like

For me too, especially since an additional layer of complexity is involved. Although I think it mostly has to do with fragmentation of the metadata due to its COW nature, so especially when the node is some months old. And I think these benchmarks would be most interesting to perform at different 'life cycle' stages of a node.

2 Likes

How about Lustre and Ceph? Any chance of adapting those filesystems for storagenode purposes, or are they rather a far cry for us? Lustre supports both ext4 and zfs, but it might not be an easy setup. Ceph is supported by Canonical LXD / MicroCloud, but its default setup is not really suitable straight out of the box. Disclaimer: I do see some potential benefits, but those filesystems were designed for rather large environments. Anyway, apart from NFS, are you maybe considering doing some testing on them in the foreseeable future @IsThisOn? :- )