BTRFS vs EXT4 vs ZFS Filesystem for storj

Hi folks,

just want to ask which filesystem do you use for storj.

I was using BTRFS for years, but this year I watched the I/O of the HDDs with nmon for the first time. I saw that all my HDDs were doing a lot of work even with nearly zero network traffic.
So I decided to transfer the node to a new HDD with ext4 instead of BTRFS. The rsync copy was so slow (100 GB in 10 hours) that I had to stop the storagenode and run rsync without the node running. That was quite a bit faster, but it still took 7 days for all the data.

I/O on my BTRFS HDDs was nearly always between 70-100%, while my new ext4 HDD was at about 4% or nothing.

After the copy, the node starts faster and the I/O on the HDD peaks at about 5%.

So I think BTRFS metadata handling and thousands of small Storj files don’t work well together. It seems that BTRFS is constantly optimizing the metadata for all those files.

BTRFS was slow as hell. What do you think about BTRFS for Storj? In my experience it’s not a good fit and really slow.

Does somebody use ZFS? What does the I/O of ZFS look like in nmon? Can somebody share their results?

In conclusion, I think ext4 is the best filesystem for Storj on Linux. BTRFS has its upsides too, but not for Storj. In my case the node was offline for 7 days; without taking it offline there was no chance to copy the data to the new HDD.

1 Like

Your assumptions are correct, the best FS for Storj on Linux is ext4.
BTRFS is not good for Storj, it’s also not production ready. The only normally working implementation is on Synology.
ZFS is useful only when you want all its features and are ready to put up with its shortcomings, including slower operation (in comparison with ext4). It’s also pointless without more than one disk (for a mirror) or at least three for raidz.
See

My experience: -t 300 not enough - #25 by Toyoo

Yeah, I’m totally with @Alexey here. ZFS has its uses, but a lot of the benefits of the file system offer little to no value for running a node and the performance will suffer due to overhead to support those features. It’s also a lot more resource intensive on CPU and RAM. It can be done right, but it can also be done very wrong. And in order to make up for performance loss you would almost need to integrate SSD cache into the setup as well.
Ext4 is lean and fast and does everything a storagenode needs without any of the additional overhead. I do recommend mounting with noatime to prevent writes every time a file is read. But with that you should be good to go.
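For reference, a minimal fstab entry with noatime might look like this (the UUID and mount point are placeholders for your own):

```shell
# /etc/fstab - example entry; replace the UUID and mount point with yours.
# noatime stops the kernel from writing an access-time update on every read.
UUID=your-partition-uuid  /mnt/storagenode  ext4  defaults,noatime  0  2
```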

though I have a suspicion that @SGC may be around soon with a counter argument :innocent:

5 Likes

BTRFS is more of a home-user filesystem - you will find it on a lot of closed-source NAS boxes etc., as it does offer a lot of good features and is better than plain RAID1 or RAID5… I would imagine that without an SSD/NVMe caching tier, it absolutely dies under a Storj workload on a large node.

EXT4 is just a filesystem, but after trying many, it actually does seem to work really well with Storj nodes - it’s basic, no frills, which makes it perfect! [And BS is right, noatime is a must in the mount options; it has a big impact on larger nodes.]

If you use LVM2 with an EXT4-formatted drive and then add an SSD cache with lvmcache (optimum size 1-2 GB only!), you will blow any ZFS setup out of the water - < 512 MB memory footprint and a good 50% read/write cache hit ratio - I have the stats :nerd_face: It also allows the filewalker to process around 1 TB every ~30 minutes, which can be good for things like SMR disks, with some kernel tweaks for TRIM :wink:
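A rough sketch of such a setup, assuming /dev/sdb is the HDD and /dev/nvme0n1p1 is a small SSD partition (device and volume names are examples, not a tested recipe):

```shell
# Put the HDD and the SSD partition into one volume group:
pvcreate /dev/sdb /dev/nvme0n1p1
vgcreate storj /dev/sdb /dev/nvme0n1p1

# Data LV fills the HDD; cache LV is the small 2 GB volume on the SSD:
lvcreate -n data -l 100%PVS storj /dev/sdb
lvcreate -n cache -L 2G storj /dev/nvme0n1p1

# Attach the SSD LV as a cache; writethrough means the HDD copy is
# always valid, so losing the SSD does not lose data:
lvconvert --type cache --cachevol storj/cache --cachemode writethrough storj/data

mkfs.ext4 /dev/storj/data
```

Writeback mode would absorb more writes but makes the SSD a single point of failure for the whole volume.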

#edit - added a graph to show the lvmcache stats on an rpi4. I know it shouldn’t make sense, but the Storj traffic involved so much writing/reading of the same piece, or constant reads of the same piece, that the cache hit rates were way higher than I expected…

ZFS is resource hungry, needing lots of memory; it’s incapable of handling different-sized drive geometry being added to an active pool (or whatever they call it) once created, and requires SSD caches just to reach average performance. I guess if you like to play, ZFS is fun to learn.

1 Like

Regarding EXT4, how do you all plan for and/or cope with the failure of your ext4 partition? Are you backing it up locally to another drive and if so what sort of program do you use? Or are you using some sort of redundancy scheme in LVM?

I ask because the road to building up a node is a long one, and disaster recovery could set someone back by months, yes?

  1. Backups are useless. As soon as you restore from a backup, your node would be disqualified for the pieces lost since the backup was taken.
  2. When you use LVM, do not use a simple (spanned, striped, etc.) volume across drives (RAID0) - with one disk failure the whole volume is lost. If you want to waste money, you can create an array, but only with parity or a mirror (RAID1), or RAID10; there are no other options. In this case ZFS would be better, because it’s able to recover from bitrot, unlike usual RAID5/RAID6. The most capable configuration is one node per HDD. And please do not try to put petabytes in one location - the current equilibrium (between uploads and deletes) is around 24 TB after years.

See RAID vs No RAID choice

As much as I like to promote the wonders of ZFS,
I have barely used anything else since I switched to Linux and started to run storagenodes,
so it’s difficult for me to really compare ZFS to the other options.

I am running EXT4 for my OS, but that doesn’t really seem faster or better…
It does keep working if the ZFS module in Linux fails for whatever reason, or if the ZFS pool isn’t imported by default…

I wouldn’t call ZFS slower; if anything it’s overall much faster than EXT4. Maybe not directly, but all the caching and other tricks ZFS does make many things run much better, like doing migrations using zfs send | zfs recv.
ZFS also has more options for caches and such things than EXT4.
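A migration with zfs send | zfs recv can be sketched like this (pool and dataset names are examples):

```shell
# First full copy while the node can keep running:
zfs snapshot tank/storagenode@migrate1
zfs send tank/storagenode@migrate1 | zfs recv newpool/storagenode

# Stop the node, then send only the changes since the first snapshot
# (-F on the receive side rolls the target back to the last snapshot):
zfs snapshot tank/storagenode@migrate2
zfs send -i @migrate1 tank/storagenode@migrate2 | zfs recv -F newpool/storagenode
```

Because the incremental send only ships the blocks changed between the two snapshots, the offline window stays short.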

However, ZFS does come with one major downside: it needs more resources in just about every way one can imagine. ZFS is best with more disks, more RAM, more CPU, more bandwidth, more SSDs for caching…

But if you tend it, ZFS will keep your data alive and error free, which does make life a little bit easier in some cases…

Using EXT4, on the other hand, basically only requires you to select it when you partition / create a filesystem; then it’s just something you basically never deal with again…

BTRFS…
well, it does sound fancy and has some nice features we long to see in ZFS, but I also hear that often those fancy features don’t work… or they’re not 100% stable… and stability is really something you want from a file system that one expects to be using for years…

Some of the EXT4 partitions created today will still be around in 10 years… sure, they might not see much use, other than being located on a disk that hasn’t quite made it to the trash yet…
but you will still be able to plug the disk in and access the data on it.
Storagenodes might take many years to really become good… do you really trust BTRFS to go 4 years without errors or revisions that outdate your pool?

BTRFS = sure, if you want to test it out, but I suspect you will eventually be forced to migrate the storagenode, if it survives…

ZFS = if you are running RAID, this is imo the only option, because there are few, if any, equally capable storage solutions… It won’t be quick to learn, and it won’t be light to run…
but it will be error free if you tend it a bit and use redundancy.
It does seem like alien technology; pretty much everything you think you know about filesystems and partitions is just different with ZFS.

EXT4 = it’s common, with good support and documentation, and easy to use. For many things this is the way to go… not because you want to… but because nothing else makes sense…

It’s not like we all don’t want a rocketship like ZFS or BTRFS or Ceph, but it’s difficult to drive a rocketship to the supermarket.

Yepp, I have the same problem with btrfs. I formatted my disk with btrfs as an experiment, and after 6 months I had extremely low I/O performance. I then mounted my btrfs partition with these parameters: defaults,nofail,noatime,nodiratime,nodatacow,noautodefrag. That increased I/O performance a little bit. Nowadays I run a full defragmentation of the btrfs partition every 1-2 months; after that I get normal performance again. The last defragmentation took 3 days.
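For reference, those options as an fstab entry plus the periodic defrag (the UUID and mount point are placeholders; note that nodatacow also disables data checksums and compression for newly written files):

```shell
# /etc/fstab - example entry with the options from the post above:
UUID=your-partition-uuid  /mnt/storj  btrfs  defaults,nofail,noatime,nodiratime,nodatacow,noautodefrag  0  0

# Full recursive defragmentation, run every 1-2 months:
btrfs filesystem defragment -r /mnt/storj
```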

I’ve never done any benchmarks, so I can’t tell if zfs is slower or faster compared to ext4/btrfs, but the 2 raidz2 nodes (one 4-disk - 17 TB, and one 5-disk - 20 TB, both with 16 GB of RAM) I’ve been running for over 3 years now still have 0 failures. And I did have a number of power outages during that time. If you don’t enable deduplication, RAM usage is tolerable (the ARC cache takes up 50% of RAM by default, but it can be configured to use less manually). So, if you can afford it, I’d recommend it for at least the reliability.

I am using bcache+btrfs on one of my storage nodes. The cache size is 64 GB, located on a NVMe device, writethrough mode, btrfs filesystem usage is 1.2 TB. The reported long-term read efficiency of the cache is 80%.

Without a NVMe cache, I agree that btrfs isn’t suitable for running a Storj node.

On the other hand: The question is whether running Storj nodes solely on HDDs without any NVMe caches is advisable at all - irrespective of whether it is btrfs, ext4, zfs or something else.


Running "time find ." in the top-level Storj directory in the bcache+btrfs filesystem finishes faster than running the same command in the top-level Storj directory of another node that is using ext4 (without bcache). But, bcache+ext4 (I haven’t tried this combination yet) might be faster than bcache+btrfs.


Running btrfs defragmentation on the whole Storj node is unnecessary, because the fragmentation of most Storj blob (data) files in btrfs is 1 or 2 extents. The main issue is fragmentation of btrfs directories and metadata - not fragmentation of file data. Thus, it should be sufficient to run "btrfs fi de" on all directories and without the -r switch (without recursion).
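A directory-only pass (as opposed to a full recursive defrag) could look like this; the blobs path is an example, and without -r the command acts on each directory’s own metadata rather than recursing into the files inside:

```shell
# Defragment directory metadata only, not file contents:
find /mnt/storj/storage/blobs -xdev -type d -exec btrfs filesystem defragment {} +
```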

If you get SSD cache involved the file system matters a whole lot less.

But it’s definitely more than possible to run a node just fine on a single EXT4 HDD without an SSD cache. In fact, if you don’t want to spend on SSDs (and honestly, you shouldn’t if it’s just for Storj), then by far the best performing setup for multiple disks is just running a node per disk in EXT4. That will spread out the load, and with 2 nodes or more your HDDs will never run into a bottleneck. (Though still avoid SMR if at all possible.)

1 Like

I’m in trouble using a DS920+ with 4 disks in SHR1 on BTRFS; the drives are running like crazy. The node uses around 2.65 TB. Is there a way to remedy this without buying new hardware?

My nodes are hosted on HDDs with no explicit flash cache of any kind. They work fine and became fast enough after migrating to ext4.

1 Like

You are suffering from what could be called I/O amplification at the RAID level. For every block you write to the array, you will generate a total of 1 read from another drive and then 2 writes to 2 independent drives (data and parity). So for that single block 3 drives must be active.

BTRFS then must do the same thing for all its processes (copy on write, inode tree etc).

All of the above could be made much worse if you have SMR based drives.

As for a remedy, apart from rebuilding your array to use a simpler RAID structure and/or file system, there is not much you can do. The best option would be (if you can), purchase a good USB drive and attach it to your DS920, using EXT4. You then use this for Storj until the drive fails.

Thanks for your answer. Would 2 NVMe SSD drives as a write/read cache remedy my problems? I use 4 × 16 TB Exos X16 drives (CMR). I really don’t want to restart the node in case I have a drive error.

It would, but I would seriously consider whether that is worth the expense. Unless it would be helpful for other things on the system too. If you do use it for other stuff, be sure to tell it to skip the cache for sequential IO to make the most efficient use of it and limit wear on the SSDs.

Large file copies don’t benefit from the SSD cache that much as the RAID arrays tend to be quite fast for larger operations to begin with.

Honestly, the only things I do with my NAS are running Storj, Home Assistant, and file backups. I try to avoid breaking the SHR1 array because I don’t have enough space to store the data elsewhere. The Synology DS920+ can take 2 NVMe drives to use as a write/read cache. Right now utilisation is 32% - would it go up once I have 10 TB of stored data? Right now I have 2.65 TB.