Information for SNOs regarding ZFS/btrfs vs. ext4 on spinners

Hi all,

My storage node wasn't performing as it used to, so I started looking into it. I set it up 13 months ago on a single 3 TB 7200 rpm drive to get into the platform. Today it was simply not performing; loading the dashboard took 10-15 seconds.

I had 3.6 million files on 2.8 TiB of storage, which shouldn't be an issue. However… it took 15 seconds to 'ls' in a blob folder, 3 hours to run an ncdu on the volume, and almost half a day for the filewalker to traverse the drive after a restart.

It seems that CoW filesystems handle the recurring deletion and insertion of ~2 MB files, and the associated metadata churn, poorly.

I've tried 'btrfs filesystem defrag' and mounting with '-o autodefrag', but the metadata seems to be the problem, and metadata doesn't get defragmented by either of those. So I'd advise against running storage nodes on CoW filesystems; prefer traditional RAID + ext4 or other conventional filesystems. I haven't had the chance to see how XFS or ReiserFS (if anyone still uses that) holds up under this type of workload.

I spent 4 days rsync'ing the files to a 4 TB 5900 rpm drive - on ext4 - and 3 hours rsyncing again while the storage node was down, to catch up on changes. Based on iostat, reading from the source drive took approx. 140 IOPS, while writing the same data to the destination took only 16 IOPS, at around 7.5 MB/s.

I switched the mount points, and…

Now all the slowness of the dashboard is gone, it takes 0.0x seconds to ls in a folder, and scanning the entire volume takes 80 seconds. The node starts tremendously faster, and the disk isn't constantly grinding - and that on a disk with a slower rotational speed. I probably lost a lot of uploads on account of all this.

Use the above information as input to your own decisions.

@storjteam - do you have a load simulator that creates files and deletes them like the network would do, in order to test the storagenode filesystem workload?
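In the meantime, something along these lines could serve as a crude stand-in: a shell sketch (the target path, file counts, and round count are all made up) that churns ~2 MB files the way the network churns pieces:

```shell
#!/bin/sh
# Crude churn simulator: repeatedly create ~2 MB pseudo-blobs and delete
# roughly half of them, to approximate storagenode piece turnover.
# TARGET, FILES and ROUNDS are arbitrary example values.
set -e
TARGET="${TARGET:-/tmp/churn-test}"
FILES=20     # files created per round
ROUNDS=3     # create/delete cycles

rm -rf "$TARGET"
mkdir -p "$TARGET"
for round in $(seq 1 "$ROUNDS"); do
    # write FILES pseudo-blobs of ~2 MB each (urandom defeats compression)
    for i in $(seq 1 "$FILES"); do
        dd if=/dev/urandom of="$TARGET/blob-$round-$i" bs=1M count=2 status=none
    done
    # delete half of the current listing to force free-space churn
    ls -1 "$TARGET" | head -n $((FILES / 2)) | while read -r f; do
        rm -- "$TARGET/$f"
    done
done
echo "files remaining: $(ls -1 "$TARGET" | wc -l)"
```

Point TARGET at the filesystem under test, scale FILES/ROUNDS up, and watch iostat while it runs; it won't reproduce the real metadata patterns, but it does exercise the same create/delete cycle.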

Kind regards,


I use BTRFS without issues now, but I had to disable CoW for future writes and remove CoW from my existing files. My current BTRFS /etc/fstab settings are: defaults,nofail,noatime,nodiratime,nodatacow,noautodefrag.
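Spelled out as a full fstab line (the UUID and mount point are placeholders), that's:

```
# /etc/fstab -- example only; UUID and mount point are placeholders
UUID=xxxx-xxxx  /mnt/storagenode  btrfs  defaults,nofail,noatime,nodiratime,nodatacow,noautodefrag  0  0
```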

I ultimately had to disable CoW, shut down my node, move the data aside, and copy it back (rewriting it without CoW), then restart the node. I also tried a defrag before this (in theory that would let you do everything with the node online), but it was essentially locking up my node - watch your RAM usage balloon when the drives get slow.
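In commands, the rewrite roughly looked like this (paths are examples only; chattr +C affects only newly created files, which is why the data has to be copied back rather than flagged in place):

```shell
# Sketch only -- needs a real btrfs mount and a stopped node.
# docker stop storagenode
mv /mnt/node/storage /mnt/node/storage.cow    # set the old CoW copy aside
mkdir /mnt/node/storage
chattr +C /mnt/node/storage                   # new files here skip CoW
cp -a --reflink=never /mnt/node/storage.cow/. /mnt/node/storage/
# verify the copy, remove storage.cow, then: docker start storagenode
```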


Thank you for your input @Doom4535

For your isolated case, aren't you basically getting all the hassles of btrfs without any of the benefits? I guess you have a reason - could you elaborate?

Multi-device setups (mdadm/LVM) with ext4 seem much more mature, and compared to btrfs without CoW they offer a comparable feature set while being lighter.

With nodatacow you lose (almost) all of btrfs's inherent integrity checks, yet the filesystem is still resource-heavy compared to plain ext4.

Thank you for detailing your mount options.

I still use btrfs on SSD, where it doesn't seem to be much of an issue (although the maintainers warn about some workloads even on SSD).

I wouldn't call it a hassle; the BTRFS tools are actually pretty snazzy, just different from what most folks are used to. If I wanted, I could still snapshot it (if I migrate to another server, I would likely use this), and it will create a sort of single-level CoW for the snapshot (Stack Overflow). I'm currently using single-disk volumes (there is constant back and forth on which is better for StorJ), so I'm not using the RAID features, but I like how BTRFS lets one transparently add drives to increase the space - essentially mdadm with LVM rolled into the filesystem, I guess.

I haven't benchmarked it, but my understanding is that the journal used by ext4 (which can also be turned off, much like I did with CoW) isn't the best for database-like loads. My thinking was that BTRFS without CoW might be better than ext4 without a journal, and I'm willing to accept an increased risk of an incomplete write if it lets me increase my total writes, with the StorJ network being responsible for ensuring that data integrity is maintained.

Is it better than ext4? I have no idea; I'd need to get another drive to compare against. I don't believe anyone has posted a definitive performance comparison yet, especially for nodatacow.

Reddit FS benchmarks
Some older nodatacow benchmarks
Another older benchmark for RAID with CoW


I'm using ZFS on all my nodes (3 single HDDs) and CoW is not a problem. In fact, I think my nodes perform a lot better than they would on ext4 or btrfs because of the ARC/L2ARC cache, which is great for the database accesses. Additionally I use an SSD SLOG that absorbs the synchronous database writes, which keeps the HDD load very smooth. No latency issues with the dashboard or commands, even while the filewalker process is running.
Personally I can only recommend ZFS if you have enough RAM and an old SSD you can throw in as SLOG and L2ARC. (I use a very old 64 GB SSD, but even new SSDs in small sizes are really cheap now.)
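For anyone replicating this layout, attaching the SSD partitions amounts to something like the following (pool name and device paths are placeholders):

```shell
# Example only: split the old SSD into two partitions, then attach them.
zpool add tank log   /dev/disk/by-id/ssd-part1   # SLOG for sync writes
zpool add tank cache /dev/disk/by-id/ssd-part2   # L2ARC read cache
```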


I use ext4 inside a zvol.
Yes, if I tried to copy all of the data it would probably not be very fast. However, it is extremely unlikely that the node would access the files in that manner; it is far more likely to read files randomly (based on how customers access them) than one-by-one like rsync does. So while ZFS performance with rsync is not great, it shouldn't matter in a real situation.
Also, my node VM and the host have enough RAM to cache the metadata - if I restart the node (without restarting the VM, say after an update), startup is almost instant.
Writes are accelerated a bit using a SLOG, and I could use L2ARC to cache reads, but that isn't necessary right now.

Using ext4 on md-raid would prevent me from using the array for anything other than the node. Also, I think ext4 would become just as fragmented after a while: now that you've rsync'ed all of the data onto ext4 it is stored neatly in order, but if those files had been created and deleted as part of normal node operations, they would be just as fragmented as on ZFS.


Yeah, I migrated from btrfs to ext4 and found a 10× reduction in write IOPS. So, well, I agree.


I share 150 TiB across multiple hosts, all on ZFS, and don't have any problems. I offload the databases, orders, and ZFS metadata to an SSD mirror. Nodes start serving downloads within seconds of starting, and listings are reasonably fast too. Migrating a node runs at connection speed because I use zfs send/receive instead of rsync.
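The send/receive migration looks roughly like this (dataset and host names are placeholders):

```shell
# Snapshot the node dataset, stream it to the new host while the node
# keeps running, then send only the delta with the node stopped.
zfs snapshot tank/storagenode@migrate1
zfs send tank/storagenode@migrate1 | ssh newhost zfs receive -F tank2/storagenode

# stop the node, take a second snapshot, send the incremental difference:
zfs snapshot tank/storagenode@migrate2
zfs send -i @migrate1 tank/storagenode@migrate2 | ssh newhost zfs receive tank2/storagenode
```

Because the stream is sequential at the block level, it avoids the per-file random I/O that makes rsync slow on a fragmented pool.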


Cool guys, thank you for all the input! It is much appreciated.

I'll monitor the performance and report back if anything worth mentioning happens. My next endeavor, when this ext4 setup implodes, will be ZFS with an SSD thrown in for caching. :slight_smile: