pvmove is strictly better than dd for moving, since I can do a live copy. Plus, I also get access to tools like lvmcache when necessary.
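For anyone who hasn't used it, a minimal sketch of such a live migration looks like this (the volume group vg0 and the devices /dev/sda1 and /dev/sdb1 are placeholders for your own layout):

  pvcreate /dev/sdb1            # prepare the new disk/partition as a physical volume
  vgextend vg0 /dev/sdb1        # add it to the existing volume group
  pvmove /dev/sda1 /dev/sdb1    # migrate all extents live; the LVs stay mounted throughout
  vgreduce vg0 /dev/sda1        # detach the old PV once it is empty

pvmove can also be interrupted and resumed, which is what makes it comfortable to run underneath a live node.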
I have had a horrible experience with ZFS on Storj. I have a large drive array of 8 drives, about 40TB of storage total, and figured I could use a lot of extra space to make money. I had the same idea too, that if a drive dies, I don’t lose my node and don’t have to start from scratch, earning less for the first 6 months.
Storj seemed to work okay, but over time zpool scrub would find issues with files, and eventually the zpool would completely lock up; I would have to mount it read-only, back everything up off of it, and rebuild it from scratch.
This happened about three times over a little more than a year. I changed the SATA cards to more advanced ones, added a better, more powerful power supply, moved drives around; nothing helped. Storj just seems to overwhelm ZFS pools for some reason.
Since I restarted from scratch on just one of my spare drives running ext4, Storj has had no issues, and my zpool has no issues.
I have no idea what's actually causing it, and maybe the devs will figure something out, but for now I don't trust it. The last time this happened was last summer. Also, copying the Storj data off the read-only pool to a backup would take a week for an 8TB node due to the millions of files.
This does not even approach the territory of “large drive or array”.
Then your disks are dying, or you have bad ram, or your power supply is trash, since you have already replaced the data cables (hopefully twice, because you can still replace a bad cable with another bad cable. It's very hard to find SATA cables that are not abhorrent trash; you would likely need to find used ones from old enterprise computers. None of the crap on Amazon is any good). But this has nothing to do with ZFS. It's doing its job. Ext4 does not have scrub or checksumming, so the issues are still there, but you don't have a way to check. You just silenced the smoke alarm, essentially; the house is still on fire.
The probability that you uncovered a zfs bug is zero for all intents and purposes.
I would retest and reevaluate your setup. For all I know it’s now silently corrupting your data.
How much ram was on the system? Under 8GB you would not get any reasonable performance. Moreover, you would likely get hangs and crashes when the disks can't keep up with the IO pressure. When set up correctly, the filewalker shall result in zero IO to the magnetic disks; everything shall be served from ram or SSD.
Ext4 has lower (but not by much) resource requirements. You still need enough ram to fit the metadata to get any reasonable performance.
This does not even approach the territory of “large drive or array”.
Larger than 3 or 4. Obviously not a datacenter, but more file storage than most people have. Reading and writing happen across all 8 drives at the same time, btw, with the pool structure being Stripe(RaidZ1(8TB, 8TB, 8TB, 8TB), RaidZ1(8TB, 8TB, 8TB, 8TB)), where one drive in each RaidZ1 can fail and it'll keep working. I did it this way so I only need to replace 4 drives to increase storage.
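(For reference, a pool with that shape is created with two raidz1 vdevs in one zpool create; the pool and device names below are placeholders:)

  zpool create tank \
    raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd \
    raidz1 /dev/sde /dev/sdf /dev/sdg /dev/sdh

ZFS stripes across the two raidz1 vdevs automatically, and each vdev tolerates the loss of one disk.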
Then your disks are dying, or you have bad ram, or your power supply is trash,
Yep, thought of that. Changed the ram and tested both. Checked the disks with SATA and other tests. No issues, and no issues when just serving files without Storj. The power supply was replaced and the wattage tested; it's providing more than enough (1200-watt ASUS).
It’s very hard to find sata cables that are not abhorrent trash;
Changed those to high quality ones. Then dumped the cheap SATA card and went to SAS, now using high quality SAS cables. Still problems with Storj, none without it.
Ext does not have scrub nor checksumming, so issues are still there, but you don’t have a way to check.
It’s now on a separate single drive, on its own power cable, on its own SATA port (not SAS), and I run scans on it once in a while. No problems so far. No problems with my ZFS pool either.
The probability that you uncovered a zfs bug is zero for all intents and purposes.
I'm not so sure. It's common to find warnings saying "DO NOT USE ZFS FOR DATABASES!". Scrubs of my ZFS data consistently come back clean now. Before, a scrub would find issues with a random one or two large files, but when I scrubbed again it would find those files to be perfectly fine. This would happen almost every time, since it was doing Storj work during the monthly scrub, until it locked up after many months. Even after it ended up in read-only mode, all the data was fine and nothing was corrupt when I backed it up and checked it. Just the file table would get errors, preventing ZFS from mounting it normally. Possibly something was being overwhelmed or not written correctly.
How much ram was on the system?
32 gigs. Not much running besides the OS, Storj, and a few media file servers.
As I said, nothing changed hardware-wise. Storj was moved to its own ext4 HDD running on the old SATA cable, and now the ZFS storage is fine and Storj is fine.
Short of changing the motherboard and CPU, there's not much left to change.
This is very interesting. What zfs version did you have and what features were enabled? It would be interesting to figure out what went wrong there.
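(If the pool is still around, something like this would show the relevant details; tank is a placeholder pool name:)

  zfs version                          # userland and kernel module versions
  zpool get all tank | grep feature@   # which pool features are enabled/active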
I have pretty much the same configuration (three 4-drive raidz1 vdevs instead of two) and run two storagenodes per machine, on two different machines (Supermicro X9 and X10). One of them also has exactly 32GB of ram. Neither has experienced anything remotely like what you describe, even through the test data waterfall stage last year. TrueNAS Core 13 is the OS on both.
The recommendation against databases on ZFS I have never heard, but it likely stems from the higher IOPS requirements of redundant vdevs. The solution is to force them onto SSD by adjusting the small blocks size. This massively helps even with the Storj dashboard performance, which is not unexpected: small transactions are best served from SSD, large ones from disks, to hide latency. That's the intended mode of operation.
To answer the original question - on HDDs, ZFS is actually quite good in terms of performance. Most of the optimization it does is centered around HDDs.
Are we sure that ZFS is more hardware-intensive than ext4 with the primarycache=metadata and secondarycache=metadata settings? There is no waste of resources caching file data (which Storj doesn't need), just focusing on metadata. If you want to lighten the CPU load you can also turn off compression.
Likely on par or better. However, if you used an SSD with LVM, you would likely get the same outcome, or in some cases a little better. The difference only appears if you use a redundant array; in that case ZFS would be able to correct corrupted data.
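For reference, the settings mentioned above are ordinary per-dataset properties; a minimal sketch, assuming a dataset named tank/storj:

  zfs set primarycache=metadata tank/storj     # ARC (ram) caches metadata only
  zfs set secondarycache=metadata tank/storj   # L2ARC, if present, caches metadata only
  zfs set compression=off tank/storj           # optional, only to shave CPU load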
I am using a ZFS DRAID1 (one parity, zero spares; equivalent to Z1 but better, I think).
I don't have an extensive period of use yet to say how well it works.
ZFS has more parameters to look after, but careful tuning makes it very good.
The integrity checks and the ability to self-repair are very good.
Performance-wise it should be on par.
Normally ZFS requires more RAM, but in Storj's case we don't need dedup, and my memory is actually not hogged at all.
So yes, I am quite satisfied with ZFS so far. I recommend it, especially since one can do main storage on HDDs and use SSD partitions for the filesystem metadata and, optionally, for the very small pieces (special vdev), plus another SSD partition for the SLOG.
This all means that the writes to the rotational drives are batched sequential writes.
I did not yet go for L2ARC, since Storj does not have much hot data access.
I keep the DBs on an SSD pool too, and zfs send/receive is an easy way to incrementally back up the DBs as well.
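A sketch of that incremental backup (pool, dataset, and snapshot names are made up):

  zfs snapshot ssdpool/storj-db@backup-2
  zfs send -i ssdpool/storj-db@backup-1 ssdpool/storj-db@backup-2 | zfs receive backuppool/storj-db

Only the blocks changed since the previous snapshot travel over the pipe.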
I'm also using ZFS and have my piece storage on the main pool, to use my deployed but otherwise unused HDD space.
I have the DBs and badger cache on an SSD.
As for L2ARC, I considered it but discarded the idea for now.
I use a DRAID1 pool (4 devices, 6TB each; DRAID1 is a bit like Z1),
I use 2 special vdevs (an SSD and an SSD partition),
I use a SLOG (ZIL) on another SSD partition. This way I can write to the disks only once a minute, no more frequently, to maximize rotational drive lifetime (see the sketch below). The SLOG persists data immediately (at least on fsync, with sync=standard in ZFS).
The L2ARC needs special tuning and I am not sure it is worth it on Storj.
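The once-a-minute flushing mentioned above is, as far as I know, achieved by raising the transaction group timeout from its 5-second default; a sketch on Linux (it is a module parameter, so it resets on reboot unless also set in /etc/modprobe.d):

  echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout   # batch async writes into ~1-minute transaction groups

Sync writes still land on the SLOG immediately, which is what keeps the longer interval safe.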
By default ZFS writes to the L2ARC very slowly, even though the devices are SSDs.
You could use a special vdev to send files smaller than, say, 16KB to SSD.
This will put many files on SSD. Even a majority of the files (pieces) will occupy only a fraction of the space, since they are so small.
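That threshold is the special_small_blocks dataset property; a minimal sketch, assuming a dataset tank/storj and a special vdev already present in the pool:

  zfs set special_small_blocks=16K tank/storj   # blocks of 16K or smaller are allocated on the special (SSD) vdev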
A special vdev on SSD does all the work for you, since it holds the metadata; that's what gets hammered by Storj. The L2ARC is just the poor man's cousin (and doesn't require redundancy). I forgot to mention that the L2ARC is configured for metadata only anyway, so after it fills it's similar to a special vdev for reads only, not writes.
It's astonishing how well it works. Maybe not interesting for a single-node system… but if you have a few HDDs (or a system otherwise tailored for Storj), it's well worth having a pair of SSDs that you can partition into special-metadata mirror devices (and have them hold the DBs while they're at it). They don't have to be large.
Like… you never see used-space-filewalker in your process lists anymore… because it completes in less than a second.
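A sketch of adding such a pair (partition paths are placeholders; the mirror matters because the special vdev holds pool-critical metadata):

  zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1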
For me it was completing in 10 min; likely due to my anemic under-clocked CPU.
IOPS were peaking at 10k at about the 10-second mark, then subsiding exponentially as things got progressively cached in ram.
It's noteworthy that the filewalker walks in a single thread. That makes total sense most of the time, so I have 47 idle cores and one saturated core's worth of storagenode work.
(Please, let no one get ideas to parallelize the filewalker! It's just an observation; it's not a problem in any way. There is nowhere to hurry, and for the vast majority of cases a single thread is better!)
I went another way: a single pool, with a pair of PCIe SSDs holding metadata for the entire pool; if I want a specific dataset to reside on SSD exclusively, I crank the small-file size property up to be larger than or equal to the record size.
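A sketch of that trick, with tank/fast as a placeholder dataset:

  zfs set recordsize=128K tank/fast
  zfs set special_small_blocks=128K tank/fast   # threshold >= recordsize, so every block lands on the special (SSD) vdev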
Which "recordsize" (block size) do you recommend for a dedicated Storj ZFS? The default on HDDs is around 4K, while the ZFS standard is 128K. Should I set it to 4K on my Storj dataset to improve performance?
Leave record size at the default, or set it to 1M. It does not matter. Leave compression enabled. Unlike other filesystems, ZFS does not waste space; in other words, if the record size is 1M and a file is 5K, it will take 5K on disk (roughly).
(If you are referring to the block size of the various SSDs in your pool, set it to 4K when adding them, via the ashift parameter. But since you mentioned the default of 128K, that's the record size. I'm not sure why you put it in quotes there; it's actually called record size and has very little to do with device block size.)
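Purely to illustrate where each knob lives (names are placeholders): recordsize is a per-dataset property you can change at any time, while ashift is fixed per vdev when it is created or added:

  zfs set recordsize=1M tank/storj                         # dataset property; affects newly written blocks
  zpool create -o ashift=12 tank raidz1 sda sdb sdc sdd    # ashift=12 means 4K device sectors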
Generally, don't change any ZFS defaults unless you have a very good reason to (i.e., a measured and proven bottleneck). The defaults are designed to work well in the vast majority of circumstances. The storagenode is not a substantial enough workload to exhibit any bottlenecks on common hardware.
You may want to review this thread Notes on storage node performance optimization on ZFS
I would like to convert to ZFS now. I saw your post about improving ZFS performance. I want to merge my personal NAS data and my Storj "server" with several small nodes into one "big" server to cut electricity costs and optimize drive usage. Since I want to store my personal data on the ZFS array too, where redundancy is important, I'm wondering how much RAIDZ2 will slow down my Storj nodes. The inbound traffic isn't that much, so I'm mostly wondering how it will affect the filewalker and other Storj tasks, which are more read-heavy. Will it "clog" my nodes? I know Storj has its own redundancy, but I want to keep my nodes "safe": I don't want to rebuild the data, and I don't want to sit through the 12-month reduced-earnings "phase" again.
I don’t think so. If you are running one node per RAIDZ2 and not filling the pool too much of course.
I would not worry about your stuff slowing down the nodes. On the contrary, the nodes may slow down access to your data, but only in suboptimal pool configurations that you would want to avoid anyway.
Ensure metadata access is accelerated by some SSD; then nothing else will matter, your NAS will be lightning fast, and you won't notice the nodes are even there. The nodes won't be slowed down either: they will run pretty much as if you had an SSD array (see below).
Ideally you want a special device in the pool (most folks use mirrored enterprise SSDs; there was a recent sizing discussion), but an L2ARC cache will also work (most folks use the cheapest large SSD they can find; it's not critical and can fail safely). Also keep in mind that you can remove an L2ARC and a SLOG, but you can't remove a special device (not that you would want to).
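For completeness, the L2ARC route is just (device path is a placeholder):

  zpool add tank cache /dev/sdx    # add an L2ARC device; losing it is harmless
  zpool remove tank /dev/sdx       # and it can be taken out again at any time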
Some other guidance:
- figure out how many vdevs you need. One huge raidz2 is most likely not the best choice.
- raidz2 for any conceivable vdev is obvious overkill; it's just a waste of a slot, power, and money, and provides diminishing returns at exorbitant cost. Note that unlike with other arrays, when you replace a disk in a vdev with ZFS, you can keep the disk being replaced connected so it continues providing redundancy.
- if you have any clients generating synchronous IO (not Storj; disable sync for the Storj dataset), spend $10 on a 16GB Optane and just stick it in as a SLOG. It will help with that workload (Time Machine is that workload for me). See the sketch after this list.
- don't buy new and/or consumer hardware. Buy used enterprise disks, SSDs, memory, power supplies, etc., if you care about reliability.
- don't skimp on ram. Anywhere between 32 and 64GB shall be fine.
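The sketch referenced above, with placeholder pool, dataset, and device names:

  zfs set sync=disabled tank/storj                         # the storj dataset does not need sync semantics
  zpool add tank log /dev/disk/by-id/nvme-optane-part1     # tiny Optane partition as SLOG for the clients that do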
Thanks for your advice. In my case the node HDDs are only about 30% filled, so I want to put these nodes in a pool and use the remaining space for my own data, ditching my NAS, so I'll use the capacity more efficiently. In this case RAIDZ2 will still provide a better price-to-performance ratio, since it will be better utilized, and with ZFS 2.3.0 on Debian (6.12.0+bpo kernel) I can dynamically add drives to the pool to enlarge it in the future. I wanted to use RAIDZ2 so that if I enlarge the pool and stick a new, bigger drive into the array for resilvering, one older drive could still fail and my array would be safe. I planned on adding an SSD as L2ARC and SLOG. I chatted a bit with my friend Perplexity, and it told me the SLOG will hold inbound files smaller than 1MiB in the cache and write them to disk every 5 seconds as sequential writes, which makes it more efficient. Since Storj mostly writes small files, I think the 1MiB will do the job. I hope the SLOG will help my array perform well. I just want to be sure before copying all my nodes to my ZFS array.