How much "slower" is ZFS?

pvmove is strictly better than dd for moving, since it lets me do a live copy. Plus, I also get access to tools like lvmcache when necessary.
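For context, a live move with pvmove looks roughly like this (device and volume group names are placeholders):

```
# Add the new disk to the volume group, then migrate extents while the LV stays mounted.
pvcreate /dev/sdb1
vgextend vg0 /dev/sdb1
pvmove /dev/sda1 /dev/sdb1   # live copy; can be interrupted and resumed
vgreduce vg0 /dev/sda1       # drop the old PV once it is empty
```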

1 Like

I have had a horrible experience with ZFS on Storj. I have a large drive array of 8 drives, about 40TB of storage total, and figured I could use the extra space to make money. I had the same idea too: if a drive dies, I don’t lose my node and don’t have to start from scratch, earning less for the first 6 months.
Storj seemed to work okay, but over time zpool scrub would find issues with files, and eventually the zpool would completely lock up, to the point where I had to mount it read-only, back everything up off of it, and rebuild it from scratch.
This happened about 3 times over a bit more than a year. I’ve changed SATA cards to more advanced ones, added a better, more powerful power supply, moved drives around; nothing helped. Storj just seems to overwhelm ZFS pools for some reason.
Since I restarted it from scratch on just one of my spare drives running ext4, Storj has had no issues, and my zpool has no issues.
I have no idea what’s actually causing it, and maybe the devs will figure something out, but for now I don’t trust it. The last time this happened was last summer. Also, copying the Storj data from the read-only pool to a backup would take a week for an 8TB node due to the millions of files.

This does not even approach the territory of “large drive or array”.

Then your disks are dying, or you have bad RAM, or your power supply is trash, since you have already replaced the data cables (hopefully twice, because you can still replace a bad cable with another bad cable. It’s very hard to find SATA cables that are not abhorrent trash; you would likely need to find used ones from old enterprise computers. None of the crap on Amazon is any good). But this has nothing to do with ZFS. It’s doing its job. ext4 does not have scrubbing or data checksumming, so the issues are still there; you just have no way to check. You essentially silenced the smoke alarm; the house is still on fire.

The probability that you uncovered a zfs bug is zero for all intents and purposes.

I would retest and reevaluate your setup. For all I know it’s now silently corrupting your data.

How much RAM was on the system? Under 8 GB you would not get any reasonable performance. Moreover, you would likely get hangs and crashes when the disks can’t keep up with the IO pressure. When set up correctly, the filewalker should result in zero IO to the magnetic disks; everything should be served from RAM or SSD.

ext4 has lower (but not by much) resource requirements. You still need enough RAM to fit the metadata to get any reasonable performance.
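A quick way to sanity-check whether metadata is actually being served from RAM (assuming a Linux OpenZFS install where the bundled arc_summary and arcstat tools are available):

```
arc_summary | grep -A 3 "ARC size"   # how much RAM the ARC is currently using
arcstat 5                            # live hit/miss rates; misses near zero during
                                     # a filewalker run mean metadata comes from RAM
```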

1 Like

This does not even approach the territory of “large drive or array”.

Larger than 3 or 4 drives. Obviously not a datacenter, but more file storage than most people have. Reading and writing goes across all 8 drives at the same time, btw, with the pool structure being Stripe( RaidZ1(8TB, 8TB, 8TB, 8TB), RaidZ1(8TB, 8TB, 8TB, 8TB) ), where one drive in each RaidZ1 can fail and it’ll keep working. I did it this way so I only need to replace 4 drives to increase storage.
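For reference, a pool laid out like that would be created roughly as follows (device names are placeholders; in practice you would use /dev/disk/by-id paths):

```
# Two 4-disk raidz1 vdevs striped together into one pool.
# Each vdev tolerates the loss of one of its four disks.
zpool create tank \
  raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd \
  raidz1 /dev/sde /dev/sdf /dev/sdg /dev/sdh
zpool status tank   # shows both raidz1 vdevs under the same pool
```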

Then your disks are dying, or you have bad ram, or your power supply is trash,

Yep, I thought of that. Changed the RAM and tested both sets. Checked the disks with SATA diagnostics and other tests. No issues, and no issues when just serving files without Storj. The power supply was replaced and the wattage tested; it’s providing more than enough (1200 W ASUS).

It’s very hard to find sata cables that are not abhorrent trash;

Changed those to high-quality ones. Then I dumped the cheap SATA card and went to SAS, now using high-quality SAS cables. Still problems with Storj, none without it.

Ext does not have scrub nor checksumming, so issues are still there, but you don’t have a way to check.

It’s now on a separate single drive, on its own power cable, on its own SATA port (not SAS), and I run scans on it once in a while. No problems so far. No problems with my ZFS pool either.

The probability that you uncovered a zfs bug is zero for all intents and purposes.

I’m not so sure. It’s common to find warnings saying “DO NOT USE ZFS FOR DATABASES!”. Scrubs of my ZFS data consistently return clean now. Before, a scrub would find issues with a random one or two large files, but when I scrubbed again those files would turn out to be perfectly fine. This happened almost every time, since the pool was doing Storj work while running the monthly scrub, until it finally locked up after many months. Even after it ended up in read-only mode, all the data was fine and nothing was corrupt when I backed it up and checked it. Just the file table would get errors, preventing ZFS from mounting it normally. Possibly something being overwhelmed or not writing correctly.

How much ram was on the system?

32 gigs. Not much running besides the OS, Storj, and a few media file servers.

As I said, nothing changed hardware-wise. Storj was moved to its own ext4 HDD running on the old SATA cable, and now the ZFS storage is fine, and Storj is fine.
Short of changing the motherboard and CPU, there’s not much left to change.

4 Likes

This is very interesting. What ZFS version did you have and which features were enabled? It would be good to figure out what went wrong there.

I have pretty much the same configuration (three 4-drive raidz1 vdevs instead of two) and run two storagenodes per machine, on two different machines (Supermicro X9 and X10). One of them also has exactly 32 GB of RAM. Neither has experienced anything remotely like what you describe, even throughout the test data waterfall stage last year. TrueNAS Core 13 is the OS on both.

I’ve never heard the recommendation against databases on ZFS, but it likely stems from the higher IOPS requirements of redundant vdevs. The solution is to force them onto SSD by adjusting the small-blocks size. This massively helps even with the Storj dashboard performance, which is not unexpected: small transactions are best served from SSD, large ones from disks, to hide latency. That’s the intended mode of operation.
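A minimal sketch of that approach, assuming a pool named tank with a special (SSD) vdev already attached and a hypothetical dataset holding the node databases:

```
# Records of 64K or smaller are allocated on the special (SSD) vdev;
# larger records keep going to the raidz disks.
zfs set special_small_blocks=64K tank/storj-db
zfs get special_small_blocks,recordsize tank/storj-db
```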

1 Like

To answer the original question: on HDDs, ZFS is actually quite good in terms of performance. Most of the optimization it does is centered around HDDs.

2 Likes

Are we sure that ZFS is more hardware-intensive than ext4 with the primarycache=metadata and secondarycache=metadata settings? No resources are wasted caching file data (which Storj doesn’t need); it just focuses on metadata. If you want to lighten the CPU load you can turn off compression.
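For reference, those are per-dataset properties; a hedged example for a hypothetical dataset tank/storj:

```
# Cache only metadata in ARC (RAM) and L2ARC (SSD), not file contents.
zfs set primarycache=metadata tank/storj
zfs set secondarycache=metadata tank/storj
# Optional: turn off compression to save a little CPU (the default lz4 is usually cheap).
zfs set compression=off tank/storj
```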

Likely on par or better. However, if you used an SSD with LVM, you would likely get the same outcome, or maybe in some cases a little better. The difference only matters if you use some redundant array: in that case ZFS is able to correct corrupted data.

1 Like

I am using ZFS dRAID1 (one parity, zero spares; equivalent to Z1 but better, I think).
I don’t have an extensive period of use to say how well it works.
ZFS has more parameters to look after, but careful tuning makes it very good.
The integrity checks and the ability to self-repair are very good.
Performance-wise it should be on par.

Normally ZFS requires more RAM, but in the Storj case we don’t need dedup, and my memory is actually not hogged at all.

So yes, I am quite satisfied with ZFS so far and I recommend it, especially since one can keep the main storage on HDDs and use SSD partitions for the filesystem metadata and, optionally, for the very small pieces (special vdev), plus another SSD partition for the SLOG.
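A rough sketch of attaching those to an existing pool (partition names are placeholders; a special vdev should normally be mirrored, because losing it loses the whole pool):

```
# SSD mirror for metadata (and, via special_small_blocks, the tiny pieces):
zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1
# Separate SSD partition as the SLOG (only helps synchronous writes):
zpool add tank log /dev/nvme0n1p2
```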

This all means that the writes to the rotational drives end up as batched, sequential writes.

I did not go for L2ARC yet, since Storj does not have much hot data access.
I keep the DBs on an SSD pool too, and zfs send/receive is an easy way to incrementally back up the DBs as well.
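A minimal sketch of that kind of incremental backup, with hypothetical dataset and snapshot names:

```
# First run: full send of a snapshot to the backup pool.
zfs snapshot ssdpool/storj-db@backup-1
zfs send ssdpool/storj-db@backup-1 | zfs receive backuppool/storj-db
# Later runs: send only the delta between the last two snapshots.
zfs snapshot ssdpool/storj-db@backup-2
zfs send -i @backup-1 ssdpool/storj-db@backup-2 | zfs receive backuppool/storj-db
```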

I’m also using ZFS and have my piece storage on the main pool, to use HDD space that is deployed but otherwise unused.
I have the DBs and the badger cache on an SSD.

About L2ARC: I considered it but discarded it for now.
I use a dRAID1 pool (4 devices, 6 TB each); dRAID1 is a bit like Z1.
I use 2 special vdevs (an SSD plus an SSD partition).
I use a SLOG (ZIL) on another SSD partition. This way I can write to the disks only once a minute, not more frequently, to maximize rotational drive lifetime (see the sketch at the end of this post for how that interval is typically set). The SLOG persists data immediately (at least on fsync, with sync=standard in ZFS).
The L2ARC needs special tuning and I am not sure it’s worth it for Storj.
By default ZFS fills the L2ARC very slowly, even though it’s an SSD.
You could have a special vdev send files smaller than, say, 16 KB to SSD.
This will put many files on the SSD. Even a majority of the files (pieces) will occupy only a fraction of the space, since they are so small.
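Regarding the once-a-minute flush interval mentioned above: the post doesn’t say how it is set, but on Linux OpenZFS this is typically controlled by the zfs_txg_timeout module parameter (shown here as an assumption, not as this poster’s exact setup):

```
# /etc/modprobe.d/zfs.conf  (persistent across reboots)
options zfs zfs_txg_timeout=60

# Or change it at runtime:
echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout
```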

A special vdev on SSD does all the work for you, since it holds the metadata; that’s what gets hammered by Storj. The L2ARC is just the poor man’s cousin (and doesn’t require redundancy). I forgot to mention that my L2ARC is configured for metadata only anyway, so after it fills it’s similar to a special vdev for reads only, not writes.

2 Likes

It’s astonishing how well it works. Maybe not interesting for a single-node system… but if you have a few HDDs (or a system otherwise tailored for Storj) it’s well worth having a pair of SSDs that you can partition into special-metadata mirror devices (and have them hold the DBs while they’re at it). They don’t have to be large.

Like… you never see used-space-filewalker in your process lists anymore… because it completes in less than a second.

1 Like

For me it was completing in 10 min; likely due to my anemic under-clocked CPU.

IOPS were peaking at 10k at about the 10-second mark, then subsiding exponentially as stuff got progressively cached in RAM.

It’s noteworthy that the filewalker walks in a single thread. That makes total sense most of the time, so I have 47 idle cores and one core saturated with storagenode work.

(Please let no one get ideas about parallelizing the filewalker! It’s just an observation; it’s not a problem in any way. There is nowhere to hurry, and for the vast majority of cases a single thread is better!)

I went another way: a single pool, with a pair of PCIe SSDs holding the metadata for the entire pool; if I want a specific dataset to reside on the SSDs exclusively, I crank the small-blocks property up to be larger than or equal to the record size.
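A minimal sketch of that trick, with hypothetical names: when special_small_blocks is at least as large as recordsize, every data block of the dataset counts as “small” and is allocated on the special (SSD) vdev.

```
zfs set recordsize=128K tank/databases
zfs set special_small_blocks=128K tank/databases   # >= recordsize: all blocks land on the special vdev
```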

Which “recordsize” (block size) do you recommend for a dedicated Storj ZFS dataset? The default on HDDs is around 4K, while the ZFS default is 128K. Should I set it to 4K on my Storj dataset to improve performance?

Leave the record size at the default, or set it to 1M; it does not matter. Leave compression enabled. Unlike other filesystems, ZFS does not waste space: if the record size is 1M and a file is 5K, it will take roughly 5K on disk.

(If you are referring to the block size of the various SSDs in your pool, set it to 4K with the ashift parameter when adding them. But since you mentioned the 128K default, that is the record size. I’m not sure why you put it in quotes there; it’s actually called recordsize and has very little to do with the device block size.)
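To make the distinction concrete (hypothetical device and dataset names; ashift=12 corresponds to 4K sectors and can only be chosen when a vdev is created or added):

```
# Device block size: fixed per vdev via ashift at creation time (2^12 = 4K).
zpool create -o ashift=12 tank raidz1 /dev/sda /dev/sdb /dev/sdc
# Record size: a per-dataset property, changeable at any time (affects newly written blocks only).
zfs set recordsize=1M tank/storj
```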

Generally, don’t change any ZFS defaults unless you have a very good reason to (i.e., a measured and proven bottleneck). The defaults are designed to work well in the vast majority of circumstances. A storagenode is not a substantial enough workload to exhibit any bottlenecks on common hardware.

You may want to review this thread: Notes on storage node performance optimization on ZFS

1 Like