ZFS optimizations for usage with storagenode

Avoid using ZFS on a single HDD or a single mirror, as fragmentation can lead to slower performance over time. You can effectively operate SMR drives in RAIDZ groups of 4 drives, and you'll likely see satisfactory performance once you have 3-4 such RAIDZ groups. Shucked SMRs are a suitable choice for large ZFS setups. However, it's crucial to set the record size to 1M and store metadata on separate SSDs. For more detailed information, please refer to this link: Post pictures of your storagenode rig(s) - #648 by d4rk4.
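
For reference, a minimal sketch of the two settings mentioned above, assuming a pool named `tank` and a dataset `tank/storagenode` (both names are placeholders) plus two spare SSDs for the metadata special vdev:

```
# Use a 1M record size for newly written files on the storagenode dataset
# (existing files keep the record size they were written with).
zfs set recordsize=1M tank/storagenode

# Add a mirrored "special" vdev on SSDs so metadata lands on flash instead
# of the SMR HDDs. A special vdev is pool-critical: losing it loses the
# pool, hence the mirror.
zpool add tank special mirror /dev/disk/by-id/ssd1 /dev/disk/by-id/ssd2

# Optionally route small blocks to the special vdev as well:
zfs set special_small_blocks=128K tank/storagenode
```

Note that the special vdev only receives metadata written after it is added; existing data stays where it is.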

Shouldn't it get increased performance from the ZFS cache, though?

Only on reads, and only if it's frequently accessed data - the biggest speed-up for something like Storj would be a special vdev on SSDs, imo. This wouldn't help with writes though - ZFS doesn't have a write cache, and the SLOG is often mistaken for one.

Oh, I don't use it for Storj… I had to find a different use for my SMR drives, so I'm just using them for VMs. And I do believe it speeds up reads and writes.

Sure, it’s not exactly a write cache in the traditional sense, but it still speeds up sync writes.
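
As a quick illustration of that point, attaching a SLOG is a one-liner; the pool name and device paths below are placeholders:

```
# Add a dedicated log (SLOG) device so synchronous writes are acknowledged
# from fast flash instead of the main HDD vdevs. Only sync writes benefit;
# async writes still go through normal transaction groups.
zpool add tank log /dev/disk/by-id/nvme-slog0

# Or mirror the SLOG if losing in-flight sync writes on device failure matters:
# zpool add tank log mirror /dev/disk/by-id/nvme-slog0 /dev/disk/by-id/nvme-slog1
```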


I just want to point out that these speeds wouldn't be possible without ZFS; this is just a simple HDD speed test.
[screenshot: disk speed test results]

Please read about transaction groups: ZFS fundamentals: transaction groups – Adam Leventhal's blog

Transactions are assembled in memory and written at once. That’s a form of caching for you.
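
You can watch those transaction groups being assembled and synced; this is an observational sketch assuming Linux OpenZFS (the exact kstat location can vary by version) and a placeholder pool name `tank`:

```
# Per-pool transaction group history: each line shows a txg's birth time,
# state (open / quiescing / syncing / committed), dirty bytes, and how long
# it took to sync out.
cat /proc/spl/kstat/zfs/tank/txgs

# Watch the writes hit the disks in bursts as each txg syncs:
zpool iostat -v tank 1
```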

My understanding from the Level1Techs and TrueNAS forums over the years is that most people don't really consider transaction groups to be a proper write cache, especially given the volatility of data held in memory. It's a pseudo-cache (one that actively throttles you if you try to write too much before it flushes data to disk). The context was about SLOG anyway, which definitely isn't a cache.


It depends on what you call a "proper write cache". If the purpose of a cache is to avoid bothering the disks with every little IO and instead batch writes together into one big clump, then transaction groups accomplish exactly that. By tuning the transaction group size to the array's performance you can achieve excellent load balancing and smoothing.

Of course, if the expectation of write caching is to be able to write 72TB to an SSD during the day and then have it transferred to HDD at night, then no, it isn't that; but that is quite an obscure use case: most people want performance at finer than one-day granularity.

So as long as your disk subsystem can handle batches of more or less sequential IO, the sustained performance improvement from batching provides a solid uplift, and it can be scaled further by increasing the transaction group size and adding vdevs to sustain the writes.
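
A sketch of what "adjusting the group size" can look like on Linux OpenZFS; the values below are purely illustrative, not recommendations:

```
# Allow more dirty data to accumulate per transaction group (in bytes),
# so each sync is a larger, more sequential burst:
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max

# Sync at most every 10 seconds (default 5) if the dirty limit isn't hit first:
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout

# To persist across reboots, put the same tunables in /etc/modprobe.d/zfs.conf:
# options zfs zfs_dirty_data_max=8589934592 zfs_txg_timeout=10
```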

Definitely agree. And for the storagenode use case it is completely unnecessary: the cost of data loss is negligible, and the cost of database corruption due to async writes is… zero.
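
Following that reasoning (and only if you accept the trade-off), turning off sync writes for the node's dataset is a single property; `tank/storagenode` is again a placeholder name:

```
# Treat all writes on this dataset as asynchronous. ZFS itself stays
# consistent, but the last few seconds of acknowledged writes (including
# the node's SQLite databases) can be lost on a crash or power failure.
zfs set sync=disabled tank/storagenode

# Revert to the default behaviour:
# zfs set sync=standard tank/storagenode
```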


While it's true that ZFS can offer some performance benefits, it's important to consider the potential downsides when using it on a single HDD.

ZFS is a powerful file system and volume manager, but it can lead to increased fragmentation over time, especially when used on a single disk. As fragmentation grows, so does latency, which can significantly impact the performance of your VMs. This might not be immediately noticeable, but after about a year of use, you might find your VMs experiencing high IO wait times.

To mitigate these issues, it’s generally recommended to use ZFS in a RAIDZ configuration when working with HDDs. RAIDZ is a data/parity distribution scheme like RAID 5, but uses dynamic stripe width. Every block is its own RAID stripe, regardless of blocksize, resulting in every RAIDZ I/O being a full-stripe write. This can help to maintain performance and reduce the impact of fragmentation.
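
For completeness, a minimal sketch of creating such a pool; the pool name and disk paths are placeholders:

```
# A single RAIDZ1 vdev over four HDDs: one drive's worth of parity,
# dynamic stripe width, every write is a full-stripe write.
zpool create tank raidz1 \
  /dev/disk/by-id/hdd0 /dev/disk/by-id/hdd1 \
  /dev/disk/by-id/hdd2 /dev/disk/by-id/hdd3

# Additional RAIDZ groups can be added later to spread the IO load:
# zpool add tank raidz1 /dev/disk/by-id/hdd4 /dev/disk/by-id/hdd5 /dev/disk/by-id/hdd6 /dev/disk/by-id/hdd7
```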

I hope this information is helpful and leads to smoother operation of your VMs.

SMR drives cannot speed up anything. Ever. The only thing they improve is cost per TB of data stored, at the expense of:

  • horrendous write speed
  • horrendous IO limits (besides handling your writes, the drive has to reshuffle data to recover from the performance shortcuts it took earlier)
  • power failure while writing a file today can corrupt another, seemingly unrelated file you wrote a year ago and never touched since (just like with modern SSDs, but without any performance benefit in return)

SMR drives are only suitable for slow data accumulation and occasional small reads, in non-redundant configurations; i.e. as a cheap USB drive sold to uninformed consumers as a great deal ("look, an 8TB backup drive for $45!") so they can pretend they have a backup.

If your requirements go beyond that and your use case is not the one described above, throw them away. They are e-waste from the date of manufacture.

You need to read the full context: I do not use SMR drives for speed. I have NVMe drives for speed.

I’ve read the post three times… sorry if I still misunderstood.

What does “it” stand for here?

Referring to a post about ZFS.


SMR is bad for sustained write IOPS.

RAID reduces write IOPS because all disks are written to at the same time, which means fewer IOPS on the same hardware compared to using the disks individually.

Recommending consumer SMR HDDs for any RAID is just a bad recommendation.

ZFS has advantages and disadvantages no matter how it's run…
Running ZFS for Storj on a single HDD isn't a great idea, as many of the advantages of ZFS are negated by the limited hardware.

Running a 1M recordsize will most likely cause caching issues.

Of course, all of this depends on how much one strains the hardware employed.
It's easy to keep something running if the hardware resources are lightly used.

The ZFS defaults are the defaults for very good reasons and should be used in most cases.
Personally, I find 64K to be the optimal recordsize for reducing fragmentation.
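
If you want to try that, it's the same property as earlier with a smaller value; the dataset name is again a placeholder, and it only applies to files written after the change:

```
# Use a 64K record size for new writes on this dataset:
zfs set recordsize=64K tank/storagenode

# Keep an eye on free-space fragmentation at the pool level:
zpool list -o name,size,allocated,free,fragmentation,capacity tank
```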
