Notes on storage node performance optimization on ZFS

Yeah, using partitions is pretty sweet! In the special-metadata example you can have a couple of SSDs… then carve off pairs of partitions and mirror them for your pools. Like, if your first Storj HDD is 10TB… make a pair of 50GB partitions and mirror them as your special-metadata vdev. Next HDD is 18TB? Slice off a pair of 90GB partitions, etc.

Using the 5GB/TB guideline, a pair of 1TB SSDs should cover a 24-bay JBOD of 8TB-ish HDDs. You get the idea…

5 Likes

So you are saying you could take two partitions and make a mirrored metadata vdev out of them? Or do you first make a mirror of the whole SSDs and then take a slice of that mirror instead?

BTW, enabling L2ARC for my Storj pool seems to have reduced the amount of noise from my server-grade HDDs.

You’d typically leave both SSDs separate… then use a tool like fdisk/sgdisk to make a partition on each… then use those two partition names in your “zpool add” command and specify them to be mirrored. You don’t need to mirror the entire drives together first.
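
Something like this, roughly (the pool name tank, the /dev/sdx and /dev/sdy device names, and the 50GB size are all placeholders for a 10TB data drive; /dev/disk/by-id paths are safer in practice):

```
# Carve a 50GB partition off each SSD
sgdisk -n 1:0:+50G /dev/sdx
sgdisk -n 1:0:+50G /dev/sdy

# Add the two partitions as a mirrored special (metadata) vdev.
# With special_small_blocks left at its default of 0, only metadata
# (not file data) is stored on this vdev.
zpool add tank special mirror /dev/sdx1 /dev/sdy1
```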

2 Likes

Yep, I’ve done this. I had a single large SSD that wasn’t really being utilized much, and now it has about 8 small partitions being used as metadata L2ARC for 8 nodes.
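
For reference, adding one of those partitions looks roughly like this (the pool name and partition path are examples):

```
# Attach one SSD partition as an L2ARC (cache) device for one node's pool
zpool add node1 cache /dev/disk/by-id/ssd-part3

# Optionally cache only metadata in the L2ARC for this pool,
# so piece data doesn't churn the cache
zfs set secondarycache=metadata node1
```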

2 Likes

That’s because InnoDB’s default page size is 16 KB. InnoDB is the default MySQL storage engine.

Reading through this topic, has anyone done something like this?

  • a bunch of hard drives
  • 2 PLP SSDs, partitioned:
    • pairs of partitions (5 GB per TB of each spinning-rust drive) mirrored as special vdevs
    • a pair of partitions for a mirrored vdev for the databases

1 Like

Databases can be forced onto an SSD special vdev by setting special_small_blocks equal to or larger than the dataset’s recordsize; there is no need to partition the disks for them.
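
In ZFS terms that is something like the following (the dataset name and the 64K value are examples; the key point is that special_small_blocks needs to be at least the dataset recordsize, so every block counts as “small” and lands on the special vdev):

```
# A dedicated dataset for the node databases
zfs create tank/databases

# Blocks at or below special_small_blocks go to the special vdev,
# so recordsize <= special_small_blocks sends the whole dataset to SSD
zfs set recordsize=64K tank/databases
zfs set special_small_blocks=64K tank/databases
```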

1 Like

So just put the databases in their own dataset with a small recordsize? I am not keen to store small files in the special vdevs. The old hardware I’m thinking of using consists of 8x 3 TB ~5900 RPM disks and 16 GB of memory (I might be able to swap the motherboard for one that takes 32 GB), and I don’t want to spend a lot on SSDs.

Carving off pairs of SSD partitions as special-metadata mirrors for each Storj HDD works very well. I wouldn’t worry about adding small-files support unless you have lots of spare flash space… as the file metadata is the important part.

I’d just have one separate mount on the SSD and bind/redirect all your databases to directories inside it. No need to mirror it, since Storj databases are 100% disposable (they only hold stats).
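
With the Docker setup that could look something like this sketch (paths are examples, the usual WALLET/EMAIL/ADDRESS/STORAGE environment variables are omitted, and the storage2.database-dir option name is from memory of the storagenode docs, so verify it before relying on it):

```
# Bind-mount an SSD directory into the container for the databases
docker run -d --name storagenode \
  --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
  --mount type=bind,source=/mnt/storj/node1,destination=/app/config \
  --mount type=bind,source=/mnt/ssd/databases/node1,destination=/app/dbs \
  storjlabs/storagenode:latest

# plus, in the node's config.yaml (assumed option name):
# storage2.database-dir: /app/dbs
```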

1 Like

Similar:
I have a single SSD that I have partitioned so I can use a separate L2ARC for each node’s hard drive.

And I already relocated the Storj databases to an SSD via the built-in Docker options.

Incidentally, I learned something about my own ZFS setup (8 nodes on 8 drives, and one SSD with an L2ARC for metadata).

When I had less RAM allocated to my NAS software in general, about 20GB total, I had about a 10GB ZFS ARC and all the drives suffered from consistent, moderately high utilization. The SSD with the L2ARC wasn’t really helping out that much.

I allocated more RAM, about 30GB total, so that the ZFS ARC was about 20GB. Utilization across all the hard drives then went way down, and I could see more activity on the SSD.

If I read arc_summary correctly, just the headers for the L2ARC data were taking up about 5GB of RAM, so maybe I was RAM-constrained earlier and something bad was happening.

TL;DR: the ZFS L2ARC may indeed require “enough” RAM allocated to the regular ARC before it can help that much.
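
If you want to check the same thing on your own box, something along these lines works (the 20GiB cap is just my example):

```
# Look at the "ARC size" and L2ARC "Header size" lines
arc_summary

# Raise the ARC ceiling to ~20 GiB on the fly (20 * 1024^3 bytes);
# set zfs_arc_max in /etc/modprobe.d/zfs.conf to make it persistent
echo 21474836480 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
```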

2 Likes

How big are databases per TB, generally? I think that’s the last thing I need to know for my hardware configuration.

Storj databases don’t use much, but the size seems to be all over the place; I’ve seen some as high as 1.5GB per used TB. If I were going to create a partition to handle the dbs for multiple nodes… I’d size it at 3GB per raw TB. So if it was something like 10x 10TB HDDs… that’s 100TB… so I’d make a 300GB SSD partition to hold those 10 nodes’ databases.

It would probably never be more than 25% used… but then I’d never have to think about it again.

Edit: If you’re tight on space… well… those databases only hold stats… so they can be erased at any time and won’t affect your payouts (just the numbers in the GUI, until the next month begins).

2 Likes

Thank you! Yes, I will be tight on SSD space. I’m mostly using old hardware (8x 3 TB spinning rust), but bought a couple used 150GB DCS3520 drives. Now I see I probably should have gone for slightly bigger drives. I might see if I can throw an NVMe adaptor into the system and find something suitable for databases, since it doesn’t matter if it dies.

Which gets more write IOPS, the special vdevs or the databases? They’re both effectively databases for the same data, so I’m thinking they should be on the same order of magnitude.

With the new way of managing the piece expirations, my database usage is around 50MB per node. Even my oldest node is only 40MB.

A 150GB SSD in a redundant configuration for 8 nodes should be more than enough.

If you are using badger cache with piecestore, you’ll need to use more space. Maybe 500MB per TB stored.

Ah, so almost nothing then. Nice.

Would I even want a badger cache if the metadata is on SSDs?

If you are using piecestore and have the extra space and a decent amount of RAM, I would personally recommend doing so. If you are using the new hashstore backend, then no, it has no effect.

2 Likes

No, if you have metadata on SSD, filewalkers finish in seconds or maybe a few minutes. No need for badger :slight_smile:

3 Likes

Thank you for confirming.

Now with the new databases being so small, I could make my life simpler by making a separate dataset for each database, then setting special_small_blocks=512 on that dataset, and it should go to the special device SSDs, correct?
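
If you try it, you can verify where the writes actually land by watching the special vdev (the pool name is an example):

```
# Per-vdev capacity; the special mirror shows its own ALLOC figure
zpool list -v tank

# Live per-vdev I/O every 5 seconds, to confirm database writes hit the SSDs
zpool iostat -v tank 5
```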