I don’t think you could even start a pool with a missing vdev to recover some data. But yes, your node would be disqualified, for two reasons:
1. The pool would not start.
2. Even if you managed to start it, you would likely end up with lots of damaged files rather than mostly intact files plus a few missing ones. ZFS stripes large files across all vdevs, so each of those files would be missing a piece. At worst, every file would have a piece missing.
Oh, and the Storj network would repair the data just fine, but your node would be DQed.
The data should be recovered by Storj, but all repaired pieces will go to other nodes, never yours. So, kind of “yes” and “no” for your setup. Your node(s) will be permanently disqualified.
I think there is some misunderstanding here: the recommendation is not to run more than 12 disks in one vdev.
A pool can have as many vdevs as necessary.
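For example, a minimal sketch (pool name and device paths are placeholders): a single pool built from two 6-disk raidz2 vdevs stays under the per-vdev width guideline while the pool itself keeps growing.

```sh
# Hypothetical layout: one pool, two raidz2 vdevs of 6 disks each
zpool create tank \
  raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
  raidz2 /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl

# Later, the same pool can be expanded by adding another vdev
zpool add tank raidz2 /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr
```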
I had a random thought about a ZFS property that might improve performance:
`redundant_metadata=some` (instead of the default `all`)
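For reference, a minimal sketch of setting it, assuming a hypothetical dataset named `tank/storj`:

```sh
# Check the current value (defaults to "all")
zfs get redundant_metadata tank/storj

# Keep only the most critical metadata redundant
zfs set redundant_metadata=some tank/storj

# Note: this only affects metadata written after the change;
# existing metadata keeps its extra copies until it is rewritten
```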
In theory this makes files more fragile, because an entire file could be lost if one bad block lands in its now non-redundant metadata. But in the context of Storj, losing one file may not even be noticeable.
It seems like it might speed up metadata-heavy operations. Maybe.
I have it enabled on my drives and they are working normally, but I can’t tell whether it’s actually any faster.
Yes, I’m pretty sure the RAID 0 x500 drives comment was sarcasm, Alexey.
As for redundant_metadata=some: I think the idea is that with this setting the metadata really isn’t redundant any more. I’ve used it for about 24 hours now and haven’t really noticed a huge difference in my life.
Doesn’t sound like a profound research project. But since you don’t mention a complete rebuild of the metadata, like a send/recv operation, I think such a test is fairly useless.
(dragging my response to this over here: as mods are spicy about off-topic comments today… )
So it looks like used-space-filewalker started at 17:42… and all four satellites were completed by 18:03. So 21 minutes for around 1TB, which is probably 3-4 million files: very impressive! And thank you for sharing your logs.
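If anyone wants to check their own timings, a rough sketch, assuming a Docker node named `storagenode` (the exact log wording varies between versions, so match loosely):

```sh
# Pull the used-space filewalker lines and look for start/finish markers
docker logs storagenode 2>&1 \
  | grep used-space-filewalker \
  | grep -Ei "start|finish|complet"
```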
question here:
I’ve got a Seagate Exos X18, which is pretty loud. Which type of additional vdev could reduce IO and make that drive quieter? When I look at the statistics there are more writes than reads, so a ZIL (SLOG device) could help in this case. Or am I wrong?
A ZFS Intent Log (ZIL) is only used for synchronous writes, and a couple of months ago Storj nodes were switched to async: so adding a separate log device won’t help. Eventually writes must make it to disk, and they tend to be small, so you may not be able to reduce overall IO much… but you can still use an SSD with ZFS to speed up housekeeping tasks like used-space-filewalker.
Quite a few here have used a ZFS “special” metadata device: the millions of filenames/sizes can then be queried quickly from SSD instead of hitting your HDD, which reduces most Storj internal tasks to seconds (instead of hours/days). I think 5GB of SSD per TB of HDD is a common sizing: so even small SSD partitions/devices make a difference. Search the forum for more ZFS posts. Good luck!
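A hedged sketch of adding one to an existing pool (pool/dataset names and device paths are placeholders):

```sh
# Add a mirrored special vdev for metadata
# (losing the special vdev loses the pool, so mirror it)
zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1

# Optional: also store very small data blocks on the SSD
zfs set special_small_blocks=4K tank/storj

# Only newly written metadata lands on the special vdev;
# existing metadata moves over only as data is rewritten
```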
Or use an SSD as an L2ARC cache and set it to metadata only. The benefit is much the same, but you can remove it at any time and no redundancy is needed.
Over time it becomes similar, yes. But L2ARC only fills itself from ARC evictions… which is purposely slow, so it can take a long time (whereas a special metadata vdev has everything, 100% of the time). And by default L2ARC doesn’t survive reboots… so it will take time to warm up again after each restart.
You can enable persistent L2ARC now… but that can really slow down boot times. And yes, you want to mirror the special metadata vdev for durability, while L2ARC is disposable. In all cases, if metadata is already in ARC it will be served from there first anyway. RAM always wins.
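For completeness, a sketch of the L2ARC variant under the same assumed pool/dataset names:

```sh
# Add a cache (L2ARC) device; it is disposable, so no redundancy needed
zpool add tank cache /dev/nvme0n1p2

# Cache only metadata in L2ARC for this dataset
zfs set secondarycache=metadata tank/storj

# Persistent L2ARC (OpenZFS 2.0+): rebuild the cache contents after reboot
echo 1 > /sys/module/zfs/parameters/l2arc_rebuild_enabled
```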