How to determine the required size for ZFS special VDEV?

I made a second dataset for databases with recordsize=16K and special_small_blocks=16K. So all metadata + complete databases should go to special device but no small files from other folders.
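To spell out the routing rule this setup relies on: as far as I understand the OpenZFS property documentation, metadata always goes to the special class, and a data block goes there when its size is less than or equal to special_small_blocks. Below is a minimal Python sketch of that rule. The function name is hypothetical, and it ignores compression and the fallback to the normal vdevs once the special device is full.

```python
KIB = 1024

def goes_to_special(block_size: int, special_small_blocks: int,
                    is_metadata: bool = False) -> bool:
    """Simplified model: would this block be allocated on the special vdev?"""
    if is_metadata:
        # Metadata is always a candidate for the special class.
        return True
    # special_small_blocks=0 (the default) means no data blocks qualify.
    return special_small_blocks > 0 and block_size <= special_small_blocks

# Database dataset: recordsize=16K, special_small_blocks=16K
print(goes_to_special(16 * KIB, 16 * KIB))

# Another dataset left at the default special_small_blocks=0
print(goes_to_special(4 * KIB, 0))
```

Under this model every 16K database record qualifies, while small files in datasets that keep the default stay on the HDDs.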

Already some results?

Not yet. The zdb runs for ages.

I have intermediate results from another machine: I looked at how much space was allocated on the metadata device, then changed the record size of the node’s datastore from 128k to 512k, did a send/receive, deleted the old copy, waited for garbage collection to finish, and checked the special device allocation again. There was no noticeable difference. The allocation increased in line with the roughly 1 TB of data that was added during this time. The node was around 5TB, so this did not replicate your result.

I’m waiting for the garbage collection to complete on the other machine — I’ve canceled send/receive mid-process and now everything transferred so far is in garbage. It takes a while to reclaim space for me to try again.

By the way, since zfs-2.2.0 there is no longer any need to jump through hoops: the amount of metadata is reported by the zdb tool directly. What does the ZFS Metadata Special Device do? · openzfs/zfs · Discussion #14542 · GitHub

this 5GB/TB would do the trick with LVM metadata pool cache for EXT4

I don’t understand you: 5GB/TB special devs for ZFS-metadata. What has LVM to do with it (unless you create the logical partitions used for this purpose with LVM)?

Ext4 should stay out of the equation in any case here: 1) don’t mix filesystems, 2) LVM cache is just not filesystem specific and will always underperform in comparison with filesystem specific optimizations.

I may use 6GB/TB in the future. I have a 5GB/TB node copying now where the raw pool is like 8.7% full and the metadata device is 8.9% full (so if nothing changes the metadata space will run out before the HDD fills). But this is very sensitive to number of small files… like before the copy completes I may see 50%-HDD/45%-metadata: I’m somewhat at the mercy of randomness.

On other nodes 5GB/TB works. It’s just tight. I’m not a large enough SNO to be able to say anything for sure. And you only really care as the HDD approaches 100%-full.
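A rough way to check which side runs out first is to extrapolate the current fill ratio linearly. This is only a sketch: the function name is made up for illustration, and it assumes the mix of small and large files stays the same as the pool fills, which the post above says it may not.

```python
def projected_special_fill(hdd_fill_pct: float, special_fill_pct: float) -> float:
    """Estimate special-device fill (%) at the point the HDDs reach 100%.

    Assumes metadata keeps growing in the same proportion to data.
    """
    if hdd_fill_pct <= 0:
        raise ValueError("hdd_fill_pct must be positive")
    return special_fill_pct / hdd_fill_pct * 100.0

# Figures from the post: HDD 8.7% full, metadata device 8.9% full
print(round(projected_special_fill(8.7, 8.9), 1))

# The hoped-for later state: HDD 50% full, metadata 45% full
print(round(projected_special_fill(50, 45), 1))
```

A result above 100 means the special device would fill before the HDDs do if nothing changes.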

What is your recordsize?
I found the metadata to be significantly smaller with a recordsize of 512k or higher. And I also set redundant_metadata=some.

My meta data is actually 2.5-3GB per TB stored.

Yeah I saw the same thing at 128… so use 512 now too. I haven’t touched redundant_metadata.

But the speed increase is crazy! I did a “time ls -lR | wc -l” on a satellite blob folder and it said it traversed 13 million files in like 3.5 minutes. I have a couple small nodes and used-space-filewalkers happen very quickly now. Sweet!

Did you change it, or was it set at the creation of the pool? Because changing it now will not affect existing data in the pool, only future writes/changes.

When I was testing I was making new pools. I’d create them and attach metadata before even creating a directory on them or altering permissions (then untar a sample node into them and watch what happened). Because 5GB/TB worked I’m moving everything over from ext4 now.

I know LVM caching can work. And tailoring L2ARC towards metadata. But having all metadata reads and writes on SSD, 24x7, with no cache to ever ‘warm up’… seemed the smarter way to go.

If I paid for SSD space I want it working for me in a tangible way. It should soak up all the filewalker housekeeping IO and leave the HDDs to only handle real ingress/egress.

Also my way of thinking, although I make sure I use two mirrored special devs (so 2x5GB/TB) to avoid a single point of failure in case the SSD dies.

to chime in, special metadata vdev on SSD also seems like the best performance option for me.

However, i’m not doing it because:

  1. I only have one SSD available so no mirroring for redundancy

  2. even if I DID have mirroring, I have like 7 hard drives, so if they all depended on this one mirrored vdev it sets me up for a catastrophic failure if I mess something up. which I would.

You could also consider using a small external SSD for that.

Besides, L2ARC on SSD might also already be a big improvement.

If you have only one HDD attached to your system, I could imagine you’re accepting the risk. Also because SSDs have already been proven more durable than HDDs. But if you’ve got multiple drives, I would really advise rethinking this one big point of failure for baby drives.

Looks like today on every server that hosts nodes I consistently see about 11GB of special device usage per TB of storage (as reported by zpool list -v). This is about 1.1%, roughly 36x more than the commonly discussed ballpark of 0.03%.

On the other hand, if we go by “Metadata Total” reported by ZDB, the percentage is about 0.1%. I’m not sure why there is an order of magnitude discrepancy between metadata total reported by zdb (which is supposed to be a source of truth) and allocated size on the special device.
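The arithmetic behind those percentages can be checked with a few lines of Python. This is just a sanity check of the figures quoted above; the helper name is hypothetical and it uses decimal units (1 TB = 1000 GB).

```python
def gb_per_tb_to_percent(gb_per_tb: float) -> float:
    """Convert a GB-per-TB ratio to a percentage (1 TB = 1000 GB)."""
    return gb_per_tb / 10.0

observed_pct = gb_per_tb_to_percent(11)  # special usage from zpool list -v
ballpark_pct = 0.03                      # commonly discussed figure
zdb_pct = 0.1                            # "Metadata Total" from zdb

print(f"observed: {observed_pct:.2f}%")
print(f"vs ballpark: {observed_pct / ballpark_pct:.1f}x")
print(f"vs zdb total: {observed_pct / zdb_pct:.1f}x")
```

So the special device allocation is roughly 37 times the ballpark and about 11 times the zdb figure, which matches the order of magnitude gap described.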

I can second this :+1:

Across nodes my usage for special dev is very close - 10-12GB/TB.

For example, for a couple of servers with a total of ~40TB node data I’m at 450GB special usage.

Dataset settings are:
zfs create -o recordsize=1024k -o compression=lz4 -o atime=off -o xattr=off -o primarycache=all -o secondarycache=metadata -o sync=standard pool/dataset

I had initially built with 5GB/TB… but that may have cut it too close. For example I can see I have a zpool now where the HDD portion shows 91%-full, but the metadata SSDs show 99%-full. So 6GB/TB should work… but if I had to recreate them again today I’d use 7GB/TB.
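The 6GB/TB estimate can be backed out of those fill levels. Here is a minimal sketch, with hypothetical function names, that scales the provisioned ratio by how full each side is. It assumes all metadata still fits on the special device; if some has already spilled onto the HDDs, the result is only a lower bound.

```python
def required_gb_per_tb(provisioned_gb_per_tb: float,
                       hdd_fill_pct: float,
                       special_fill_pct: float) -> float:
    """GB of special vdev needed per TB of HDD, extrapolated to a full pool."""
    if hdd_fill_pct <= 0:
        raise ValueError("hdd_fill_pct must be positive")
    return provisioned_gb_per_tb * special_fill_pct / hdd_fill_pct

def special_size_gb(pool_tb: float, gb_per_tb: float) -> float:
    """Total special vdev capacity (GB) for a pool of pool_tb terabytes."""
    return pool_tb * gb_per_tb

# Figures from the post: built at 5GB/TB, HDD 91% full, special 99% full
needed = required_gb_per_tb(5, 91, 99)
print(round(needed, 2))

# Illustrative pool size (an assumption, not from the thread): 20 TB at 7GB/TB
print(special_size_gb(20, 7))
```

The extrapolated need lands between 5 and 6 GB/TB, so 6 covers it narrowly and 7 leaves some margin.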

I guess if hashstore becomes the default… with larger files there will be fewer of them… so special-metadata space should go down?

What’s the point of this? Your metadata is already on SSD, no?

Why is sync not disabled?

99% full is full. That means your special device is not large enough to fit all metadata. So you don’t really know how much of it you have.

It is a leftover from previously using L2ARC for metadata caching. With that setup it would reload most of performance enhancing FRU and metadata during a reboot.
You are right, it has no effect in a setup with a special dev.

My pool has sync=always because of other workloads, so I set sync=standard on the dataset in order not to inherit that setting and to just respect the write method requested by the application.
I also have ZIL for sync writes - so.. No special thoughts behind this :slight_smile:

Got it. But storj databases do write synchronously.

And slog does not negate that — it just halves the load on disks.

The effect is that you disallow data to be cached in L2ARC