How to determine the required size for ZFS special VDEV?

I made a second dataset for databases with recordsize=16K and special_small_blocks=16K, so all metadata plus the complete databases should go to the special device, but no small files from other folders.
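For reference, a minimal sketch of that setup (pool and dataset names are placeholders): with special_small_blocks equal to the recordsize, every data block of the dataset qualifies for the special vdev.

zfs create -o recordsize=16K -o special_small_blocks=16K tank/databases
zfs get recordsize,special_small_blocks tank/databases   # verify the settings took effect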

Any results yet?

Not yet. The zdb run takes ages.

I have intermediate results from another machine: I looked at how much space was allocated on the metadata device, then changed the recordsize of the node's datastore from 128k to 512k, did a send/receive, deleted the old dataset, waited for garbage collection to finish, and checked the special device allocation again. There was no noticeable difference: the allocation increased consistently with the roughly 1TB of data added during this time. The node was around 5TB, so this did not replicate your result.

I'm waiting for the garbage collection to complete on the other machine. I canceled the send/receive mid-process, so everything transferred so far is now garbage, and it takes a while to reclaim the space before I can try again.

By the way, since zfs-2.2.0 there is no longer any need to jump through hoops: the amount of metadata is reported by the zdb tool directly. What does the ZFS Metadata Special Device do? · openzfs/zfs · Discussion #14542 · GitHub
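For anyone who wants to check their own pool, something along these lines should work (pool name is a placeholder; exact output labels vary by version, and it can take a long time on a big pool):

zdb -bb tank    # block statistics, including a metadata breakdown
zdb -Lbb tank   # -L skips leak checking, which speeds the run up considerably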

This 5GB/TB would also do the trick for an LVM metadata cache pool used with ext4.

I don't understand you: 5GB/TB special devs are for ZFS metadata. What does LVM have to do with it (unless you create the logical partitions used for this purpose with LVM)?

Ext4 should stay out of the equation here in any case: 1) don't mix filesystems, and 2) LVM cache is not filesystem-specific and will always underperform compared with filesystem-specific optimizations.

I may use 6GB/TB in the future. I have a 5GB/TB node copying now where the raw pool is about 8.7% full and the metadata device is 8.9% full (so if nothing changes, the metadata space will run out before the HDD fills). But this is very sensitive to the number of small files… before the copy completes I may well see 50% HDD / 45% metadata: I'm somewhat at the mercy of randomness.

On other nodes 5GB/TB works; it's just tight. I'm not a large enough SNO to say anything for sure, and you only really care as the HDD approaches 100% full.

What is your recordsize?
I found the metadata to be significantly smaller using a recordsize of 512k or higher. I also set redundant_metadata=some.

My metadata is actually 2.5-3GB per TB stored.
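For anyone following along, those settings are simply (dataset name is a placeholder; as noted below, they only affect newly written data):

zfs set recordsize=512k tank/storagenode
zfs set redundant_metadata=some tank/storagenode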

Yeah, I saw the same thing at 128k… so I use 512k now too. I haven't touched redundant_metadata.

But the speed increase is crazy! I did a "time ls -lR | wc -l" on a satellite blob folder and it traversed 13 million files in about 3.5 minutes. I have a couple of small nodes and used-space filewalkers finish very quickly now. Sweet!


Did you change it, or was it set at the creation of the pool? Because changing it now will not affect existing data in the pool, only future writes/changes.

When I was testing I was making new pools. I'd create them and attach the metadata vdev before even creating a directory on them or altering permissions (then untar a sample node into them and watch what happened). Because 5GB/TB worked, I'm moving everything over from ext4 now.

I know LVM caching can work, and so does tailoring L2ARC towards metadata. But having all metadata reads and writes on SSD, 24x7, with no cache to ever 'warm up', seemed the smarter way to go.

If I paid for SSD space, I want it working for me in a tangible way: it should soak up all the filewalker housekeeping IO and leave the HDDs to handle only real ingress/egress.

That's my way of thinking too, although I make sure to use two mirrored special devs (so 2x 5GB/TB) to avoid a single point of failure in case an SSD dies.
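For illustration, a mirrored special vdev looks something like this (device names are placeholders; if added to an existing pool, only newly written metadata lands on it):

zpool create tank /dev/sdX special mirror /dev/nvme0n1p1 /dev/nvme1n1p1
# or for an existing pool:
zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1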


To chime in, a special metadata vdev on SSD also seems like the best performance option to me.

However, I'm not doing it because:

  1. I only have one SSD available, so no mirroring for redundancy.

  2. Even if I DID have mirroring, I have about 7 hard drives, so if they all depended on this one mirrored vdev it would set me up for a catastrophic failure if I messed something up. Which I would.


You could also consider using a small external SSD for that.

Besides, an L2ARC on SSD might already be a big improvement.
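That would be something like this (device name is a placeholder):

zpool add tank cache /dev/nvme0n1p2     # add an L2ARC device
zfs set secondarycache=metadata tank    # optionally cache only metadata in L2ARC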

If you have only one HDD attached to your system, I can imagine you're accepting the risk, also because SSDs have proven to be more durable than HDDs. But if you've got multiple drives, I would really advise rethinking this one big point of failure for so many drives.

Looks like today, on every server that hosts nodes, I consistently see about 11GB of special device usage per TB of storage (as reported by zpool list -v). That is about 1.1%, roughly 36x more than the commonly discussed ballpark of 0.03%.

On the other hand, if we go by the "Metadata Total" reported by zdb, the percentage is about 0.1%. I'm not sure why there is an order of magnitude discrepancy between the metadata total reported by zdb (which is supposed to be the source of truth) and the allocated size on the special device.
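For what it's worth, the two numbers I'm comparing come from (pool name is a placeholder):

zpool list -v tank                        # ALLOC on the special vdev line = space actually used on it
zdb -bb tank | grep -i 'metadata total'   # zdb's own metadata accounting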


I can second this :+1:

Across nodes my usage for the special dev is very similar: 10-12GB/TB.

For example, for a couple of servers with a total of ~40TB of node data I'm at 450GB of special usage.

Dataset settings are:
zfs create -o recordsize=1024k -o compression=lz4 -o atime=off -o xattr=off -o primarycache=all -o secondarycache=metadata -o sync=standard pool/dataset

I had initially built with 5GB/TB… but that may have cut it too close. For example, I can see I have a zpool now where the HDD portion shows 91% full, but the metadata SSDs show 99% full. So 6GB/TB should work… but if I had to recreate them again today I'd use 7GB/TB.
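As a rough worked example of that rule of thumb (the drive size is arbitrary): at 7GB/TB, an 18TB HDD would need about 18 × 7 ≈ 126GB of special vdev capacity, per mirror side.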

I guess if hashstore becomes the default… with larger files there will be fewer of them… so special-metadata space should go down?

What’s the point of this? Your metadata is already on SSD, no?

Why is sync not disabled?

99% full is full. That means your special device is not large enough to fit all metadata, so you don't really know how much of it you have.

It's a leftover from previously using L2ARC for metadata caching. With that setup it would reload most of the performance-enhancing frequently-used data and metadata after a reboot.
You are right, it has no effect in a setup with a special dev.

My pool has sync=always because of other workloads, and the dataset is set so it does not inherit that setting and just respects the write method requested by the application.
I also have a ZIL (SLOG) for sync writes - so... no special thoughts behind this :slight_smile:
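In other words, a sketch of that arrangement (device and dataset names are placeholders):

zpool add tank log /dev/nvme0n1p3       # dedicated SLOG device for sync writes
zfs set sync=standard tank/storagenode  # dataset honours whatever the application requests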


Got it. But the Storj databases do write synchronously.

And a SLOG does not negate that; it just halves the load on the disks.

The effect is that you disallow data (as opposed to metadata) from being cached in L2ARC.