I made a second dataset for the databases with recordsize=16K and special_small_blocks=16K, so all metadata plus the complete databases should go to the special device, but no small files from other folders.
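For anyone who wants the same split, a minimal sketch of what that looks like (pool and dataset names like tank/node-db are just placeholders, and the 512K blob recordsize is the value discussed later in this thread):

```sh
# Database dataset: 16K records, and special_small_blocks=16K means every
# data block in it (all of them <=16K) goes to the special vdev along with metadata.
zfs create -o recordsize=16K -o special_small_blocks=16K tank/node-db

# Blob dataset: special_small_blocks stays at its default of 0, so only
# metadata lands on the special vdev and the piece files stay on the HDDs.
zfs create -o recordsize=512K tank/node-blobs
```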
I have intermediate results from another machine: I looked at how much space was allocated on the metadata device, changed the record size of the node's datastore from 128k to 512k, did send/receive, deleted the old copy, waited for garbage collection to finish, and checked the special device allocation again. There was no noticeable difference: the allocation grew only in line with the roughly 1 TB of data added during that time. The node was around 5TB, so this did not replicate your result.
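Roughly the procedure I described, in case anyone wants to repeat it (tank/node is a placeholder dataset name):

```sh
# Special vdev allocation before the rewrite
zpool list -v tank

# Raise the recordsize; this only applies to blocks written from now on
zfs set recordsize=512K tank/node

# Rewrite the data into a fresh dataset, then drop the old copy
zfs snapshot tank/node@move
zfs send tank/node@move | zfs receive -o recordsize=512K tank/node-new
zfs destroy -r tank/node
zfs rename tank/node-new tank/node

# Once the freed space has actually been reclaimed, compare again
zpool list -v tank
```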
I'm waiting for the garbage collection to complete on the other machine: I canceled send/receive mid-process, so everything transferred so far is now garbage. It takes a while to reclaim that space before I can try again.
I don't understand you: 5GB/TB special devs are for ZFS metadata. What does LVM have to do with it (unless you create the logical partitions used for this purpose with LVM)?
Ext4 should stay out of the equation here in any case: 1) don't mix filesystems, and 2) LVM cache is not filesystem-specific and will always underperform compared with filesystem-specific optimizations.
I may use 6GB/TB in the future. I have a 5GB/TB node copying now where the raw pool is about 8.7% full and the metadata device is 8.9% full (so if nothing changes, the metadata space will run out before the HDD fills). But this is very sensitive to the number of small files… before the copy completes I may well see 50% HDD / 45% metadata: I'm somewhat at the mercy of randomness.
On other nodes 5GB/TB works; it's just tight. I'm not a large enough SNO to say anything for sure, and you only really care as the HDD approaches 100% full.
Yeah, I saw the same thing at 128… so I use 512 now too. I haven't touched redundant_metadata.
But the speed increase is crazy! I ran a "time ls -lR | wc -l" on a satellite blob folder and it traversed 13 million files in about 3.5 minutes. I have a couple of small nodes and the used-space filewalkers finish very quickly now. Sweet!
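If anyone wants to run the same quick test, it is simply this (the path is an example; point it at one of your satellite blob folders):

```sh
# Recursively list one satellite's blobs folder and count the output lines.
# With all metadata on the special vdev this is essentially pure SSD I/O.
cd /tank/node/storage/blobs/<satellite-folder>
time ls -lR | wc -l
```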
Did you change it, or was it set at the creation of the pool? Changing it now will not affect existing data in the pool, only future writes/changes.
When I was testing I was making new pools. I'd create them and attach the metadata device before even creating a directory on them or altering permissions (then untar a sample node into them and watch what happened). Because 5GB/TB worked, I'm moving everything over from ext4 now.
I know LVM caching can work, and so does tailoring L2ARC towards metadata. But having all metadata reads and writes on SSD, 24x7, with no cache to ever "warm up"… seemed the smarter way to go.
If I'm paying for SSD space I want it working for me in a tangible way: it should soak up all the filewalker housekeeping IO and leave the HDDs to handle only real ingress/egress.
That's my way of thinking too, although I make sure I use two mirrored special devs (so 2x5GB/TB) to avoid a single point of failure in case an SSD dies.
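For reference, adding such a mirrored special vdev looks roughly like this (device names are examples; size them at about 5GB per TB of HDD as discussed above):

```sh
# Attach two SSDs (or SSD partitions) as a mirrored special vdev.
zpool add tank special mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B

# The special vdev then shows up with its own SIZE/ALLOC/CAP line here:
zpool list -v tank
```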
To chime in, a special metadata vdev on SSD also seems like the best performance option to me.
However, I'm not doing it because:
- I only have one SSD available, so no mirroring for redundancy, and
- even if I DID have mirroring, I have about 7 hard drives, so if they all depended on this one mirrored vdev it sets me up for a catastrophic failure if I mess something up. Which I would.
You could also consider using a small external SSD for that.
Besides, an L2ARC on SSD might already be a big improvement on its own.
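If you go the L2ARC route instead, a sketch would be something like this (device path and dataset name are examples):

```sh
# Add the SSD as a cache (L2ARC) device; unlike a special vdev, losing it
# does not lose any data.
zpool add tank cache /dev/disk/by-id/ssd-cache

# Optionally steer the L2ARC towards metadata only for the node dataset.
zfs set secondarycache=metadata tank/node
```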
If you have only one HDD attached to your system, I can imagine you'd accept the risk, also because SSDs have proven to be more durable than HDDs. But if you've got multiple drives, I would really advise rethinking this one big point of failure for all of those drives.
Looking at it today, on every server that hosts nodes I consistently see about 11GB of special device usage per TB of storage (as reported by zpool list -v). That is about 1.1%, roughly 36x more than the commonly discussed ballpark of 0.03%.
On the other hand, if we go by the "Metadata Total" reported by zdb, the percentage is about 0.1%. I'm not sure why there is an order-of-magnitude discrepancy between the metadata total reported by zdb (which is supposed to be the source of truth) and the allocated size on the special device.
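The two numbers I'm comparing come from something like this (tank is a placeholder pool name; the zdb run walks all block pointers, so it takes a while on a large pool):

```sh
# Per-vdev allocation, including the special vdev's ALLOC/CAP columns
zpool list -v tank

# Block statistics by type; the summary includes the "Metadata Total" line
# (-L skips leak detection, -bbb asks for detailed block stats)
zdb -Lbbb tank | grep -i "metadata total"
```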
I had initially built with 5GB/TB… but that may have cut it too close. For example, I have a zpool right now where the HDD portion shows 91% full but the metadata SSDs show 99% full (scaled to a full HDD that works out to roughly 5.4GB/TB). So 6GB/TB should work… but if I had to recreate them today I'd use 7GB/TB.
I guess if hashstore becomes the default… with larger files there will be fewer of them… so special-metadata space usage should go down?
It's a leftover from previously using L2ARC for metadata caching. With that setup it would reload most of the performance-enhancing MFU (frequently used) data and metadata after a reboot.
You are right, it has no effect in a setup with a special dev.
My pool has sync=always because of other workloads, and I set this on the dataset in order not to inherit that setting, just respecting the write method requested by the application.
I also have a dedicated ZIL (SLOG) for sync writes, so… no special thoughts behind this.
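In practice that just means overriding the inherited value on the node dataset (presumably sync=standard; the names below are placeholders):

```sh
# The pool keeps sync=always for the other workloads...
zfs get sync tank

# ...while the node dataset honours whatever the application requests
zfs set sync=standard tank/node
```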