Best record size for ZFS

It’s somewhat counterintuitive but very logical. The issue here is not space but random IO. Every read from an HDD involves seek latency (huge) and transfer time (depends on object size). For small objects, seek greatly exceeds transfer time. Moreover, every file access involves a metadata read — another small-object fetch.
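To put numbers on "seek greatly exceeds transfer time", here is a back-of-envelope sketch. The ~10 ms seek and ~150 MB/s sequential throughput figures are assumed typical-HDD values, not from the post:

```shell
# Assumed HDD figures: ~10 ms average seek, ~150 MB/s sequential transfer.
awk 'BEGIN {
  seek  = 10                      # ms per random seek (assumption)
  small = 4 / 150000 * 1000       # 4 KB transfer at 150 MB/s, in ms
  large = 1024 / 150000 * 1000    # 1 MB transfer at 150 MB/s, in ms
  printf "4K read:  transfer %.3f ms, seek is %.0fx larger\n", small, seek/small
  printf "1M read:  transfer %.1f ms, seek is only %.1fx larger\n", large, seek/large
}'
# -> 4K read:  transfer 0.027 ms, seek is 375x larger
# -> 1M read:  transfer 6.8 ms, seek is only 1.5x larger
```

So for a 4K object the disk spends essentially all its time seeking, while for a 1M object the seek is a modest overhead.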

If you have a cache — yes, subsequent reads will be served from the cache, provided you have enough RAM to fully cache metadata, but the first read must still come from disk. Furthermore, looking at my node, the number of chunks that get repeatedly read is very small (but those that do are read many times).

With a special device you would send all metadata and small files (say, under 4K) to SSD and leave only large files on the HDD. That way small files don’t take the latency hit, metadata does not contribute extra IO to the HDD, and the disks are left doing what they do best — storing large files, where seek time is small compared to transfer time and therefore much less impactful.
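A minimal sketch of that setup. The pool name `tank` and the device paths are placeholders — adapt them to your pool; the special vdev should be mirrored, since losing it loses the pool:

```shell
# Add a mirrored SSD special vdev to the (placeholder) pool "tank".
zpool add tank special mirror /dev/disk/by-id/ssd-a /dev/disk/by-id/ssd-b

# Route blocks of 4K and smaller to the special vdev; metadata goes
# there automatically once the special vdev exists.
zfs set special_small_blocks=4K tank
```

Note that `special_small_blocks` only affects newly written data; existing small files stay on the HDDs until rewritten.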

The rule of thumb is that the special device should be about 0.3% of the pool size. For a 10TB pool that’s 30GB. If we also want to store, say, 1M small (under-4K) files — that’s another sub-4GB of space. A 50GB SSD completely covers the needs of a 10TB pool and completely eliminates the random IO associated with metadata lookups and small files. This also frees you from needing to cache that — so more cache remains for actual data (though it may not even be needed at that point).
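The arithmetic behind those figures, as a quick sketch using the numbers from the post:

```shell
# Sizing a special vdev for a 10 TB pool, per the 0.3% rule of thumb.
pool_gb=$((10 * 1000))                        # 10 TB = 10,000 GB
special_gb=$((pool_gb * 3 / 1000))            # 0.3% of the pool -> 30 GB
small_file_gb=$((1000000 * 4 / 1000 / 1000))  # 1M files x 4 KB -> ~4 GB
echo "metadata: ${special_gb} GB + small files: ${small_file_gb} GB"
# -> metadata: 30 GB + small files: 4 GB
```

Roughly 34 GB total, so a 50 GB SSD covers it with headroom.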

Here is a thread with stats from my node on chunk re-use and other performance-related tidbits: Notes on storage node performance optimization on ZFS

As for record size — indeed, leave it at the default. With compression enabled, ZFS will not waste space on partially filled records.
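For reference, a sketch of what "leave it at the default" looks like; the pool name `tank` is again a placeholder:

```shell
# recordsize defaults to 128K; with compression on, a record that holds
# less data than recordsize is stored compressed, so nothing is wasted.
zfs set compression=lz4 tank
zfs get recordsize,compression tank
```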
