The file size is not what dictates optimal recordsize; how data is written is. But see below…
The real reason is that I should have gone to sleep instead of messing with crap at 2 AM, that’s why.
Writes:
So, here is what my thought process was. I assumed (which, as today’s me realized, was a bullshit assumption) that hashstore writes small pieces to random logs. ZFS coalesces transactions into groups and writes them all at once, but if a lot of files get a lot of small appends, read-modify-write amplification becomes a concern. So a smaller record size helps.
However, this is not what hashstore does. Hashstore (IIUC) writes data to a few active logs (selected by TTL), and already groups and coalesces writes. So there are no small writes at all. A large record size should be appropriate.
Reads:
Storagenode will probably read pieces predominantly at random, with maybe a few pieces repeated. The repeated ones will eventually get cached, so they are not a factor. Random small reads (meaning no spatial locality) will experience read amplification: even if the app requests a 16k block, a whole recordsize worth of data will be fetched.
How important is it? I’d argue not much. Maybe it will add a few ms of read latency, but we already have huge latency reading from HDD anyway. Quadrupling the read bandwidth from the disks also won’t matter: because seek latency dominates, we’ll probably go from 10MBps to 40MBps, which is not a factor at all.
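To put rough numbers on the read side, here is a back-of-envelope sketch in Python. The 10 ms seek time and 150 MB/s sequential throughput are my assumptions for a generic HDD, not measurements, and `read_cost` is just a made-up helper name. It also ignores compression, caching and requests that straddle a record boundary.

```python
def read_cost(request_bytes, recordsize_bytes, seek_ms=10.0, seq_mb_per_s=150.0):
    """Return (amplification, latency_ms) for one uncached random read.

    seek_ms and seq_mb_per_s are assumed HDD figures, not measured ones.
    """
    # Whole records are fetched, so count the records the request covers
    # (assumes the request starts on a record boundary).
    records = -(-request_bytes // recordsize_bytes)  # ceiling division
    fetched = records * recordsize_bytes
    amplification = fetched / request_bytes
    # Time to stream the fetched bytes once the head is in position.
    transfer_ms = fetched / (seq_mb_per_s * 1e6) * 1000
    return amplification, seek_ms + transfer_ms


if __name__ == "__main__":
    request = 16 * 1024  # the 16k read from the example above
    for recordsize in (16 * 1024, 128 * 1024, 1024 * 1024):
        amp, latency = read_cost(request, recordsize)
        print(f"recordsize={recordsize // 1024:>5}K  "
              f"amplification={amp:5.1f}x  latency~{latency:5.2f} ms")
```

Under these assumptions the seek term stays the bigger part of the latency even at large record sizes, which is the point being made: the extra bytes cost little compared to getting the head there in the first place.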
And yet, the downside of a small record size is a lot more bookkeeping on the special device.
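A rough way to see the bookkeeping cost: every record needs a block pointer, and a ZFS block pointer is 128 bytes. The sketch below only counts those leaf pointers. It ignores higher indirect levels, metadata compression and extra metadata copies, so treat it as an order-of-magnitude estimate; `pointer_overhead_bytes` is a hypothetical helper and the 10 TiB of stored data is an assumed example figure.

```python
BLKPTR_BYTES = 128  # size of one ZFS block pointer


def pointer_overhead_bytes(data_bytes, recordsize_bytes):
    """Estimate the bytes of leaf block pointers needed to map data_bytes."""
    blocks = -(-data_bytes // recordsize_bytes)  # ceiling division
    return blocks * BLKPTR_BYTES


if __name__ == "__main__":
    data = 10 * 2**40  # assumed: 10 TiB of piece data
    for recordsize in (16 * 1024, 128 * 1024, 1024 * 1024):
        overhead = pointer_overhead_bytes(data, recordsize)
        print(f"recordsize={recordsize // 1024:>5}K  "
              f"pointers~{overhead / 2**30:6.2f} GiB")
```

The pointer count scales inversely with recordsize, so going from 128K down to 16K means roughly eight times as much of this metadata to store and walk.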
Caching also won’t be affected much: there are usually not many hot pieces.
So, it seems we should keep recordsize at the default and not mess with things that don’t need to be messed with.
Can you check what append sizes hashstore actually produces on your end? What do you observe? (I can’t, because with the in-progress migration everything will be super sequential.)