Massive reads on startup

The plots you have prepared are for that device with its huge ARC. With a smaller ARC you’d see significant cache contention.

12 TB of node data is, taking the numbers from my nodes, around 32M files. Even if you assume a best-case general-purpose metadata footprint of, let’s say, 300 bytes per file (sorry, I don’t know the actual number for ZFS; I believe this is a fair estimate for a general-purpose file system with modern semantics like ACLs, extended attributes and such), that is almost 10 GB of metadata to cache. A careful ext4 setup can go down to ~170 bytes per file while sacrificing some features. So, sorry, but I don’t really believe you can do that in 3 GB. I suspect you have disabled the disk usage estimation and are only counting the trash file walker, probably after some large removals.
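
For the record, a quick back-of-envelope sketch of that arithmetic (the 300 and 170 bytes/file figures are the estimates above, not measured ZFS/ext4 numbers):

```python
# Back-of-envelope metadata sizing for a 12 TB node.
NODE_SIZE_BYTES = 12e12   # 12 TB of node data
FILE_COUNT = 32e6         # ~32M files, per my nodes

avg_file_size = NODE_SIZE_BYTES / FILE_COUNT  # ~375 KB per file
print(f"average file size: ~{avg_file_size / 1e3:.0f} KB")

for label, bytes_per_file in [
    ("general-purpose FS (estimate)", 300),   # ACLs, xattrs, modern semantics
    ("carefully tuned ext4 (estimate)", 170),
]:
    total_gb = FILE_COUNT * bytes_per_file / 1e9
    print(f"{label}: ~{total_gb:.1f} GB of metadata")

# average file size: ~375 KB
# general-purpose FS (estimate): ~9.6 GB of metadata
# carefully tuned ext4 (estimate): ~5.4 GB of metadata
```

Either way you are well above 3 GB.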

I actually used to think that a carefully formatted ext4 would be good enough. Then my nodes grew to >25 TB on an old machine that has only 16 GB of RAM and cannot be extended. I had to add an SSD at that point for caching metadata, otherwise my drives were absolutely thrashed. I spent time learning ext4’s data structures and figured we could do quite a lot better than that.

A purpose-built file system could do the job in ~44 bytes of overhead per file, and put it all tightly in a single place on a HDD, making any file scan a breeze even without SSDs. Metadata could still be cached in RAM to keep time-to-first-byte low, sure, why not, but it could focus on caching metadata for the files that are actually used frequently, and the cache would not have to be thrashed by a file walker process every so often. So, my Microserver gen7 with 16 GB of RAM could then manage ~120 TB worth of nodes while keeping all metadata in RAM. Or, going by the same numbers, an RPi4 with 4 GB of RAM would be enough for ~30 TB worth of nodes. In both cases achieving the same effect as your ZFS setup. Or, in both cases, manage even larger datasets at the cost of occasional metadata lookups, probably mostly for audits and repairs, as customer traffic tends to have patterns.
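
If it helps, here is the same arithmetic run forward as a sketch. The 44 bytes/file and the ~375 KB average file size come from the numbers above; the 80% figure for how much RAM can realistically go to the metadata cache is my own assumption, not something measured:

```python
GIB = 2**30
BYTES_PER_FILE_METADATA = 44    # purpose-built FS overhead from above
AVG_FILE_SIZE = 12e12 / 32e6    # ~375 KB, implied by the node stats above
RAM_FRACTION_FOR_METADATA = 0.8 # assumption: the rest goes to OS + node process

for machine, ram_gib in [("Microserver gen7", 16), ("RPi4", 4)]:
    cache_bytes = ram_gib * GIB * RAM_FRACTION_FOR_METADATA
    files_cacheable = cache_bytes / BYTES_PER_FILE_METADATA
    node_capacity_tb = files_cacheable * AVG_FILE_SIZE / 1e12
    print(f"{machine} ({ram_gib} GB RAM): ~{node_capacity_tb:.0f} TB of nodes")

# Microserver gen7 (16 GB RAM): ~117 TB of nodes
# RPi4 (4 GB RAM): ~29 TB of nodes
```

Which lands right around the ~120 TB and ~30 TB figures above.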

Of the general-purpose file systems, ZFS is good enough for nodes given decent hardware specs, as your setup shows. Can’t disagree with that statement. So is ext4 though, given the same kind of setup: a large amount of RAM for the disk cache, or a standard block-level SSD cache.
