Best recordsize for zfs

I was talking with @SGC a bit and thought that my answer might be of interest to more people:

I have been thinking a bit more about recordsize and the use of compression in zfs. We have always been trying to find the best recordsize for storj, so here are my considerations:

Preventing read/write amplification (wasting bandwidth)

In general you use a smaller recordsize (the default is 128K) because files can be accessed and modified partially. This is most obvious with databases (and also applies to the storagenode databases). The DB might be 4MB big, but you only want to access 12KB in the middle of the file and modify 6KB at its end. To read those 12KB you need to read at least one block, and every block is as big as the recordsize. So if you use 1M as recordsize, you’d need to read 1MB just to get 12KB. That’s a huge read amplification, wasting lots of bandwidth, even though you only need one operation. A better recordsize here would be 16KB.
If you want to modify 6KB in the DB, you need to read the affected block and write a new one. So with a 1M recordsize you’d have huge read and write amplification: you read 1MB, modify 6KB in it and then write 1MB back to the disk. With a recordsize of 16KB that is much better. Would I recommend 16KB for database datasets? I actually would; that’s also what I read about running MySQL on zfs.
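To put rough numbers on that, here is a small back-of-the-envelope sketch. The 12KB read / 6KB modify are just the example values from above, the helper names are made up, and it assumes every touched record is read and rewritten whole, ignoring metadata:

```python
# Rough read/write amplification for a partial access to a file,
# assuming every touched record has to be read (and rewritten) whole.
# The 12 KB read / 6 KB modify are just the example values from the text.

def records_touched(offset_kb, size_kb, recordsize_kb):
    """How many records an access of size_kb starting at offset_kb touches."""
    first = offset_kb // recordsize_kb
    last = (offset_kb + size_kb - 1) // recordsize_kb
    return last - first + 1

def amplification(size_kb, recordsize_kb, offset_kb=0):
    """Bytes actually moved divided by bytes the application wanted."""
    return records_touched(offset_kb, size_kb, recordsize_kb) * recordsize_kb / size_kb

for rs in (16, 128, 1024):  # recordsize in KB
    print(f"recordsize {rs:>4} KB: "
          f"read 12 KB -> ~{amplification(12, rs):.1f}x, "
          f"modify 6 KB -> ~{amplification(6, rs):.1f}x")

# recordsize   16 KB: read 12 KB -> ~1.3x, modify 6 KB -> ~2.7x
# recordsize  128 KB: read 12 KB -> ~10.7x, modify 6 KB -> ~21.3x
# recordsize 1024 KB: read 12 KB -> ~85.3x, modify 6 KB -> ~170.7x
```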
For logs I am actually unsure. Compression would probably benefit from a higher recordsize, and thanks to compression there might not even be write amplification, as the last block of the logfile will only be as big as needed. Read amplification will also not occur, as only that last compressed block needs to be read to append something to the file (and that record would stay cached in the (l2)arc anyway).

Reducing needed IOPS

Contrary to the DB use-case, the storagenode storage of all pieces behaves completely differently. The files we receive range from 4KB to 2.3MB, and most of them are ~2.3MB. Those are written once and never modified again, so we don’t have to be concerned about write amplification at all. By choosing a recordsize of 1MB we save IOPS, because a 2.3MB file can be written in 3 operations, whereas the default of 128KB would need 19 operations. This could potentially be beneficial on SMR drives, as those have trouble keeping up with a high load of sustained write operations. However, adjacent I/O operations get merged automatically, so if a file is not fragmented and can be written/read in one go from one location, the recordsize doesn’t even make a difference. One could argue though that it saves some overhead, because every block has metadata like its checksum: one block at 1MB recordsize needs less metadata than 8 blocks at 128KB.
What about read amplification? A 16KB file stored with a 1MB recordsize needs 1 operation no matter the recordsize, but would seemingly cause huge read amplification, as we’d be wasting 1MB-16KB of bandwidth. That’s where zfs shows its strength. Compression (lz4 or zle, it doesn’t really matter which one) will compress all that empty space at the end of the record after the 16KB file, and the resulting block will effectively only be 16KB. (Edit: it seems this should work even without compression, but I haven’t tried it.) Therefore, with compression enabled there is no read amplification. (The CPU cycles needed for decompression are basically irrelevant at current node traffic.) So using a recordsize of 1MB will save IOPS for reading files without read amplification: a 2.3MB file can be read in 3 operations, while the default recordsize needs 19. The zfs recordsize is always a maximum; if your file is smaller, so will be the record. And as mentioned before, adjacent read operations will get merged, so if the file was written to the disk without any fragmentation, you are actually not saving any IOPS because the OS saves them automatically anyway. It only gets more relevant when files are stored fragmented.
Edit: I actually just read that this dynamic shrinking of records should work even without compression active. But as written below, having compression enabled is not a disadvantage.
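As a quick sanity check of those record counts, here is a sketch. The 2.3MB figure is the typical maximum piece size mentioned above, and this deliberately ignores metadata blocks and the merging of adjacent I/Os:

```python
import math

KB = 1024
MB = 1024 * KB

def records_for(file_size, recordsize):
    """Number of records (worst-case I/O operations) a file of this size needs."""
    return math.ceil(file_size / recordsize)

piece = int(2.3 * MB)   # typical maximum piece size from the text
small = 16 * KB         # small piece example

for rs in (128 * KB, 512 * KB, 1 * MB):
    print(f"recordsize {rs // KB:>4} KB: "
          f"2.3MB piece -> {records_for(piece, rs):>2} records, "
          f"16KB piece -> {records_for(small, rs)} record(s)")

# recordsize  128 KB: 2.3MB piece -> 19 records, 16KB piece -> 1 record(s)
# recordsize  512 KB: 2.3MB piece ->  5 records, 16KB piece -> 1 record(s)
# recordsize 1024 KB: 2.3MB piece ->  3 records, 16KB piece -> 1 record(s)
```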

What about the (l2)arc cache and the slog?

The only thing I’m unsure about here is how this affects the (l2)arc. Will the compressed record be stored in the (l2)arc or the “decompressed” one? The “decompressed” one would indeed be a waste of (l2)arc RAM/SSD storage, as every 16KB file would use 1MB in those caches. But that’s something I don’t know and couldn’t find much information about. However, since the recordsize is variable, it doesn’t make any sense to expand a 16KB file to the dataset’s 1MB recordsize, because the file’s block only has a recordsize of 16KB and the rest is/was just empty space. So I actually assume that it doesn’t waste much RAM and only loads 16KB into it.
The (l2)arc caches actually give no benefit for the piece storage of a storagenode, because files will hardly ever be downloaded twice within a short enough time to make it worth keeping them in the cache. I wish there was an option to disable caching for certain datasets.
The DBs, however, benefit greatly from the (l2)arc cache, so the database doesn’t always need to be loaded from the disk. The slog will greatly reduce the required write IOPS, because changes will be written to the slog and kept in RAM until the next regular flush to the drive. This will be greatly beneficial for SMR drives. Between those regular flushes there might even be multiple changes to the DB, which will all end up in RAM and the slog instead of constantly being written to the disk.

Will compression save space?

No, it definitely will not. My dashboard shows almost exactly the same usage in TB as the amount of data written according to zfs (which reports TiB and has to be converted to TB for the comparison). The storagenode data is encrypted, so there is nothing to compress. The DBs and logs are compressible, but they are a rather small amount of data compared to the stored pieces. Compression doesn’t really cost performance or add latency though, so it’s completely safe to use.
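For reference, the TiB to TB conversion needed for that comparison (trivial, and the 10 TiB figure is just a made-up example, not my actual node size):

```python
# zfs reports binary units (TiB); the dashboard uses decimal TB.
def tib_to_tb(tib):
    return tib * 1024**4 / 1000**4

print(f"10 TiB = {tib_to_tb(10):.3f} TB")  # 10 TiB = 10.995 TB
```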

Recommendations

So to sum it up, I recommend using a higher recordsize, whatever you feel comfortable with. I have been running with 1MB for a long time but switched to 512KB on my last migration (though after writing this I feel like switching to 1MB again). I recommend using compression, just in case, even though dynamic recordsizes should work without it.
The logs can stay in the storage dataset with 1MB recordsize.
As noted in the edits above, the OS merges adjacent read/write operations and we don’t actually expect files to be fragmented when stored on the HDD, so the recordsize wouldn’t matter. However, with an increasing number of deletes and new files being stored, there is a chance that new files get increasingly fragmented (I’m not quite sure how zfs chooses a file’s location, but I doubt it always searches for a free space big enough to fit it in completely? Please correct me if you know more), especially the closer you get to a full HDD. So this reinforces my conclusion to use a higher recordsize.

If you feel comfortable doing it, you might create an additional dataset for the DBs with a recordsize of 16KB (I haven’t done that myself, as with the caching and the low traffic I have no HDD problems; with an SMR drive this might be interesting). The DBs will definitely get quite fragmented quickly and will benefit most from caches, or even from being stored on an SSD instead of the HDD (if you use an SMR drive that’s good advice; for normal CMR HDDs it shouldn’t be a problem, at least not with a cache).
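In case it helps, here is a tiny sketch that just prints the zfs commands matching these recommendations. The pool and dataset names (tank/storj, tank/storj-db) are purely hypothetical placeholders; adapt them to your own layout and double-check everything before running it.

```python
# Print (not run) the zfs property settings matching the recommendations above.
# "tank/storj" and "tank/storj-db" are hypothetical dataset names.
POOL = "tank"

commands = [
    # Piece storage (and logs): large recordsize, compression on "just in case".
    f"zfs set recordsize=1M {POOL}/storj",
    f"zfs set compression=lz4 {POOL}/storj",
    # Optional separate dataset for the DBs: small recordsize to match their I/O.
    f"zfs create -o recordsize=16K -o compression=lz4 {POOL}/storj-db",
]

for cmd in commands:
    print(cmd)
```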

If you think some of my conclusions are wrong, please correct me. Many of these points are theoretical or just based on what I read; I haven’t actually benchmarked everything or looked it up in the code.
