Best Record size for zfs

Hi all :slight_smile:

Currently waiting on hardware to get delivered for a new big node.
But I just wanted to check: what recordsize is best for storj? I was thinking 32k would be okay, but what are your thoughts?

This might be helpful.

Cut this down to size :smiley:

if using ZFS, the recordsize defaults to 128k for a reason… it works great under almost all loads and also lines up with what hdd hardware performs best at.

for storj i think 256k ZFS recordsize is the best without a doubt, because zfs has variable recordsizes so it will decrease the recordsize when it can… the only detrimental effect of larger recordsizes is that more memory and cache get evicted, because cache accounting is calculated from the recordsize and so the maximum is assumed.

which is why 512k and 1mb are simply not worth using for storj with zfs.
but larger recordsizes do make storagenode migrations much faster… which is why i ended up going a notch over the default of 128k.

check your manuals and hardware guides; the workload is something between a database and a fileserver, so you will most likely end up using something like 64k - 128k. 32k might also work fine… but for zfs i certainly wouldn’t recommend it.

remember to get the correct ashift on the first go… else you will be sorry later.
recordsize can be changed in zfs without destroying the pool, ashift cannot.

setting an incorrect ashift will be much more detrimental and difficult to correct than anything you can do with recordsizes… using ashift 12 on disks that really are 512-byte native (ashift 9) can amplify small I/O by a factor of 8 for some workloads.
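since ashift is fixed per vdev, it has to be given when the pool is created… a rough sketch, where the pool layout and device names are just placeholders and 4k-sector disks are assumed:

# ashift=12 means 4k sectors; this cannot be changed after creation
sudo zpool create -o ashift=12 data raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
# verify what the vdevs actually got
sudo zdb -C data | grep ashift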

Okay thanks, I am going for zfs and will set the 256k recordsize like this:

sudo zfs set recordsize=256k data/storj

Is there a reason why 256k is better for storj?

I use 512K. Most files have a size of ~2.3MB so it fits better (uses fewer IOPS without wasting performance), but there are different experiences and recommendations. Be sure to use compression so the recordsize matches the filesize and you’re not wasting space. There are also many, many small files.
We had some discussion in Zfs discussions
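For reference, on my setup those two settings would look like this (reusing the dataset name from the example above):

# maximum recordsize for the pieces dataset (512K in my case)
sudo zfs set recordsize=512K data/storj
# lightweight compression so partially filled records get shortened
sudo zfs set compression=lz4 data/storj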

1 Like

Doesn’t compression just add unnecessary latency to your node? I mean… it will not save you a lot of space as the data is already encrypted.

2 Likes

lz4 is pretty fast so the latency is not really relevant, but you could even switch to zle. The only reason to use compression is to shorten the records. E.g. if you have a 2.3MB file and 1MB recordsize then you need 3 records to store that file. That means 3MB of used space, even though the file is only 2.3MB. With compression that last record gets shortened to 0.3MB. That’s why I have a compressratio of 1.3 on my latest node dataset with 512K recordsize.
Apart from that you are right, you shouldn’t be able to save any real space because the data is encrypted.
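You can check what compression actually achieves on a dataset with e.g.:

sudo zfs get compression,compressratio data/storj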

1 Like

I was talking with @SGC a bit and thought that my answer might be of interest to more people:

I have been thinking about recordsize a bit more and the use of compression in zfs. We have always been trying to find the best recordsize for storj and here are my considerations:

Preventing read/write amplification (wasting of bandwidth)

In general you use a smaller recordsize (default 128K) because files can be accessed and modified partially. This is most obvious with databases (and also applies to the storagenode databases). The DB might be 4MB big but you only want to read 12KB in the middle of the file and modify 6KB at the end of it. To read those 12KB you need to read one block of the size of the recordsize, so with a 1M recordsize you would read 1MB to get 12KB. That’s a huge read amplification, wasting lots of bandwidth, even though you only need one operation. A better recordsize here would be 16KB.
If you want to write/modify 6KB of the DB, you need to read one block and write a new block. With a 1M recordsize you’d have a huge read and write amplification, because you read 1MB, modify 6KB in it and then write 1MB back to the disk. With a recordsize of 16KB that is much better. Would I recommend 16KB for database datasets? I actually would; that’s what I’ve read about using zfs with mysql (a rough sketch of such a dataset follows below).
For logs I am actually unsure. The compression would probably benefit from a higher recordsize and due to compression there might not even be a write amplification as the last block of the logfile will only be as big as needed. Read amplification will also not occur as only that last compressed block will be read to append something to the file (and that record would stay cached in the (l2)arc anyway).
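As mentioned above, a sketch of a separate dataset for the node databases could look like this (the dataset name data/storj-db is hypothetical, and the node would of course need to be pointed at the new DB location):

# dedicated dataset for the storagenode databases, small records as discussed above
sudo zfs create -o recordsize=16K -o compression=lz4 data/storj-db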

Reducing needed IOPS

Contrary to the DB use-case, the storagenode piece storage behaves completely differently. The files we receive range from 4KB to 2.3MB, and most of them are ~2.3MB. Those are written once and never modified again, so we don’t have to be concerned about write amplification at all. By choosing a recordsize of 1MB, we save IOPS because a 2.3MB file can be written in 3 operations, whereas the default of 128KB would need 19 operations. This could potentially be beneficial on SMR drives, as those have trouble keeping up with a high load of sustained write operations. However, adjacent I/O operations get merged automatically, so if a file is not fragmented and can be written/read in one go from one location, the operations get merged and the recordsize doesn’t even make a difference. One could argue though that it saves some overhead, because every block carries metadata like the checksum: one block of 1MB needs less metadata than 8 blocks of 128KB.
What about read amplification? A 16KB file stored in a dataset with 1MB recordsize needs 1 operation no matter the recordsize, but would seem to have a huge read amplification as we’d be wasting 1MB-16KB of bandwidth. That’s where zfs shows its strength. Compression (lz4 or zle, doesn’t really matter which one) will compress away all that empty space at the end of the record after the 16KB file, and the resulting block will have a recordsize of 16KB. (Edit: this should actually work without compression too, but I haven’t tried it.) Therefore with compression enabled there is no read amplification. (The CPU cycles needed for decompression are basically irrelevant at current node traffic.) So a recordsize of 1MB saves IOPS for reading files without read amplification: a 2.3MB file can be read in 3 operations while the default recordsize needs 19. The zfs recordsize is always a maximum; if your file is smaller, so will be the recordsize. And as mentioned before, adjacent read operations get merged, so if the file was written to the disk without any fragmentation you are actually not saving any IOPS, because the OS saves those IOPS automatically. It only becomes relevant when files get stored fragmented.
Edit: I actually just read that this dynamic shrinking of recordsizes should work without compression active. But as written below, compression is not a disadvantage to have.
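If you want to see what request sizes actually hit the disks after zfs has aggregated adjacent I/O, OpenZFS can print a request size histogram (pool name is just the example used in this thread):

# request size histogram, refreshed every 5 seconds
sudo zpool iostat -r data 5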

What about the cache (l2)arc and slog?

The only thing I’m unsure about here is how this affects the (l2)arc. Will the compressed record be stored in the (l2)arc or the “decompressed” one? The “decompressed” one would indeed waste (l2)arc RAM/SSD storage, as every 16KB file would use 1MB in those caches, but that’s what I don’t know and couldn’t find much information about. However, since the recordsize is variable, it doesn’t make sense to expand a 16KB file to the dataset’s 1MB recordsize, because the block of that file has a recordsize of 16KB; the rest is/was just empty space. So I actually assume that it doesn’t waste much RAM and only loads 16KB into the RAM.
The (l2)arc cache on the piece storage of a storagenode actually gives no benefit, because files will hardly ever be downloaded twice within a short enough duration to make it worth keeping them in the cache. I wish there was an option to disable caching of certain datasets (it turns out zfs has per-dataset primarycache/secondarycache properties for exactly that, see the sketch below).
The DBs however benefit greatly from the (l2)arc cache, so the database doesn’t always need to be loaded from disk. The slog will greatly reduce the required write IOPS, because changes get written to the slog and kept in RAM until the next regular flush to the drive. This will be greatly beneficial for SMR drives. Between those regular flushes there might even be multiple changes to the DB, which all end up in RAM and SLOG instead of constantly being written to the disk.
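As far as I know zfs does have per-dataset knobs for this: the primarycache / secondarycache properties control what ends up in the ARC / L2ARC, and the slog is just a log vdev added to the pool. A rough sketch (data/storj-db and the SSD device path are placeholders):

# cache only metadata for the piece storage, since pieces are rarely read twice
sudo zfs set primarycache=metadata data/storj
sudo zfs set secondarycache=metadata data/storj
# leave the DB dataset fully cached (all is the default anyway)
sudo zfs set primarycache=all data/storj-db
# add an SSD partition as SLOG for synchronous writes
sudo zpool add data log /dev/disk/by-id/example-ssd-part1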

Will compression save space?

No, it definitely will not. My dashboard shows almost exactly the same usage in TB as my zfs written counter (in TiB, which has to be converted to TB for comparison). The storagenode data is encrypted, so there is nothing to compress. The DBs and logs are compressible, but they are a rather small amount of data compared to the stored pieces. But compression doesn’t really cost performance or add latency, so it’s completely safe to use.
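A way to check this yourself is to compare logical vs. physical usage per dataset (dataset name as in the earlier example):

sudo zfs list -o name,used,logicalused,compressratio data/storj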

Recommendations

So to sum it up, I recommend using a higher recordsize, whatever you feel comfortable with. I had been running with 1MB for a long time but switched to 512KB on my last migration (though after writing this I feel like switching back to 1MB). I recommend using compression, just in case, even though dynamic recordsizes should work without it.
The logs can stay in the storage dataset with 1MB recordsize.
As noted in the edits above, the OS merges adjacent read/write operations and we don’t actually expect files to be fragmented when stored on the HDD, so in that case the recordsize wouldn’t matter. However, with an increasing number of deletes and new files being stored, there is a chance that new files get increasingly fragmented (I’m not quite sure how zfs chooses a file’s location, but I doubt it always finds a space that fits it completely? Please correct me if you know more), especially the closer you get to a full HDD. So this reinforces my conclusion to use a higher recordsize.

If you feel comfortable, you might create an additional dataset for the DBs with a recordsize of 16KB (I haven’t done that myself; with the caching and the low traffic I have no HDD problems, but with an SMR drive this might be interesting). The DB will definitely get fragmented quickly and benefits most from caches or even being stored on an SSD instead of the HDD (if you use an SMR that’s good advice; for normal CMR HDDs it shouldn’t be a problem, at least not with a cache).
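To double-check what a dataset ended up with after applying any of this, something like the following works (dataset name is again just the example):

sudo zfs get recordsize,compression,primarycache data/storj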

If you think some of my conclusions are wrong, please correct me. Many of these are just what I read and theoretical. I haven’t actually benchmarked everything or looked it up in the code.

3 Likes

This is really helpful thank you so much :slight_smile:

I think I got it now. So zfs is really smart and the record size that you choose is an absolute maximum. Zfs will check what size is needed per file. For example, if you create a 4 KB text file, ZFS will store that in a 4 KB record. If you create a 12 KB text file, ZFS will store that in a 16 KB record. If you create a 720 KB text file, ZFS will store that in a 1 MB record. Etc.

That’s why lz4 compression is really cool.

I will run with:

1mb recordsize, and lz4 compression. Still waiting on my hardware to get delivered so plenty of time to work everything out :slight_smile:
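For the record, that plan boils down to these two commands (assuming my dataset ends up being called data/storj like earlier in the thread):

sudo zfs set recordsize=1M data/storj
sudo zfs set compression=lz4 data/storj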

yeah i think kevink’s last post covers everything very well in an orderly way, i will add that the read / write amplification affects ram and cache utilization also…
more or less for the same reasons, but i don’t really fully understand all the details of it…
just retelling what i’ve read and what seems to happen when i tested it out also.

and again you can change recordsize when you want… so it’s not a big issue

1 Like

It will shrink the recordsize down to the filesize in multiples of the logical blocksize, so for ashift=12 that would be 4KB.
So a 9KB file would actually need 12KB (if it is completely incompressible), and a 720KB file would get a recordsize of 720KB (if it’s not compressible).

Actually I read that this dynamic recordsize should work without compression too, but I wasn’t sure about it. But since compression doesn’t really cost much performance (even less with zle iirc), it’s not a problem and it compresses logs nicely.
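One way to convince yourself of this is to write a small incompressible test file into a dataset with a large recordsize and look at what actually gets allocated (this assumes the dataset is mounted at /data/storj):

# 16K of incompressible data in a dataset with 1M recordsize
dd if=/dev/urandom of=/data/storj/testfile bs=16K count=1
sync
ls -lh /data/storj/testfile   # apparent size: 16K
du -h /data/storj/testfile    # allocated size: roughly 16K, not 1M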

I would have thought that the disk hardware would take care of combining several operations together.
For instance, if the blocksize is 128KiB, a 2.3MiB file would query 19 blocks as @kevink said, but I would have thought that the disk would combine these 19 operations in one or two… ?

Or if not the disk… the OS? I mean the disk is not really going to waste 19 IOPS to retrieve the file, is it?

I guess you are partially right.

Netdata has a graph described with:

The number of merged disk operations. The system is able to merge adjacent I/O operations, for example two 4KB reads can become one 8KB read before given to disk.

And I think that’s the main point here: If the blocks are adjacent, then the IOPs can be merged. If they are not, then you need e.g. 19 operations to read a file.
I have no idea how fragmented storagenode files are but I would assume that they are typically not fragmented as it should be possible to find one space to fit all blocks (but I don’t know how zfs chooses their space). And since those files don’t ever get modified, they won’t ever get fragmented. Therefore you are right, those 19 operations would actually get merged, meaning there is no difference in the recordsize in this case. I do however have even less knowledge about how SMRs handle it when writing files.
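On Linux you can watch those merges directly with iostat from the sysstat package; the rrqm/s and wrqm/s columns are the merged read/write requests that netdata graphs:

iostat -x 5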

For the DBs however this doesn’t apply much as those are getting quite fragmented due to COW.

I have edited my post above to reflect the new information, thanks.

1 Like

COW :cow2:?! What’s that? :slight_smile:

By the way that’s one of the things that should be improved I think: Nodes should automatically defrag their databases once in a while. Or when starting up, before launching the filewalker.

there really is no 128k block size… it’s a software thing… the hardware has certain fixed parameters for hdd’s which define its “blocksize”, so you cannot write anything smaller than the hardware allows…

going above that has some advantages, because you end up with less fragmentation, and thus more of the data is sequential reads or writes, which hdd’s are much faster at… also if you are thinking raid, count the sectors of each hdd in the raid array minus the redundancy level.
so if you had a 6 disk raid6 or raidz2 then you would have 4x 4k physical sectors (blocks) on the hdds making up 1 IO of the array, thus running anything less than a 16k blocksize would be detrimental to the raid array’s utilization of space.

much like you guys already discussed with the blocksizes.
disk iops are very tricky because with very small data streams they can get very low…
but if we say we can read from a hdd at 120mb/s, to take an easy realistic number,
that’s roughly 1000 sequential read iops at 128k records… and that only works when stuff is in a row.

the real issue comes when the disk has to move the head; that takes forever, which is why it can take a second to write 1000 files with no data in them… which is basically just a few KB of data in a second,
while it could have done 120000KB sequentially, aka 120MB.

so the reason to go beyond 4k really isn’t a hardware thing, it just helps the hardware work better… and then there’s compression: say you compress a 4k block with lz4 down to 1k… that’s still written to disk in a 4k physical block… so by increasing the “virtual” block size you gain all kinds of advantages that aren’t strictly needed, but are nice to have…

pretty sure NTFS on windows, which is still used on most machines today, works perfectly fine with 4k… because it’s not an advanced file system and doesn’t have all these fancy features that give new interesting options…

other advantage to using bigger blocks is less overhead, and zfs keeps a checksum for every block, which lets it verify data and locate errors. sure, there is the more primitive type of check, the CRC, which the hdd and basically everything inside a computer or on a datacable already uses…

but a CRC can only tell you that the data is bad, like a network does, or anything CRC based does in general.
with the zfs checksum a bad block can be identified and then repaired from redundancy (mirror, raidz or extra copies) and the data reused, which is cool.

ofc all this machinery per block opens up all kinds of fancy stuff… like CoW
Copy on Write is based on a pointer system: basically zfs uses road signs that point to blocks, and it never overwrites data in place. it writes the new data elsewhere and then updates the markers that led to the old location, all the way up to the uberblock, which is the root of the pool.
thus a power loss mid-write just makes you default to the old version of a file, even if it was in the process of being overwritten, which helps reduce corruption.

but like we previously said, eventually one runs into other issues… oracle uses up to 16MB blocksizes
not sure what kind of fancy advantages they reach at those levels… but i cannot imagine my server running it…

some older NAS solutions (NetApp, iirc) ran 520-byte sector sizes instead of 512B, so that they could have room for checksums… most likely also part of what made them popular back then, because checksumming wasn’t as widespread as it is today.

data degrades… more than you might think… 60% of all the issues you’ve had with computers were most likely due to data corruption, then maybe 20% from outside influences and maybe 20% user error…
tho in my case i think we might need to flip the ratios and say 60% user error… is it even an error if it’s partially on purpose? :smiley:

copy on write

with zfs configured correctly you will never need to worry about defragmentation, it’s an artifact of how old systems dealt with data storage.

database defragmentation is not the same as defragmentation on disk. And it wouldn’t last long either, only until the next change, then the changed blocks will be somewhere else and the DB starts getting fragmented again.
(But due to the COW - copy-on-write - the old block will still be there until removed eventually. But COW is actually something that’s good in this case, it prevents DB corruptions because if the changes can’t be written correctly to disk, zfs just uses the old block and the DB will be fine.)

1 Like

looks like we might have read a similar article. I’d just like to add that COW actually only copies the modified block, not the whole file (well, unless you modify the whole file or the file only uses one block).
COW fragments files heavily, but zfs only syncs changes to the HDD every 5 (or 10?) seconds, which reduces some of that fragmentation. DB writes however are synchronous and get written to disk immediately, therefore a SLOG is immensely helpful as the changes get written to the SLOG and kept in RAM until the next regular flush/sync to the HDD. This also shows how important ECC RAM is with zfs, but it works fine without it too (and I don’t have ECC RAM either…).
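That flush interval is the transaction group timeout; on Linux/OpenZFS it is exposed as a module parameter if you want to check or tune it:

cat /sys/module/zfs/parameters/zfs_txg_timeout    # defaults to 5 (seconds)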

ECC just makes errors much less likely, only redundancy truly solves the issue, if anything really does solve those kinds of issues…

i read and watched a lot of lectures on zfs, still lots to learn tho…
yeah i suspect the zil concept exists for the exact same reason as the slog… and that’s why zfs even bothers with that nonsense…

when one starts out and learns about the zil… it seems quite stupid to write stuff twice to the disk…
zfs doesn’t write like regular file systems either… it will write into the best suited space on the disk… and because it writes in segments of 5 sec of data, then when one does the math…

a regular file system might in worst cases write let’s say 500 iops, which might be optimistic in the example, but it’s nice for the math.

where it fills 500 x 4k holes on the hdd if heavily fragmented, that would basically make it do 500 times in a second what zfs does every 5 seconds… so it could be 2500 times faster at fragmenting its drive… so in 1 day you can fragment a regular NTFS file system as much as zfs would in about 7 years.

maybe if you ran a zfs pool at + 90% capacity for a long time it might create a cascade like effect … not sure…

but from how i understand the base concept it should be near impossible for zfs to fragment in the way that other file systems do… which is why there isn’t any defragmentation…

not saying you cannot fragment your drive… but i’m saying it’s more user error… i’m not sure where the max is tho… i suspect it would be past the 90% capacity and it would be like going through the ice on a frozen lake… everything seems to go smoothly… maybe a bit of cracking… and then

immediate cascade effect from when it has so little space that one write gets put into many places. ofc it should correct itself if one gets back down to like 80% capacity i guess… the pros seem to say… don’t use zfs pools over 80% capacity…

i would now guess that point would be around there… or most likely higher… but the lower it goes the better the odds for recovery… so 80% should recover your pool from fragmentation, would be my guess…
people keep saying the 80%, but i have yet to see anyone actually explain why… and i know people have tested past 90% without any detrimental effects outside the normal scope…

but like i say, it will most likely be a very dramatic point when the pool breaks into true fragmentation.

yeah… well it’s complicated… not sure if zfs runs checksums on the file level too… but yeah, at the very least the records / blocks will be CoW…
but then you would get the fragmentation kinda bad, wouldn’t you…
if you make a change in a file you’d want the entire file in the same place, and thus you would need to copy the old file with the new change added into one sequential record series, so that you can read it easily next time… else you will need to jump around…

so tho i suppose it’s possible… and might happen depending on priority and what not… i don’t think zfs will change a single record and write that… maybe it could not delete the old file and if you often change it, then it will simply jump back and forth…

and ofc the records also doesn’t have to be in the right sequence… just so long as they are in sequence…
well zfs is very advanced, the best we can do is work with some approximations that are wildly wrong but explain its complex behavior in simple ways; even most of the developers admit they have no clue how it all works… they find a thing they want to improve and then they try to implement it.
sometimes it takes them like 2-3 years or more to do one of those upgrades…

so yeah you are right… but zfs shouldn’t fragment and i guess it may have some magical ways of fixing the whole rewrite of files to ward against fragmentation…

i really like the idea of just adding the records in a row… and then maybe the second file’s record sequence in the same row with some room in between… then you can add a record to either when needed and remove the records that get outdated… and since the order of the records doesn’t matter, just that the file’s records are in sequence, everything is sort of right…

not sure how the system manages disorganized data tho… but i suppose once it hits ram it doesn’t really matter anymore, because we go from ms to ns and the order is basically irrelevant since we can reorder it basically at will.

ahem- >.>

So it goes with the premise that you’re operating the filesystem outside of recommended spec/best practices. Generally <=80% is said to ensure there are enough contiguous free-space sections to let the CoW nature of ZFS work best: it doesn’t have to fragment data blocks and has many holes to use that may be in more performant sections of the disk. As well, once the new “version” of the file is durable, the old block is ready for GC if there aren’t snapshots associated with the now-old version (don’t mix snapshots and Storj’s data store, fyi).

You’re definitely right about that!

Just for fun:

NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storj1  7.27T  6.84T   431G        -         -    16%    94%  1.00x    ONLINE  -
storj2  7.27T  3.76T  3.50T        -         -     9%    51%  1.00x    ONLINE  -
storj4  7.27T  2.99T  4.28T        -         -     8%    41%  1.00x    ONLINE  -

I copied my old node from a raidz1 onto the disk and was glad that it even fit onto it :smiley: It was actually a little bit bigger directly after the move. But as it was one rsync operation, my fragmentation is way better than SGC’s numbers.
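For anyone wanting to check their own pools, the FRAG column is also available as a pool property:

sudo zpool get fragmentation storj1 storj2 storj4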