In this screenshot you can see a difference in IOPS depending on the recordsize. It's also interesting that the backlog increases visibly with a 1M recordsize.
The real effect on the node, however, is not so obvious.
to my understanding it's either IO or bandwidth… the issue with big recordsizes is that if you load 1000 files from a 1M-recordsize dataset into cache, then in some cases each one will actually take up 1M and thus limit the throughput or max capacity… so say you are checking metadata… does each metadata record then take up 1MB in memory…
i saw one of the zfs developers talk about this in a lecture… i can see if i can find it sometime. you can do large recordsizes, but it's by far not always advantageous… just like tiny recordsizes are not always advantageous either… it depends a lot on the workload you are doing…
you want it to run somewhere sensible… 16k seems low to me, but i'm testing it because i would have changed it, but for some reason i didn't, or it affected my disk latency when i was working on the problematic drive yesterday. just earlier i did turn it back up, while resilvering i think it was… and my latency immediately went up… so i turned it back down to 16k from 64k
my main worry with 16k is how much space i will end up using for checksums… because i cannot remember exactly how big the zfs checksums are… and i don't think the checksums change depending on the recordsize… so with 16k vs 128k you will have 8 times the checksum data… could become significant…
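To put a rough number on that worry: as far as I know the checksum lives inside the block pointer that ZFS keeps for every record (a block pointer is on the order of 128 bytes and holds a 256-bit checksum), so the overhead scales with the number of records rather than their size. A quick back-of-the-envelope sketch, treating 128 bytes per record as an approximation and ignoring indirect-block and dnode details:

```
# Rough per-record metadata overhead at different recordsizes.
# Assumption: ~128 bytes of block pointer (which holds the checksum) per record.
BLKPTR_BYTES = 128

for rs_kib in (16, 32, 64, 128, 1024):
    records_per_tib = (1 << 40) // (rs_kib * 1024)
    overhead_gib = records_per_tib * BLKPTR_BYTES / (1 << 30)
    print(f"{rs_kib:>5}K records: ~{overhead_gib:6.2f} GiB of block pointers per TiB "
          f"({BLKPTR_BYTES / (rs_kib * 1024) * 100:.2f}%)")
```

So at 16k it is still under 1% of the data, but that is indeed about 8 times the ~0.1% you would see at 128k.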
but then it also means everything is chopped into smaller bits… which means the computer can stop, move on to something else, and then come back to what it was doing in less time than it would take to transfer a 128k record
there isn't one right answer, it can depend on your disks, your system, your use case… what you like… though there are some very physical limits at either end of the spectrum, which is why there exists a lockout for going above 1MB recordsizes… you can wreck your performance to an unreasonable degree by tinkering with it…
from 16K to 1M, well, the system would run fine most of the time… and be slow or really slow at some special tasks… while at 16M your system could basically stall so hard it might take years to complete a task that would take a week at 16K
128K has been deemed the generally most balanced point… for throughput and I/O, so that's the middle of the scale… 1M is like 8 times faster at throughput with low IO, but again you run into some hard limits you cannot really push through… well i can do 4000 reads on my zpool… i cannot push through 4000MB, so i stand to benefit little aside from slightly lower IO for moving that throughput…
and so say i move all i can with 650 IO, then i still have over 3k left… so that moved 650 files per sec… but if the files were smaller than 1M then i would still use 1M of bandwidth to send each one over the bus… in most cases… so if i instead was running 128k then i would get 8 times the files per second, so long as they fit inside a 128k record… thus i could move my max IO at about 4k files a sec instead of 650 files…
i think i finally managed to explain it right there… and the same goes for memory… if you want to put 4k files in memory and change them… then you have to uncompress them, thus a record takes up its full size… ofc like krey said earlier those might be variable…
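To turn that files-per-second reasoning into a toy calculation, using the same figures (4000 read IOPS, and roughly 650 MiB/s of bandwidth since ~650 1M records saturate it) and assuming one record read per file with files no bigger than one record:

```
# Toy model of the IOPS-vs-bandwidth trade-off described above.
# Assumed figures, not measurements: 4000 read IOPS, ~650 MiB/s pool bandwidth.
IOPS_BUDGET = 4000        # reads per second
BW_BUDGET_MIB = 650       # pool bandwidth in MiB/s

for rs_kib in (16, 32, 128, 1024):
    bw_limited = BW_BUDGET_MIB * 1024 // rs_kib      # records/s the bus allows
    files_per_sec = min(IOPS_BUDGET, bw_limited)
    limiter = "IOPS" if IOPS_BUDGET < bw_limited else "bandwidth"
    print(f"{rs_kib:>5}K records: ~{files_per_sec:>6} files/s ({limiter}-limited)")
```

Below about 128k the pool stays IOPS-limited (~4000 files/s regardless of record size); at 1M it becomes bandwidth-limited at ~650 files/s, which is exactly the trade-off described above.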
rings a bell… but not sure it ain't just because of compression… xD
anyways, large or small recordsizes all come at a cost… it's just a matter of which one you are willing to pay, or can pay because it doesn't matter to you.
oh and it's most likely why your scrubs take so long… partially anyways…
Ahhh scrubs…
dunno why i dreaded this so much… the scrubbing is going fine… and it's such a nice time to try to performance tune a little…
I checked file sizes on one node.
Lookup directory /mnt/storj/node04/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/
Found 1181529 entries, 44968 ms
Calculating... done 14344 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 8766 [ ] 8766 [ ] 8.56 MiB [ ] 8.56 MiB
1 KiB 2 KiB 1.51 KiB [-- ] 188787 [-- ] 197553 [ ] 278 MiB [ ] 286 MiB
2 KiB 4 KiB 2.84 KiB [- ] 119328 [---- ] 316881 [ ] 331 MiB [ ] 618 MiB
4 KiB 8 KiB 5.67 KiB [ ] 67599 [---- ] 384480 [ ] 374 MiB [ ] 992 MiB
8 KiB 16 KiB 11.5 KiB [ ] 47230 [----- ] 431710 [ ] 530 MiB [ ] 1.49 GiB
16 KiB 32 KiB 23 KiB [ ] 43700 [------ ] 475410 [ ] 982 MiB [ ] 2.45 GiB
32 KiB 64 KiB 46.3 KiB [ ] 43658 [------ ] 519068 [ ] 1.93 GiB [ ] 4.37 GiB
64 KiB 128 KiB 92.1 KiB [ ] 42500 [------- ] 561568 [ ] 3.73 GiB [ ] 8.1 GiB
128 KiB 256 KiB 184 KiB [ ] 41466 [------- ] 603034 [ ] 7.28 GiB [ ] 15.4 GiB
256 KiB 512 KiB 367 KiB [ ] 38379 [-------- ] 641413 [ ] 13.4 GiB [ ] 28.8 GiB
512 KiB 1 MiB 735 KiB [ ] 37034 [-------- ] 678447 [ ] 26 GiB [ ] 54.8 GiB
1 MiB 1.5 MiB 1.24 MiB [ ] 20642 [-------- ] 699089 [ ] 24.9 GiB [- ] 79.7 GiB
1.5 MiB 2 MiB 1.78 MiB [ ] 65221 [--------- ] 764310 [- ] 114 GiB [-- ] 193 GiB
2 MiB 2.5 MiB 2.21 MiB [----- ] 416195 [---------------] 1180505 [------------ ] 898 GiB [---------------] 1.07 TiB
Total 1180505 files 1.07 TiB bytes, avg size 487 KiB
Lookup directory /mnt/storj/node04/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/
Found 957628 entries, 6479 ms
Calculating... done 17238 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 35 [ ] 35 [ ] 35 KiB [ ] 35 KiB
1 KiB 2 KiB 1.25 KiB [ ] 4 [ ] 39 [ ] 5 KiB [ ] 40 KiB
256 KiB 512 KiB 355 KiB [ ] 1 [ ] 40 [ ] 355 KiB [ ] 395 KiB
512 KiB 1 MiB 1021 KiB [- ] 114949 [- ] 114989 [ ] 112 GiB [ ] 112 GiB
1.5 MiB 2 MiB 1.8 MiB [ ] 4212 [- ] 119201 [ ] 7.4 GiB [ ] 119 GiB
2 MiB 2.5 MiB 2.21 MiB [------------- ] 837403 [---------------] 956604 [-------------- ] 1.77 TiB [---------------] 1.88 TiB
Total 956604 files 1.88 TiB bytes, avg size 914 KiB
Lookup directory /mnt/storj/node04/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa/
Found 595068 entries, 25135 ms
Calculating... done 10055 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 1 [ ] 1 [ ] 1 KiB [ ] 1 KiB
1 KiB 2 KiB 1.29 KiB [ ] 59 [ ] 60 [ ] 76.2 KiB [ ] 77.2 KiB
2 KiB 4 KiB 2.85 KiB [ ] 17 [ ] 77 [ ] 48.5 KiB [ ] 126 KiB
4 KiB 8 KiB 5.47 KiB [ ] 20 [ ] 97 [ ] 110 KiB [ ] 235 KiB
8 KiB 16 KiB 11.4 KiB [ ] 24 [ ] 121 [ ] 272 KiB [ ] 508 KiB
16 KiB 32 KiB 26.6 KiB [ ] 136 [ ] 257 [ ] 3.53 MiB [ ] 4.03 MiB
32 KiB 64 KiB 38.8 KiB [ ] 103 [ ] 360 [ ] 3.91 MiB [ ] 7.94 MiB
64 KiB 128 KiB 95.4 KiB [ ] 94 [ ] 454 [ ] 8.76 MiB [ ] 16.7 MiB
128 KiB 256 KiB 194 KiB [ ] 153 [ ] 607 [ ] 29 MiB [ ] 45.7 MiB
256 KiB 512 KiB 379 KiB [ ] 261 [ ] 868 [ ] 96.7 MiB [ ] 142 MiB
512 KiB 1 MiB 765 KiB [ ] 413 [ ] 1281 [ ] 309 MiB [ ] 451 MiB
1 MiB 1.5 MiB 1.25 MiB [ ] 493 [ ] 1774 [ ] 616 MiB [ ] 1.04 GiB
1.5 MiB 2 MiB 1.76 MiB [ ] 544 [ ] 2318 [ ] 960 MiB [ ] 1.98 GiB
2 MiB 2.5 MiB 2.21 MiB [-------------- ] 591725 [-------------- ] 594043 [-------------- ] 1.25 TiB [-------------- ] 1.25 TiB
4 MiB 8 MiB 6.41 MiB [ ] 1 [---------------] 594044 [ ] 6.41 MiB [---------------] 1.25 TiB
Total 594044 files 1.25 TiB bytes, avg size 896 KiB
Lookup directory /mnt/storj/node04/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/
Found 1671139 entries, 63686 ms
Calculating... done 27637 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 10433 [ ] 10433 [ ] 10.2 MiB [ ] 10.2 MiB
1 KiB 2 KiB 1.25 KiB [ ] 8733 [ ] 19166 [ ] 10.7 MiB [ ] 20.9 MiB
2 KiB 4 KiB 3.24 KiB [ ] 1165 [ ] 20331 [ ] 3.69 MiB [ ] 24.6 MiB
4 KiB 8 KiB 5.54 KiB [ ] 7 [ ] 20338 [ ] 38.8 KiB [ ] 24.6 MiB
8 KiB 16 KiB 12.2 KiB [ ] 3 [ ] 20341 [ ] 36.8 KiB [ ] 24.6 MiB
16 KiB 32 KiB 19.7 KiB [ ] 6 [ ] 20347 [ ] 118 KiB [ ] 24.7 MiB
32 KiB 64 KiB 34.8 KiB [ ] 1 [ ] 20348 [ ] 34.8 KiB [ ] 24.8 MiB
64 KiB 128 KiB 115 KiB [ ] 1 [ ] 20349 [ ] 115 KiB [ ] 24.9 MiB
128 KiB 256 KiB 170 KiB [ ] 40031 [ ] 60380 [ ] 6.47 GiB [ ] 6.5 GiB
256 KiB 512 KiB 353 KiB [ ] 7 [ ] 60387 [ ] 2.41 MiB [ ] 6.5 GiB
512 KiB 1 MiB 1021 KiB [- ] 197183 [-- ] 257570 [ ] 192 GiB [ ] 199 GiB
1 MiB 1.5 MiB 1.17 MiB [ ] 1 [-- ] 257571 [ ] 1.17 MiB [ ] 199 GiB
1.5 MiB 2 MiB 1.99 MiB [ ] 1 [-- ] 257572 [ ] 1.99 MiB [ ] 199 GiB
2 MiB 2.5 MiB 2.21 MiB [------------ ] 1412503 [-------------- ] 1670075 [-------------- ] 2.98 TiB [-------------- ] 3.17 TiB
4 MiB 8 MiB 7.23 MiB [ ] 4 [-------------- ] 1670079 [ ] 28.9 MiB [-------------- ] 3.17 TiB
16 MiB 32 MiB 16 MiB [ ] 36 [---------------] 1670115 [ ] 577 MiB [---------------] 3.17 TiB
Total 1670115 files 3.17 TiB bytes, avg size 1.9 MiB
Lookup directory /mnt/storj/node04/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/
Found 585993 entries, 23223 ms
Calculating... done 6048 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 7121 [ ] 7121 [ ] 6.95 MiB [ ] 6.95 MiB
1 KiB 2 KiB 1.43 KiB [-- ] 114899 [--- ] 122020 [ ] 161 MiB [ ] 168 MiB
2 KiB 4 KiB 2.85 KiB [- ] 60139 [---- ] 182159 [ ] 168 MiB [ ] 336 MiB
4 KiB 8 KiB 5.74 KiB [- ] 52799 [------ ] 234958 [ ] 296 MiB [ ] 632 MiB
8 KiB 16 KiB 11.4 KiB [- ] 39081 [------- ] 274039 [ ] 436 MiB [ ] 1.04 GiB
16 KiB 32 KiB 23 KiB [ ] 35161 [------- ] 309200 [ ] 791 MiB [ ] 1.81 GiB
32 KiB 64 KiB 46.2 KiB [ ] 34443 [-------- ] 343643 [ ] 1.52 GiB [ ] 3.33 GiB
64 KiB 128 KiB 92.1 KiB [ ] 33563 [--------- ] 377206 [ ] 2.95 GiB [ ] 6.28 GiB
128 KiB 256 KiB 185 KiB [ ] 32866 [---------- ] 410072 [ ] 5.79 GiB [ ] 12.1 GiB
256 KiB 512 KiB 367 KiB [ ] 31569 [----------- ] 441641 [ ] 11.1 GiB [- ] 23.1 GiB
512 KiB 1 MiB 735 KiB [ ] 29400 [------------ ] 471041 [- ] 20.6 GiB [-- ] 43.7 GiB
1 MiB 1.5 MiB 1.23 MiB [ ] 16076 [------------ ] 487117 [- ] 19.4 GiB [--- ] 63.1 GiB
1.5 MiB 2 MiB 1.73 MiB [ ] 10939 [------------ ] 498056 [- ] 18.5 GiB [---- ] 81.6 GiB
2 MiB 2.5 MiB 2.21 MiB [-- ] 86913 [---------------] 584969 [---------- ] 187 GiB [---------------] 269 GiB
Total 584969 files 269 GiB bytes, avg size 483 KiB
Lookup directory /mnt/storj/node04/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/
Found 602766 entries, 23144 ms
Calculating... done 6097 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
0 512 b 512 b [ ] 1 [ ] 1 [ ] 512 b [ ] 512 b
512 b 1 KiB 1021 b [ ] 7123 [ ] 7124 [ ] 6.94 MiB [ ] 6.94 MiB
1 KiB 2 KiB 1.45 KiB [-- ] 103511 [-- ] 110635 [ ] 147 MiB [ ] 154 MiB
2 KiB 4 KiB 2.86 KiB [- ] 53669 [---- ] 164304 [ ] 150 MiB [ ] 304 MiB
4 KiB 8 KiB 5.75 KiB [- ] 46882 [----- ] 211186 [ ] 264 MiB [ ] 567 MiB
8 KiB 16 KiB 11.5 KiB [ ] 34890 [------ ] 246076 [ ] 391 MiB [ ] 959 MiB
16 KiB 32 KiB 23 KiB [ ] 31382 [------ ] 277458 [ ] 705 MiB [ ] 1.62 GiB
32 KiB 64 KiB 46.3 KiB [ ] 31098 [------- ] 308556 [ ] 1.37 GiB [ ] 3 GiB
64 KiB 128 KiB 90.9 KiB [ ] 31795 [-------- ] 340351 [ ] 2.76 GiB [ ] 5.75 GiB
128 KiB 256 KiB 207 KiB [- ] 75418 [---------- ] 415769 [ ] 14.9 GiB [- ] 20.7 GiB
256 KiB 512 KiB 331 KiB [- ] 59647 [----------- ] 475416 [- ] 18.8 GiB [-- ] 39.5 GiB
512 KiB 1 MiB 735 KiB [ ] 25836 [------------ ] 501252 [- ] 18.1 GiB [--- ] 57.6 GiB
1 MiB 1.5 MiB 1.23 MiB [ ] 14018 [------------ ] 515270 [ ] 16.9 GiB [---- ] 74.5 GiB
1.5 MiB 2 MiB 1.73 MiB [ ] 9704 [------------- ] 524974 [ ] 16.4 GiB [----- ] 90.9 GiB
2 MiB 2.5 MiB 2.21 MiB [- ] 76768 [---------------] 601742 [--------- ] 165 GiB [---------------] 256 GiB
Total 601742 files 256 GiB bytes, avg size 450 KiB
so i don't see a benefit in optimizing storage for KiB-sized files. Most of the used storage sits in the 2.21 MiB bucket on all satellites.
Look at how the Accumulated size bars grow.
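For reference, here is a rough equivalent of the scan shown above: a minimal sketch that walks the blobs directory (the path from the listings above, adjust it for your node) and buckets files by size.

```
import os
from collections import Counter

ROOT = "/mnt/storj/node04/storage/blobs"   # the path scanned above; adjust for your node

counts, sizes = Counter(), Counter()
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        try:
            size = os.path.getsize(os.path.join(dirpath, name))
        except OSError:
            continue                                  # a piece may vanish mid-scan
        bucket = 1 << max(size - 1, 1).bit_length()   # next power of two >= size
        counts[bucket] += 1
        sizes[bucket] += size

for bucket in sorted(counts):
    print(f"<= {bucket / 1024:>8.0f} KiB: {counts[bucket]:>9} files, "
          f"{sizes[bucket] / 2**30:8.2f} GiB")
```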
i've decided to stick with a 32k recordsize for the rest of the month, see how that goes… i like the node performance i get while scrubbing on 32k records… but who knows… takes a long time to really get an idea… started at 128k (the default), ran 256k for a while because i thought i was smart… xD but it didn't really agree with my system, ate too much memory and slowed stuff down… then i switched to 16k for, i dunno, 3-4 days, which seemed to run really well, but i didn't quite see the internal system bandwidth i wanted while scrubbing and resilvering, so i jumped to 64k for 20-odd minutes, which made everything grind along much slower… so i set it at 32k and got excellent results on all fronts… so at least for now that's my recordsize of choice… but again this can vary a lot even from system to system…
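For anyone who wants to repeat the experiment, here is a minimal sketch of setting and reading back the recordsize with the standard zfs CLI, wrapped in Python; the dataset name is a placeholder, and the change only applies to newly written data (see below).

```
import subprocess

DATASET = "tank/storj"   # placeholder, use your own pool/dataset name

def zfs(*args):
    """Run a zfs command and return its trimmed stdout."""
    return subprocess.run(["zfs", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

zfs("set", "recordsize=32K", DATASET)                            # affects new writes only
print(zfs("get", "-H", "-o", "value", "recordsize", DATASET))    # should print 32K
```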
matching the recordsize directly to file size i don't think is a good idea… there are many considerations in selecting blocksizes / recordsizes.
If we are lucky we might never have to answer that question. I shared my observation with the developer team today. My previous setup was a QNAP with no option to install a write cache. What is needed to get the best performance out of these boxes? Bigger pieces would help. A long time ago we started with a segment size of 64MB. That is where the 2MB pieces we have on disk come from. Now it is time to evaluate what we have on disk and what our target is. I don't have details yet, but it looks like we want to write bigger pieces to disk.
What happens if I increase my recordsize later? Does that work?
Yes, it will work, but only for new data
only for new writes, at least until you copy the entire dataset. I did this for each entry in my pivot table. It took a long time.
Another suggestion is enabling compression (lz4). I understand that we have encrypted data, but we are hunting for something else: gaps in writes that slow down reading in the future. After this procedure you should rewrite all the data on the disk (copying it to another folder is enough).
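A minimal sketch of that rewrite step, assuming the node is stopped while it runs and there is enough free space for a temporary copy of the largest file; each file is copied and swapped back into place, so it gets rewritten with the current recordsize and compression settings.

```
import os
import shutil

ROOT = "/mnt/storj/node04/storage/blobs"   # path from the scans above; stop the node first

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        src = os.path.join(dirpath, name)
        tmp = src + ".rewrite"
        shutil.copy2(src, tmp)    # new copy is written with the current recordsize/compression
        os.replace(tmp, src)      # atomically swap it over the original
```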
this is the only right possible decision. Keep in mind also that small records only really make sense on a single disk; 8k records will take up many times more space on a 3+1 or wider raidz (see the sketch below). And raid is not only for redundancy, but also for combining several disks into one space.
It would also be useful to add a config parameter for the block size, separate for databases and data. It seems to me that storj is not reading files optimally.
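To put rough numbers on the space overhead of small records on raidz, here is a sketch of the allocation rule as I understand it (each record gets one parity sector per row of data sectors and is then padded up to a multiple of parity+1), assuming 4K sectors (ashift=12) and no compression; treat the output as approximate.

```
import math

def raidz_alloc_sectors(psize, ndisks, nparity, ashift=12):
    """Sectors a raidz vdev allocates for one record of `psize` bytes (approximation)."""
    sector = 1 << ashift
    data = math.ceil(psize / sector)
    parity = math.ceil(data / (ndisks - nparity)) * nparity   # one parity sector per row
    total = data + parity
    pad_to = nparity + 1                                      # pad to a multiple of parity+1
    return math.ceil(total / pad_to) * pad_to

for ndisks in (4, 5):                      # 3+1 and 4+1 raidz1
    for rs in (8, 16, 32, 128):            # recordsize in KiB
        alloc = raidz_alloc_sectors(rs * 1024, ndisks, nparity=1)
        print(f"raidz1 {ndisks - 1}+1, {rs:>3}K record: "
              f"{alloc * 4}K allocated ({alloc * 4 / rs:.2f}x raw)")
```

On a 3+1 raidz1 an 8K record ends up occupying 16K of raw space (2x), while a 128K record costs roughly 1.4x, which is why small records waste disproportionally more space on raidz.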
Please use the right terminology for ZFS. Neither the txg pool, the ZIL, nor the SLOG is a write cache. They have completely different functionality compared with a write cache. Incorrect terms confuse and lead to incorrect conclusions.
And yet, it behaves like a write cache does on a hardware RAID controller.
While zfs has its own names for a lot of things, I still say that I use a RAID6 array instead of a raidz2 pool. 1) It is essentially the same thing under a different name, 2) it is easier to understand for people who are not familiar with zfs.
ZIL and SLOG are only for sync writes. A write cache is for all writes.
The txg pool can combine transactions into one zfs record.
RAID6 rebuilds all space, including free space. Raidz2 does not.
RAID6 has a fixed stripe width; raidz2 has a variable recordsize/blocksize.
A RAID controller knows nothing about files. Raidz knows about the transactional filesystem above the block devices.
That's why different terms are used to describe different entities.
Both have two parity drives, which is usually the more important distinction from other types (RAID1, RAID5 etc).
I can either say "I use a raid6 array" or I can say "I use zfs raidz2, which is the same as a raid6 array - two parity drives, but has variable record size and only needs to rebuild the used data". The second is more technically correct, but the first one is good enough.
RAID controller cache is mostly used for sync writes as well, since async writes first go to system memory. The difference is that async writes also end up on the cache when they are finally written to disk, but when we are talking about the performance impact, sync writes matter more.
So, yes, it is not exactly the same, but it is similar enough to be able to talk with someone and not have to explain how zfs works in detail.
Yes, this is a feature that zfs has and a RAID controller doesn't.
I am not planning on getting a certification. I am happy that it works and will stick with the easy term "write cache". That way I can later switch to maybe BTRFS and not need to care if they use a different term for the same functionality.
this is why i am for using terms properly. Neither RAID6 nor raidz2 has parity drives. RAID3 and RAID4 do.
With your approach you are only playing a trick on yourself. You are neither the first nor the last, unfortunately.
yeah, that came to mind when i decided i should go with 32k because of my 8k zvol blocksize; that way a 32k record stripes across my 4 data drives plus 1 for the parity, so that should fit nicely for my system…
even if 16k records seemed to give me the best performance on the node… because it kinda dropped a bit after i changed it, but i bet that's just temporary flux.
however i dunno how that would work between the vdevs… because i'm running an uneven set of vdevs… 4 drives in one and 5 in the other, both in raidz1
a bit of a compromise performance-wise, i know i should run 5 and 5 but i only had 4 of those drives…
from what i can remember vdevs load balance between each other… so essentially the bigger vdev is optimized for the 32k recordsize, and on the other one, well, every 3 full records make 4 stripe rows across the raidz1 on 4 drives, which should be fine to my understanding.
this stuff isn't a science… lol it's a f'king art form lol
Voodoo i tell yee…
And there is no difference when talking about disk failures. The dedicated-parity vs distributed-parity difference is in performance, but both tolerate the same number of failed drives. And since nobody uses RAID3 and 4 anymore, it doesn't matter, because performance comparisons will be made with RAID10, 50, 60 etc.
And again, "two parity drives" is shorter than "distributed parity to tolerate two drive failures" and just as good.
Did you use a zvol or datasets? There can be only one volblocksize for a zvol, or one recordsize for a dataset.
this only adds some padding overhead (it affects space-usage efficiency)
A raidz1, say 3+1 or 4+1, is ONE vdev. Technically, the more devices in a raidz the faster the vdev (on sequential reads), but this does not work with small files. ZFS prefetch algorithms are optimized for files of 8+ MB.
For random writes a raidz(1,2,3) vdev is about as fast as a single drive. For sequential writes it is faster.
ZFS, as a transactional filesystem, does not have random writes at all.
For reads, yes. But the storagenode reads small files sequentially.