In this screenshot you can see a difference in IOPS depending on the recordsize. It's also interesting that the backlog increases visibly with a 1M recordsize.
The real effect on the node, however, is not so obvious.
to my understanding it's either IO or bandwidth… the issue with big recordsizes is that if you load 1000 files from a 1M-recordsize dataset into cache, then in some cases each one will actually take up 1M and thus limit the throughput or max capacity… so say you are checking metadata… does each metadata record then take up 1MB in memory…
i saw one of the zfs developers talk about this in a lecture… i can see if i can find it sometime. you can do large recordsizes, but it's by far not always advantageous… just like tiny recordsizes are not always advantageous either… it depends a lot on the workload you are doing…
you want it to run somewhere sensible… 16k seems low to me, but i'm testing it because i would have changed it, but for some reason i didn't, or it affected my disk latency when i was working on the problematic drive yesterday. just earlier i did turn it back up, while resilvering i think it was… and my latency immediately went up… so i turned it back down to 16k from 64k
my main worry with 16k is how much space i will end up using for checksums… because i cannot remember exactly how big the zfs checksums are… and i don't think the checksums change depending on the recordsize… so with 16k vs 128k you will have 8 times the checksum data… could become significant…
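To put a rough number on that worry: as far as I know the checksum lives inside the block pointer that ZFS keeps for every record (a block pointer is on the order of 128 bytes and holds a 256-bit checksum), so the overhead scales with the number of records rather than their size. A quick back-of-the-envelope sketch, treating 128 bytes per record as an approximation and ignoring indirect-block and dnode details:

```
# Rough per-record metadata overhead at different recordsizes.
# Assumption: ~128 bytes of block pointer (which holds the checksum) per record.
BLKPTR_BYTES = 128

for rs_kib in (16, 32, 64, 128, 1024):
    records_per_tib = (1 << 40) // (rs_kib * 1024)
    overhead_gib = records_per_tib * BLKPTR_BYTES / (1 << 30)
    print(f"{rs_kib:>5}K records: ~{overhead_gib:6.2f} GiB of block pointers per TiB "
          f"({BLKPTR_BYTES / (rs_kib * 1024) * 100:.2f}%)")
```

So at 16k it is still under 1% of the data, but that is indeed about 8 times the ~0.1% you would see at 128k.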
but then it also means everything is chopped into smaller bits… which means the computer can stop, move on to something else, and then come back to what it was doing in less time than it would take to transfer a 128k record
there isn't one right answer, it can depend on your disks, your system, your use case… what you like… though there are some very physical limits at either end of the spectrum, which is why there exists a lockout for going above 1MB recordsizes… you can wreck your performance to an unreasonable degree by tinkering with it…
from 16K to 1M, well, the system would run fine most of the time… and be slow or really slow at some special tasks… while at 16M your system could basically stall so hard it might take years to complete a task that would take a week at 16K
128K has been deemed the generally most balanced point… for throughput and I/O, so that's the middle of the scale… 1M is like 8 times faster at throughput with low IO, but again you run into some hard limits you cannot really push through… well i can do 4000 reads on my zpool… i cannot push through 4000MB, so i stand to benefit little aside from slightly lower IO for moving that throughput…
and so say i move all i can with 650 IO, then i still have over 3k left… so that moved 650 files per sec… but if the files were smaller than 1M then i would still use 1M of bandwidth to send each one over the bus… in most cases… so if i instead was running 128k then i would get 8 times the files per second, so long as they fit inside a 128k record… thus i could move my max IO at about 4k files a sec instead of 650 files…
i think i finally managed to explain it right there… and the same goes for memory… if you want to put 4k files in memory and change them… then you have to uncompress them, thus a record takes up its full size… ofc like krey said earlier those might be variable…
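To turn that files-per-second reasoning into a toy calculation, using the same figures (4000 read IOPS, and roughly 650 MiB/s of bandwidth since ~650 1M records saturate it) and assuming one record read per file with files no bigger than one record:

```
# Toy model of the IOPS-vs-bandwidth trade-off described above.
# Assumed figures, not measurements: 4000 read IOPS, ~650 MiB/s pool bandwidth.
IOPS_BUDGET = 4000        # reads per second
BW_BUDGET_MIB = 650       # pool bandwidth in MiB/s

for rs_kib in (16, 32, 128, 1024):
    bw_limited = BW_BUDGET_MIB * 1024 // rs_kib      # records/s the bus allows
    files_per_sec = min(IOPS_BUDGET, bw_limited)
    limiter = "IOPS" if IOPS_BUDGET < bw_limited else "bandwidth"
    print(f"{rs_kib:>5}K records: ~{files_per_sec:>6} files/s ({limiter}-limited)")
```

Below about 128k the pool stays IOPS-limited (~4000 files/s regardless of record size); at 1M it becomes bandwidth-limited at ~650 files/s, which is exactly the trade-off described above.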
rings a bell… but not sure it ain't just because of compression… xD
anyways, large or small recordsizes all come at a cost… it's just a matter of which one you are willing to pay, or can pay because it doesn't matter to you.
oh and it's most likely why your scrubs take so long… partially anyways…
Ahhh scrubs…
dunno why i dreaded this so much… the scrubbing is going fine… and it's such a nice time to try to performance tune a little…
I checked file sizes on one node.
Lookup directory /mnt/storj/node04/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/
Found 1181529 entries, 44968 ms
Calculating... done 14344 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 8766 [ ] 8766 [ ] 8.56 MiB [ ] 8.56 MiB
1 KiB 2 KiB 1.51 KiB [-- ] 188787 [-- ] 197553 [ ] 278 MiB [ ] 286 MiB
2 KiB 4 KiB 2.84 KiB [- ] 119328 [---- ] 316881 [ ] 331 MiB [ ] 618 MiB
4 KiB 8 KiB 5.67 KiB [ ] 67599 [---- ] 384480 [ ] 374 MiB [ ] 992 MiB
8 KiB 16 KiB 11.5 KiB [ ] 47230 [----- ] 431710 [ ] 530 MiB [ ] 1.49 GiB
16 KiB 32 KiB 23 KiB [ ] 43700 [------ ] 475410 [ ] 982 MiB [ ] 2.45 GiB
32 KiB 64 KiB 46.3 KiB [ ] 43658 [------ ] 519068 [ ] 1.93 GiB [ ] 4.37 GiB
64 KiB 128 KiB 92.1 KiB [ ] 42500 [------- ] 561568 [ ] 3.73 GiB [ ] 8.1 GiB
128 KiB 256 KiB 184 KiB [ ] 41466 [------- ] 603034 [ ] 7.28 GiB [ ] 15.4 GiB
256 KiB 512 KiB 367 KiB [ ] 38379 [-------- ] 641413 [ ] 13.4 GiB [ ] 28.8 GiB
512 KiB 1 MiB 735 KiB [ ] 37034 [-------- ] 678447 [ ] 26 GiB [ ] 54.8 GiB
1 MiB 1.5 MiB 1.24 MiB [ ] 20642 [-------- ] 699089 [ ] 24.9 GiB [- ] 79.7 GiB
1.5 MiB 2 MiB 1.78 MiB [ ] 65221 [--------- ] 764310 [- ] 114 GiB [-- ] 193 GiB
2 MiB 2.5 MiB 2.21 MiB [----- ] 416195 [---------------] 1180505 [------------ ] 898 GiB [---------------] 1.07 TiB
Total 1180505 files 1.07 TiB bytes, avg size 487 KiB
Lookup directory /mnt/storj/node04/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/
Found 957628 entries, 6479 ms
Calculating... done 17238 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 35 [ ] 35 [ ] 35 KiB [ ] 35 KiB
1 KiB 2 KiB 1.25 KiB [ ] 4 [ ] 39 [ ] 5 KiB [ ] 40 KiB
256 KiB 512 KiB 355 KiB [ ] 1 [ ] 40 [ ] 355 KiB [ ] 395 KiB
512 KiB 1 MiB 1021 KiB [- ] 114949 [- ] 114989 [ ] 112 GiB [ ] 112 GiB
1.5 MiB 2 MiB 1.8 MiB [ ] 4212 [- ] 119201 [ ] 7.4 GiB [ ] 119 GiB
2 MiB 2.5 MiB 2.21 MiB [------------- ] 837403 [---------------] 956604 [-------------- ] 1.77 TiB [---------------] 1.88 TiB
Total 956604 files 1.88 TiB bytes, avg size 914 KiB
Lookup directory /mnt/storj/node04/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa/
Found 595068 entries, 25135 ms
Calculating... done 10055 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 1 [ ] 1 [ ] 1 KiB [ ] 1 KiB
1 KiB 2 KiB 1.29 KiB [ ] 59 [ ] 60 [ ] 76.2 KiB [ ] 77.2 KiB
2 KiB 4 KiB 2.85 KiB [ ] 17 [ ] 77 [ ] 48.5 KiB [ ] 126 KiB
4 KiB 8 KiB 5.47 KiB [ ] 20 [ ] 97 [ ] 110 KiB [ ] 235 KiB
8 KiB 16 KiB 11.4 KiB [ ] 24 [ ] 121 [ ] 272 KiB [ ] 508 KiB
16 KiB 32 KiB 26.6 KiB [ ] 136 [ ] 257 [ ] 3.53 MiB [ ] 4.03 MiB
32 KiB 64 KiB 38.8 KiB [ ] 103 [ ] 360 [ ] 3.91 MiB [ ] 7.94 MiB
64 KiB 128 KiB 95.4 KiB [ ] 94 [ ] 454 [ ] 8.76 MiB [ ] 16.7 MiB
128 KiB 256 KiB 194 KiB [ ] 153 [ ] 607 [ ] 29 MiB [ ] 45.7 MiB
256 KiB 512 KiB 379 KiB [ ] 261 [ ] 868 [ ] 96.7 MiB [ ] 142 MiB
512 KiB 1 MiB 765 KiB [ ] 413 [ ] 1281 [ ] 309 MiB [ ] 451 MiB
1 MiB 1.5 MiB 1.25 MiB [ ] 493 [ ] 1774 [ ] 616 MiB [ ] 1.04 GiB
1.5 MiB 2 MiB 1.76 MiB [ ] 544 [ ] 2318 [ ] 960 MiB [ ] 1.98 GiB
2 MiB 2.5 MiB 2.21 MiB [-------------- ] 591725 [-------------- ] 594043 [-------------- ] 1.25 TiB [-------------- ] 1.25 TiB
4 MiB 8 MiB 6.41 MiB [ ] 1 [---------------] 594044 [ ] 6.41 MiB [---------------] 1.25 TiB
Total 594044 files 1.25 TiB bytes, avg size 896 KiB
Lookup directory /mnt/storj/node04/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/
Found 1671139 entries, 63686 ms
Calculating... done 27637 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 10433 [ ] 10433 [ ] 10.2 MiB [ ] 10.2 MiB
1 KiB 2 KiB 1.25 KiB [ ] 8733 [ ] 19166 [ ] 10.7 MiB [ ] 20.9 MiB
2 KiB 4 KiB 3.24 KiB [ ] 1165 [ ] 20331 [ ] 3.69 MiB [ ] 24.6 MiB
4 KiB 8 KiB 5.54 KiB [ ] 7 [ ] 20338 [ ] 38.8 KiB [ ] 24.6 MiB
8 KiB 16 KiB 12.2 KiB [ ] 3 [ ] 20341 [ ] 36.8 KiB [ ] 24.6 MiB
16 KiB 32 KiB 19.7 KiB [ ] 6 [ ] 20347 [ ] 118 KiB [ ] 24.7 MiB
32 KiB 64 KiB 34.8 KiB [ ] 1 [ ] 20348 [ ] 34.8 KiB [ ] 24.8 MiB
64 KiB 128 KiB 115 KiB [ ] 1 [ ] 20349 [ ] 115 KiB [ ] 24.9 MiB
128 KiB 256 KiB 170 KiB [ ] 40031 [ ] 60380 [ ] 6.47 GiB [ ] 6.5 GiB
256 KiB 512 KiB 353 KiB [ ] 7 [ ] 60387 [ ] 2.41 MiB [ ] 6.5 GiB
512 KiB 1 MiB 1021 KiB [- ] 197183 [-- ] 257570 [ ] 192 GiB [ ] 199 GiB
1 MiB 1.5 MiB 1.17 MiB [ ] 1 [-- ] 257571 [ ] 1.17 MiB [ ] 199 GiB
1.5 MiB 2 MiB 1.99 MiB [ ] 1 [-- ] 257572 [ ] 1.99 MiB [ ] 199 GiB
2 MiB 2.5 MiB 2.21 MiB [------------ ] 1412503 [-------------- ] 1670075 [-------------- ] 2.98 TiB [-------------- ] 3.17 TiB
4 MiB 8 MiB 7.23 MiB [ ] 4 [-------------- ] 1670079 [ ] 28.9 MiB [-------------- ] 3.17 TiB
16 MiB 32 MiB 16 MiB [ ] 36 [---------------] 1670115 [ ] 577 MiB [---------------] 3.17 TiB
Total 1670115 files 3.17 TiB bytes, avg size 1.9 MiB
Lookup directory /mnt/storj/node04/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/
Found 585993 entries, 23223 ms
Calculating... done 6048 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
512 b 1 KiB 1 KiB [ ] 7121 [ ] 7121 [ ] 6.95 MiB [ ] 6.95 MiB
1 KiB 2 KiB 1.43 KiB [-- ] 114899 [--- ] 122020 [ ] 161 MiB [ ] 168 MiB
2 KiB 4 KiB 2.85 KiB [- ] 60139 [---- ] 182159 [ ] 168 MiB [ ] 336 MiB
4 KiB 8 KiB 5.74 KiB [- ] 52799 [------ ] 234958 [ ] 296 MiB [ ] 632 MiB
8 KiB 16 KiB 11.4 KiB [- ] 39081 [------- ] 274039 [ ] 436 MiB [ ] 1.04 GiB
16 KiB 32 KiB 23 KiB [ ] 35161 [------- ] 309200 [ ] 791 MiB [ ] 1.81 GiB
32 KiB 64 KiB 46.2 KiB [ ] 34443 [-------- ] 343643 [ ] 1.52 GiB [ ] 3.33 GiB
64 KiB 128 KiB 92.1 KiB [ ] 33563 [--------- ] 377206 [ ] 2.95 GiB [ ] 6.28 GiB
128 KiB 256 KiB 185 KiB [ ] 32866 [---------- ] 410072 [ ] 5.79 GiB [ ] 12.1 GiB
256 KiB 512 KiB 367 KiB [ ] 31569 [----------- ] 441641 [ ] 11.1 GiB [- ] 23.1 GiB
512 KiB 1 MiB 735 KiB [ ] 29400 [------------ ] 471041 [- ] 20.6 GiB [-- ] 43.7 GiB
1 MiB 1.5 MiB 1.23 MiB [ ] 16076 [------------ ] 487117 [- ] 19.4 GiB [--- ] 63.1 GiB
1.5 MiB 2 MiB 1.73 MiB [ ] 10939 [------------ ] 498056 [- ] 18.5 GiB [---- ] 81.6 GiB
2 MiB 2.5 MiB 2.21 MiB [-- ] 86913 [---------------] 584969 [---------- ] 187 GiB [---------------] 269 GiB
Total 584969 files 269 GiB bytes, avg size 483 KiB
Lookup directory /mnt/storj/node04/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/
Found 602766 entries, 23144 ms
Calculating... done 6097 ms
1024 directories
Size from Size to Avg size File count Accumulated count File size sum Accumulated size
--------- ------- -------- ---------- ----------------- ------------- ----------------
0 512 b 512 b [ ] 1 [ ] 1 [ ] 512 b [ ] 512 b
512 b 1 KiB 1021 b [ ] 7123 [ ] 7124 [ ] 6.94 MiB [ ] 6.94 MiB
1 KiB 2 KiB 1.45 KiB [-- ] 103511 [-- ] 110635 [ ] 147 MiB [ ] 154 MiB
2 KiB 4 KiB 2.86 KiB [- ] 53669 [---- ] 164304 [ ] 150 MiB [ ] 304 MiB
4 KiB 8 KiB 5.75 KiB [- ] 46882 [----- ] 211186 [ ] 264 MiB [ ] 567 MiB
8 KiB 16 KiB 11.5 KiB [ ] 34890 [------ ] 246076 [ ] 391 MiB [ ] 959 MiB
16 KiB 32 KiB 23 KiB [ ] 31382 [------ ] 277458 [ ] 705 MiB [ ] 1.62 GiB
32 KiB 64 KiB 46.3 KiB [ ] 31098 [------- ] 308556 [ ] 1.37 GiB [ ] 3 GiB
64 KiB 128 KiB 90.9 KiB [ ] 31795 [-------- ] 340351 [ ] 2.76 GiB [ ] 5.75 GiB
128 KiB 256 KiB 207 KiB [- ] 75418 [---------- ] 415769 [ ] 14.9 GiB [- ] 20.7 GiB
256 KiB 512 KiB 331 KiB [- ] 59647 [----------- ] 475416 [- ] 18.8 GiB [-- ] 39.5 GiB
512 KiB 1 MiB 735 KiB [ ] 25836 [------------ ] 501252 [- ] 18.1 GiB [--- ] 57.6 GiB
1 MiB 1.5 MiB 1.23 MiB [ ] 14018 [------------ ] 515270 [ ] 16.9 GiB [---- ] 74.5 GiB
1.5 MiB 2 MiB 1.73 MiB [ ] 9704 [------------- ] 524974 [ ] 16.4 GiB [----- ] 90.9 GiB
2 MiB 2.5 MiB 2.21 MiB [- ] 76768 [---------------] 601742 [--------- ] 165 GiB [---------------] 256 GiB
Total 601742 files 256 GiB bytes, avg size 450 KiB
so i don't see a benefit in optimizing storage for KiB-sized files. Most of the used storage sits in the 2.21 MiB bucket on all satellites.
Look at how the Accumulated size bars grow.
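For reference, here is a rough equivalent of the scan shown above: a minimal sketch that walks the blobs directory (the path from the listings above, adjust it for your node) and buckets files by size.

```
import os
from collections import Counter

ROOT = "/mnt/storj/node04/storage/blobs"   # the path scanned above; adjust for your node

counts, sizes = Counter(), Counter()
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        try:
            size = os.path.getsize(os.path.join(dirpath, name))
        except OSError:
            continue                                  # a piece may vanish mid-scan
        bucket = 1 << max(size - 1, 1).bit_length()   # next power of two >= size
        counts[bucket] += 1
        sizes[bucket] += size

for bucket in sorted(counts):
    print(f"<= {bucket / 1024:>8.0f} KiB: {counts[bucket]:>9} files, "
          f"{sizes[bucket] / 2**30:8.2f} GiB")
```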
i've decided to stick with a 32k recordsize for the rest of the month, see how that goes… i like the node performance i get while scrubbing on 32k records… but who knows… takes a long time to really get an idea… started at 128k (the default), ran 256k for a while because i thought i was smart… xD but it didn't really agree with my system, ate too much memory and slowed stuff down… then i switched to 16k for, i dunno, 3-4 days, which seemed to run really well, but i didn't quite see the internal system bandwidth i wanted while scrubbing and resilvering, so i jumped to 64k for 20-odd minutes, which made everything grind along much slower… so i set it at 32k and got excellent results on all fronts… so at least for now that's my recordsize of choice… but again this can vary a lot even from system to system…
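For anyone who wants to repeat the experiment, here is a minimal sketch of setting and reading back the recordsize with the standard zfs CLI, wrapped in Python; the dataset name is a placeholder, and the change only applies to newly written data (see below).

```
import subprocess

DATASET = "tank/storj"   # placeholder, use your own pool/dataset name

def zfs(*args):
    """Run a zfs command and return its trimmed stdout."""
    return subprocess.run(["zfs", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

zfs("set", "recordsize=32K", DATASET)                            # affects new writes only
print(zfs("get", "-H", "-o", "value", "recordsize", DATASET))    # should print 32K
```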
matching the recordsize directly to file size i don't think is a good idea… there are many considerations in selecting blocksizes / recordsizes.
If we are lucky we might never have to answer that question. I shared my observation with the developer team today. My previous setup was a QNAP with no option to install a write cache. What is needed to get the best performance out of these boxes? Bigger pieces would help. A long time ago we started with a segment size of 64MB. That is where the 2MB pieces we have on disk come from. Now it is time to evaluate what we have on disk and what our target is. I don't have details yet, but it looks like we want to write bigger pieces to disk.
What happens if I increase my recordsize later? Does that work?
Yes, it will work, but only for new data
only for new writes, at least until you copy the entire dataset. I did this for each entry in my pivot table. It took a long time.
Another suggestion is enabling compression (lz4). I understand that we have encrypted data, but we are hunting for something else: gaps in writes that slow down reading in the future. After this procedure you should rewrite all the data on the disk (copying it to another folder is enough).
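A minimal sketch of that rewrite step, assuming the node is stopped while it runs and there is enough free space for a temporary copy of the largest file; each file is copied and swapped back into place, so it gets rewritten with the current recordsize and compression settings.

```
import os
import shutil

ROOT = "/mnt/storj/node04/storage/blobs"   # path from the scans above; stop the node first

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        src = os.path.join(dirpath, name)
        tmp = src + ".rewrite"
        shutil.copy2(src, tmp)    # new copy is written with the current recordsize/compression
        os.replace(tmp, src)      # atomically swap it over the original
```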
this is the only right possible decision. Keep in mind also that small records only really make sense on a single disk; 8k records will take up many times more space on a 3+1 or wider raidz (see the sketch below). And raid is not only for redundancy, but also for combining several disks into one space.
It would also be useful to add a config parameter for the block size, separate for databases and data. It seems to me that storj is not reading files optimally.
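To put rough numbers on the space overhead of small records on raidz, here is a sketch of the allocation rule as I understand it (each record gets one parity sector per row of data sectors and is then padded up to a multiple of parity+1), assuming 4K sectors (ashift=12) and no compression; treat the output as approximate.

```
import math

def raidz_alloc_sectors(psize, ndisks, nparity, ashift=12):
    """Sectors a raidz vdev allocates for one record of `psize` bytes (approximation)."""
    sector = 1 << ashift
    data = math.ceil(psize / sector)
    parity = math.ceil(data / (ndisks - nparity)) * nparity   # one parity sector per row
    total = data + parity
    pad_to = nparity + 1                                      # pad to a multiple of parity+1
    return math.ceil(total / pad_to) * pad_to

for ndisks in (4, 5):                      # 3+1 and 4+1 raidz1
    for rs in (8, 16, 32, 128):            # recordsize in KiB
        alloc = raidz_alloc_sectors(rs * 1024, ndisks, nparity=1)
        print(f"raidz1 {ndisks - 1}+1, {rs:>3}K record: "
              f"{alloc * 4}K allocated ({alloc * 4 / rs:.2f}x raw)")
```

On a 3+1 raidz1 an 8K record ends up occupying 16K of raw space (2x), while a 128K record costs roughly 1.4x, which is why small records waste disproportionally more space on raidz.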
Please use the right terminology for ZFS. Neither the txg pool, the ZIL, nor the SLOG is a write cache. They have completely different functionality compared with a write cache. Incorrect terms confuse and lead to incorrect conclusions.
And yet, it behaves like a write cache does on a hardware RAID controller.
While zfs has its own names for a lot of things, I still say that I use a RAID6 array instead of a raidz2 pool. 1) It is essentially the same thing under a different name, 2) it is easier to understand for people who are not familiar with zfs.
ZIL and SLOG are only for sync writes. A write cache is for all writes.
The txg pool can combine transactions into one zfs record.
RAID6 rebuilds all space, including free space. Raidz2 does not.
RAID6 has a fixed stripe width; raidz2 has a variable recordsize/blocksize.
A RAID controller knows nothing about files. Raidz knows about the transactional filesystem above the block devices.
That's why different terms are used to describe different entities.
Both have two parity drives, which is usually the more important distinction from other types (RAID1, RAID5 etc).
I can either say "I use a raid6 array" or I can say "I use zfs raidz2, which is the same as a raid6 array - two parity drives, but has variable record size and only needs to rebuild the used data". The second is more technically correct, but the first one is good enough.
RAID controller cache is mostly used for sync writes as well, since async writes first go to system memory. The difference is that async writes also end up on the cache when they are finally written to disk, but when we are talking about the performance impact, sync writes matter more.
So, yes, it is not exactly the same, but it is similar enough to be able to talk with someone and not have to explain how zfs works in detail.
Yes, this is a feature that zfs has and a RAID controller doesn't.
I am not planning on getting a certification. I am happy that it works and will stick with the easy term "write cache". That way I can later switch to maybe BTRFS and not need to care if they use a different term for the same functionality.
this is why i am for using terms properly. Neither RAID6 nor raidz2 has parity drives. RAID3 and RAID4 do.
With your approach you are only playing a trick on yourself. You are neither the first nor the last, unfortunately.
yeah, that came to mind when i decided i should go with 32k because of my 8k zvol blocksize; that way a 32k record stripes across my 4 data drives plus 1 for the parity, so that should fit nicely for my system…
even if 16k records seemed to give me the best performance on the node… because it kinda dropped a bit after i changed it, but i bet that's just temporary flux.
however i dunno how that would work between the vdevs… because i'm running an uneven set of vdevs… 4 drives in one and 5 in the other, both in raidz1
a bit of a compromise performance-wise, i know i should run 5 and 5 but i only had 4 of those drives…
from what i can remember vdevs load balance between each other… so essentially the bigger vdev is optimized for the 32k recordsize, and on the other one, well, every 3 full records make 4 stripe rows across the raidz1 on 4 drives, which should be fine to my understanding.
this stuff isn't a science… lol it's a f'king art form lol
Voodoo i tell yee…
And there is no difference when talking about disk failures. The dedicated-parity vs distributed-parity difference is in performance, but both tolerate the same number of failed drives. And since nobody uses RAID3 and 4 anymore, it doesn't matter, because performance comparisons will be made with RAID10, 50, 60 etc.
And again, "two parity drives" is shorter than "distributed parity to tolerate two drive failures" and just as good.
Did you use a zvol or datasets? There can be only one volblocksize for a zvol, or one recordsize for a dataset.
this only adds some padding overhead (it affects space-usage efficiency)
A raidz1, say 3+1 or 4+1, is ONE vdev. Technically, the more devices in a raidz the faster the vdev (on sequential reads), but this does not work with small files. ZFS prefetch algorithms are optimized for files of 8+ MB.
For random writes a raidz(1,2,3) vdev is about as fast as a single drive. For sequential writes it is faster.
ZFS, as a transactional filesystem, does not have random writes at all.
For reads, yes. But the storagenode reads small files sequentially.