I thought ZFS was a memory hog by nature? Wasn't the rule of thumb 1 GB of RAM per 1 TB of storage?
I just got another server that doesn’t have storj on it yet. Apart from replacing failed bits I leave the storj box alone.
i got 114 GB of RAM mostly just so ZFS can do some proper caching, plus a 1 TB L2ARC for metadata and a SLOG.
there is a lot of stuff on the pool. yes, ZFS likes memory, but it also delivers some great performance.
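For context, on Linux OpenZFS an ARC/L2ARC/SLOG setup like this can be inspected and adjusted along these lines (the pool name `tank` and the device paths are placeholders, not from this post; the ARC cap is illustrative):

```shell
# Inspect current ARC size and target maximum (bytes)
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats

# Cap the ARC so the host keeps RAM headroom for containers (value is illustrative)
echo $((96 * 1024**3)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# Attach an L2ARC cache device and a SLOG to an existing pool
sudo zpool add tank cache /dev/disk/by-id/nvme-CACHE-part1
sudo zpool add tank log   /dev/disk/by-id/nvme-SLOG-part2
```

Worth noting: a SLOG only absorbs synchronous writes; async traffic still buffers in RAM and flushes to the main vdevs.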
i recently found out my containers' root drive was located on the OS boot drive, which is a sad little thing, so i tried moving them to the pool, but that didn't seem to fix the problem... so next i'm trying an old enterprise Intel SSD which i was planning on using for the OS but haven't gotten around to setting up...
it will be dedicated to the containers, and thus the Docker images and such... maybe that will help...
at least it's good with a challenge lol
I have a 32 GB desktop PC, 10 GB tied up in a Windows VM, and 6 nodes with a total of 14 TB, all running on 3 drives in a single eSATA enclosure. The filewalker is crazy, and the enclosure reconnects all drives multiple times during that process; each reconnect takes around 10 seconds, during which the HDD is unreachable. Iowait is multiple seconds all the time.
I even had some file corruption, which ZFS found during scrubs (I always deleted those files afterwards, but I only run a scrub at most once a month).
However, I never had the problem that a node would crash.
Just throwing this in as a comparison to a rather unreliable and less powerful system, without having read everything.
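A monthly scrub like that can be started and checked with the standard commands (the pool name `tank` is a placeholder):

```shell
# Start a scrub; it runs in the background
sudo zpool scrub tank

# Show progress plus any checksum errors and the affected files, if any
zpool status -v tank
```

`zpool status -v` is also what lists the specific corrupted files, which is how you'd find the ones to delete.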
the dedicated ssd for the container root drive was also a bust...
seems slightly better, but not sure if that's just me being an optimist lol
to counter the main issue though, i revised my run command and removed the restart-unless-stopped part, since it seems to be part of what causes the majority of the problems: overload starts a cascading effect of nodes resetting other nodes...
i just don't get why this is happening;
the pool doesn't even seem that stressed, and everything else seems to work pretty okay on it...
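A sketch of what dropping that restart policy looks like, based on the standard storagenode run command (the container name, ports, and paths here are illustrative and trimmed, not the poster's actual values):

```shell
# --restart no (the default) lets a crashed node stay down instead of
# restart-looping, which avoids piling fresh filewalker iowait from one
# restarting node onto the others sharing the pool
docker run -d --name storagenode1 \
  --restart no \
  -p 28967:28967/tcp -p 28967:28967/udp \
  -v /mnt/bitpool/storagenode1/identity:/app/identity \
  -v /mnt/bitpool/storagenode1/data:/app/config \
  storjlabs/storagenode:latest
```

The trade-off is that a genuinely crashed node then needs a manual `docker start`, or supervision from something smarter than a blind restart loop.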
```
bitpool      total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      0      0      0
511ns           0      0      0      0      0      0      0      0      0      0
1us             0      0      0      0      0      0      0      0      0      0
2us             0      0      0      0     31     91     32    255      0      0
4us             0      0      0      0    658    648    235  1.10K      1      0
8us             0      0      0      0    124     12     22    204      0      0
16us            0      0      0      0      4      1      1     22      0      0
32us            0      2      0      7      4      0      1     15      0      0
65us            0    459      0    475      2      0      0      7      0      0
131us           0    258      0    244      0      0      0     11      0      0
262us           2    433      4    483      0      0      0     19      0      0
524us           6    428      9    491      0      0      0     23      0      0
1ms             2    284      5    322      1      0      0     27      0      0
2ms             1    205      6    230      1      0      1     33      0      0
4ms             8    118     17    131      3      0      3     38      0      0
8ms            90    109    108    121      5      0      7     49      0      0
16ms          345    113    367    167      6      0     13     62      0      0
33ms          374    114    392    223     10      0     11     79      0      0
67ms          233    146    240    142     13      0      6    102      1      0
134ms         116    182    114    122     15      0      3    120      2      0
268ms          57    170     37     50     18      0      0    119      4      0
536ms          36    119     10     16     15      0      0     88      7      0
1s             22     52      4      5      4      0      0     40     11      0
2s             14     23      1      1      1      0      0     19     11      0
4s              8     11      0      0      0      0      0     10      6      0
8s              0      3      0      0      0      0      0      3      0      0
17s             0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------
```
600-second `zpool iostat -w` output.
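For reference, a histogram like this can be collected with (pool name `bitpool` from the post; the interval and count are illustrative):

```shell
# One 600-second latency histogram sample for the pool "bitpool";
# -w prints per-bucket wait-time histograms instead of throughput
zpool iostat -w bitpool 600 1
```

The total_wait column is what applications actually experience, so the nonzero read counts out in the 1s–8s buckets are the ones that line up with the node problems.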
what it seems like to me is that somebody tried to mitigate hdd timeouts while not being aware that some data can actually sit in memory for up to a few minutes before being written to disk, depending on queue levels.
async writes aren't forced to the physical media until there is time for it in the workflow or room in the hdd cache.
so somebody may have set some sort of limit, say 10 seconds or something, after which it tries to reset the node...
of course this doesn't really correlate with the Go code showing up in the logs; if this was intended programming, regular log output would be expected.
dunno what else could cause this... now, with the container root drive on a dedicated ssd that runs flawlessly without a hint of latency, and using newly downloaded storagenode images, it still has the exact same issue when the pool gets even moderately loaded.
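On the write-buffering point: in OpenZFS, async writes accumulate in a transaction group (txg) and are flushed on a timer or when buffered dirty data hits a limit, so write latency spikes tend to show up when a txg sync collides with an already busy pool. The relevant knobs can at least be inspected like this on Linux (read-only sketch; changing them is a separate tuning exercise):

```shell
# Seconds between forced txg syncs (default 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout

# Maximum dirty (not-yet-written async) data ZFS will hold in RAM, in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max
```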
finally figured this out.
i had a BIOS feature called coarse clock gating which i had turned off, on the assumption that would be better for overall cpu utilization.
apparently that came at the cost of my cpu stalling out when exposed to high iowait from my storage solution.
this caused the OS to get hiccups, with all kinds of stuff unable to get access to various resources at nearly random times; docker and the storagenodes were highly sensitive to this and would restart themselves, causing more iowait because of the filewalker.
so all in all it was a cpu core/thread utilization/configuration issue.
Congrats on working it out.
took so long; might also have been the extra traffic we have been seeing.
and i did get a lot of optimization done that has been on the waiting list for a long time, so at least i got something out of the time lol
Oof, well, it doesn’t surprise me that that took a while to find. Gotta love those cascading effects of pretty obscure settings. Nice find!
always fun to try to fix one problem and then create another that you only really discover 60 days later...
think i'm going to start keeping a log of my changes.