I’ve got nodes that seem to shut themselves down when encountering high iowait

I thought ZFS was a memory hog by nature? Wasn’t the rule of thumb 1GB of RAM per 1TB of storage?

I just got another server that doesn’t have storj on it yet. :wink: Apart from replacing failed bits I leave the storj box alone.

i got 114GB RAM mostly just so ZFS can do some proper caching…
and 1TB L2ARC for metadata and SLOG
there is a lot of stuff on the pool; yes, ZFS likes memory, but it also gives some great performance in return.
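for anyone curious, attaching those devices is just a zpool add per vdev type, roughly like this (device names are examples, not my actual layout):

# device names are examples only
zpool add bitpool cache /dev/disk/by-id/ata-EXAMPLE_SSD-part1    # L2ARC (read cache)
zpool add bitpool log   /dev/disk/by-id/ata-EXAMPLE_SSD-part2    # SLOG (sync write log)
zfs set secondarycache=metadata bitpool                          # limit L2ARC to metadata only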

i recently found out my containers’ root drive was located on the OS boot drive, which is a sad little thing, so i tried moving them to the pool, but that didn’t seem to fix the problem… so next i’m trying an old enterprise intel ssd which i was planning on using for the OS but haven’t gotten around to setting up…

so it will be dedicated to the containers, and thus the docker images and such… maybe that will help…
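if anyone wants to do the same, one way (not necessarily the only one) is to point docker’s data-root at the dedicated ssd in /etc/docker/daemon.json and move the existing data over, roughly:

# /etc/docker/daemon.json (the path is just an example)
{
  "data-root": "/mnt/containerssd/docker"
}

# then move the existing images/containers and restart the daemon
systemctl stop docker
rsync -a /var/lib/docker/ /mnt/containerssd/docker/
systemctl start docker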

at least it’s good with a challenge :smiley: lol

1 Like

I have a 32GB desktop PC, with 10GB tied up in a Windows VM, and 6 nodes with a total of 14TB all running on 3 drives in a single eSATA enclosure. The filewalker is crazy, and the enclosure reconnects all drives multiple times during that process; each reconnect takes around 10 seconds, during which the HDDs are unreachable. Iowait is multiple seconds all the time.
I even had some file corruption, which ZFS found during scrubs. (I always deleted those files afterwards, but I only run a scrub at most once a month.)

However, I never had the problem that a node would crash.
Just throwing this in as a comparison to a rather unreliable and less powerful system without having read everything.
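For anyone chasing the same kind of corruption, a scrub plus zpool status -v is enough to list the affected files; a rough sketch with a placeholder pool name:

zpool scrub tank        # pool name is a placeholder
zpool status -v tank    # -v lists any files hit by unrecoverable errors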

2 Likes

the dedicated ssd for the container root drive was also a bust…
seems slightly better, but not sure if that’s just me being an optimist lol

to counter the main issue tho, i revised my run command and removed the restart unless-stopped part, since it seems to be part of what causes the majority of the problems: overload starts a cascading effect of nodes restarting and resetting other nodes…
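for reference, the change is roughly this: --restart no (docker’s default) instead of --restart unless-stopped. the rest of the values here are placeholders, not my actual command:

docker run -d --restart no --stop-timeout 300 \
    -p 28967:28967/tcp -p 28967:28967/udp \
    -e WALLET="0x..." -e EMAIL="me@example.com" \
    -e ADDRESS="my.ddns.example:28967" -e STORAGE="10TB" \
    --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
    --mount type=bind,source=/mnt/storj/node1,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest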

i just don’t get why this is happening,

the pool doesn’t even seem that stressed, everything else seems to work pretty okay on it…

bitpool      total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      0      0      0
511ns           0      0      0      0      0      0      0      0      0      0
1us             0      0      0      0      0      0      0      0      0      0
2us             0      0      0      0     31     91     32    255      0      0
4us             0      0      0      0    658    648    235  1.10K      1      0
8us             0      0      0      0    124     12     22    204      0      0
16us            0      0      0      0      4      1      1     22      0      0
32us            0      2      0      7      4      0      1     15      0      0
65us            0    459      0    475      2      0      0      7      0      0
131us           0    258      0    244      0      0      0     11      0      0
262us           2    433      4    483      0      0      0     19      0      0
524us           6    428      9    491      0      0      0     23      0      0
1ms             2    284      5    322      1      0      0     27      0      0
2ms             1    205      6    230      1      0      1     33      0      0
4ms             8    118     17    131      3      0      3     38      0      0
8ms            90    109    108    121      5      0      7     49      0      0
16ms          345    113    367    167      6      0     13     62      0      0
33ms          374    114    392    223     10      0     11     79      0      0
67ms          233    146    240    142     13      0      6    102      1      0
134ms         116    182    114    122     15      0      3    120      2      0
268ms          57    170     37     50     18      0      0    119      4      0
536ms          36    119     10     16     15      0      0     88      7      0
1s             22     52      4      5      4      0      0     40     11      0
2s             14     23      1      1      1      0      0     19     11      0
4s              8     11      0      0      0      0      0     10      6      0
8s              0      3      0      0      0      0      0      3      0      0
17s             0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

600-second zpool iostat -w
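for reference, the histogram above came from something along these lines (interval and count shown as an example):

zpool iostat -w bitpool 600 1    # -w = latency histograms, 600s window, single sample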

what it seems to me is that somebody tried to mitigate hdd timeouts while not being aware that some data can actually sit in memory for up to a few minutes before being written to disk, depending on the level of queuing.
async writes aren’t forced onto the physical media until there is time for them in the workflow or room in the hdd cache.
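on linux openzfs the write-batching knobs behind that are visible under /sys/module/zfs/parameters, e.g. (just looking, not suggesting to change anything):

cat /sys/module/zfs/parameters/zfs_txg_timeout      # seconds before a transaction group is forced out (default 5)
cat /sys/module/zfs/parameters/zfs_dirty_data_max   # cap on dirty (not yet written) data held in RAM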

so somebody may have set some sort of limit of, say, 10 seconds or something, which then tries to reset the node…

ofc this doesn’t really correlate with the go code showing up in the logs; if this was intended behaviour, regular log output would be expected.

dunno what else could cause this… even now, with the container root drive on a dedicated ssd that runs flawlessly without a hint of latency and with freshly downloaded storagenode images, it still has the exact same issue when the pool gets even a medium load.

finally figured this out.

i had turned off a bios feature called coarse clock gating, figuring that should be better for overall cpu utilization.

apparently that came at the cost of my cpu stalling out when exposed to high iowait from my storage solution.

this caused the OS to get hiccups, with all kinds of stuff unable to get access to various resources at nearly random times. docker and the storagenodes were highly sensitive to this and would restart themselves, causing even more iowait because of the filewalker.
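a rough way to confirm it’s docker doing the restarting and to watch the stalls (container name is just an example):

docker inspect --format '{{.RestartCount}} {{.State.StartedAt}}' storagenode
iostat -x 5        # per-device await/utilization (sysstat package)
mpstat -P ALL 5    # per-core %iowait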

so all in all it was a cpu core/thread utilization / configuration issue.

4 Likes

Congrats on working it out.

2 Likes

took so long, part of it might also have been the extra traffic we have been seeing.
and i did get a lot of optimization done which had been on the waiting list for a long time, so at least i got something out of the time lol

Oof, well, it doesn’t surprise me that that took a while to find. Gotta love those cascading effects of pretty obscure settings. Nice find!

2 Likes

always fun to try to fix one problem and then create another that one only really discovers 60 days later…

think i’m going to start keeping a log of my changes.

3 Likes