I’ve got nodes that seem to shut themselves down when encountering high iowait

I’ve got nodes that seem to shut themselves down when encountering high iowait.
I assume that’s part of the (new to me, at least) code / node-survival measures?

Unsure if there has been any real announcement about this. It took me like 8 hours before I noticed it was down; reboots are such a hassle, lol.

I highly doubt that. It would definitely have been mentioned in a change log if something like that were implemented. I’d look into that if I were you.

Yeah, it was acting weird, not sure why… I had just done a cold start of the server. I think it was because my nodes were running poorly, which made them take more memory, which left the pools with less ARC, so it became self-reinforcing until everything was basically stalled.

I mean, I’ve never seen Docker’s restart unless-stopped policy actually fail to start a node, and it happened to two nodes, while others took 4+ hours, if not 8, before they came back online.
Maybe it simply gives up after a while…
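If I want to double-check whether Docker actually gave up or kept retrying, something like this should show the restart policy and the restart counter; just a rough sketch, with “storagenode” standing in for whatever the container is actually named:

docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' storagenode
docker inspect -f '{{.RestartCount}}' storagenode

As far as I know, unless-stopped and always keep retrying indefinitely (with backoff), and only on-failure honours a maximum retry count, so Docker itself shouldn’t just give up.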

It had to have been some kind of code somewhere, but it could of course have been Docker.

The really weird thing is I can usually spin the nodes up in a couple of hours, which makes me wonder why it took like 16 hours this time.

I’ve never seen the nodes start up like that before… but maybe I’m just getting closer to the limits of my setup. I’ve been thinking about adding more IOPS and cache.

Please search for “unexpected shutdown” in your logs.
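Something along these lines should do it, assuming your container is named “storagenode” and you did not redirect the log to a file (the name is just a placeholder, adjust it to yours):

docker logs storagenode 2>&1 | grep -i "unexpected shutdown"

If you redirect logs to a file, grep that file instead.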


I have some “unexpected EOF” entries, and they number the same as my “unexpected” matches.
So no “unexpected shutdown” in the log from yesterday when it happened.

I of course also searched for “unexpected shutdown” but didn’t get any results, so I wanted to be sure it wasn’t a capitalization issue or something…

I have a ton of this weirdness:

storj.io/drpc/drpcmux.(*Mux).HandleRPC(0xc00012e130, 0x13617a0, 0xc002666b80, 0xc0041983e0, 0x1d, 0x1, 0x1)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcmux/handle_rpc.go:33 +0x286
storj.io/common/rpc/rpctracing.(*Handler).HandleRPC(0xc00021ab80, 0x1361860, 0xc00556b1e0, 0xc0041983e0, 0x1d, 0xc002b38ef8, 0x1252930)
	/go/pkg/mod/storj.io/common@v0.0.0-20210708125041-4882a3ae3eda/rpc/rpctracing/handler.go:58 +0x17b
storj.io/drpc/drpcserver.(*Server).handleRPC(0xc0004a2c30, 0xc00556b1e0, 0xc0041983e0, 0x1d, 0xc0041983e0, 0x1d)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcserver/server.go:102 +0x62
storj.io/drpc/drpcserver.(*Server).ServeOne(0xc0004a2c30, 0x135f320, 0xc000817050, 0x7ff63e811130, 0xc0015bae00, 0x0, 0x0)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcserver/server.go:60 +0x1af
storj.io/drpc/drpcserver.(*Server).Serve.func2(0x135f320, 0xc000817050)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcserver/server.go:95 +0x6a
storj.io/drpc/drpcctx.(*Tracker).track(0xc000817050, 0xc00045bfa0)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcctx/transport.go:51 +0x3c
created by storj.io/drpc/drpcctx.(*Tracker).Run
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcctx/transport.go:46 +0x66

goroutine 48501 [select]:
storj.io/drpc/drpcmanager.(*Manager).manageStreamContext(0xc008c9b0e0, 0x135eca0, 0xc009478a80, 0xc009196630, 0xc003cc7d40)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcmanager/manager.go:387 +0x165
created by storj.io/drpc/drpcmanager.(*Manager).manageStream
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcmanager/manager.go:344 +0xe5

Figured I would take a direct look at the event in the logs… It seems like it was doing that kind of stuff, then working for a bit, then going back to Go stack-trace entries:

2021-08-27T12:08:59.934Z	INFO	piecestore	upload started	{"Piece ID": "65QFT7Q2KDON6MLUY3WLC6GBQBYGQNFLOCCAMXPXDUFNYSVI5ZMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 1483276003251}
2021-08-27T12:08:59.963Z	INFO	piecestore	upload started	{"Piece ID": "VQ2KSWTYMMUUWMRGZIP4FNC5YET3K4QEYQSSFNF7P7YKNOREEXOQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 1483276003251}
storj.io/drpc/drpcmux.(*Mux).HandleRPC(0xc00012e130, 0x13617a0, 0xc002666b80, 0xc0041983e0, 0x1d, 0x1, 0x1)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcmux/handle_rpc.go:33 +0x286
storj.io/common/rpc/rpctracing.(*Handler).HandleRPC(0xc00021ab80, 0x1361860, 0xc00556b1e0, 0xc0041983e0, 0x1d, 0xc002b38ef8, 0x1252930)
	/go/pkg/mod/storj.io/common@v0.0.0-20210708125041-4882a3ae3eda/rpc/rpctracing/handler.go:58 +0x17b
storj.io/drpc/drpcserver.(*Server).handleRPC(0xc0004a2c30, 0xc00556b1e0, 0xc0041983e0, 0x1d, 0xc0041983e0, 0x1d)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcserver/server.go:102 +0x62
storj.io/drpc/drpcserver.(*Server).ServeOne(0xc0004a2c30, 0x135f320, 0xc000817050, 0x7ff63e811130, 0xc0015bae00, 0x0, 0x0)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcserver/server.go:60 +0x1af
storj.io/drpc/drpcserver.(*Server).Serve.func2(0x135f320, 0xc000817050)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcserver/server.go:95 +0x6a
storj.io/drpc/drpcctx.(*Tracker).track(0xc000817050, 0xc00045bfa0)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcctx/transport.go:51 +0x3c
created by storj.io/drpc/drpcctx.(*Tracker).Run
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcctx/transport.go:46 +0x66

It went from working to that…

Not sure what that is about… I’ve never seen that stuff in the logs before… I will have to check my logs and get some data on how often this has been happening.

It resumes 13 minutes later with this:

goroutine 48419 [select]:
storj.io/drpc/drpcmanager.(*Manager).manageStreamContext(0xc008c9aea0, 0x135eca0, 0xc009135110, 0xc0090980f0, 0xc003a95520)
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcmanager/manager.go:387 +0x165
created by storj.io/drpc/drpcmanager.(*Manager).manageStream
	/go/pkg/mod/storj.io/drpc@v0.0.23/drpcmanager/manager.go:344 +0xe5

rax    0x0
rbx    0x6
rcx    0xf9ee41
rdx    0x0
rdi    0x2
rsi    0x7ff638456930
rbp    0x7ff638456930
rsp    0x7ff638456918
r8     0xa
r9     0x19a7fac
r10    0x8
r11    0x246
r12    0x7ff63ebcf0a0
r13    0x0
r14    0x1321d88
r15    0x0
rip    0xf9ee41
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
2021-08-27T12:22:48.857Z	INFO	Configuration loaded	{"Location": "/app/config/config.yaml"}

Is this some sort of debug pre-launch log output?

Also, this day is a terrible example: I was doing a ton of mean stuff to my network, doing reconfiguration, and didn’t turn off the nodes… so the connection will have been spotty and the network acting weird.

I have seen something like that before. It usually happens if the Docker image got corrupted for some reason (my Raspberry Pi sometimes gets this). It suggests problems with the disk where Docker stores its images (in the case of my Raspberry, that’s a microSD card).
I would recommend stopping and removing all storagenode containers and removing the image, then pulling it again.
After that, try to start your nodes and check the logs.
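Roughly like this, assuming the default storjlabs/storagenode:latest image and a container named “storagenode” (adjust the names to your setup):

docker stop -t 300 storagenode
docker rm storagenode
docker rmi storjlabs/storagenode:latest
docker pull storjlabs/storagenode:latest

Then start the node again with your usual docker run command and watch the logs.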

No, there is nothing wrong now and no data loss, simply too much latency / iowait during my server boot…
It works fine now… I suppose maybe that was what stopped the Docker container… it timed out or something, lol.

The nodes are running fine now that all the filewalking has finished…

For once, which has never happened before, I maxed out my CPUs with iowait.
Maybe that’s why it performed so oddly and slowly.
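I’ve mostly been eyeballing the iowait with iostat / vmstat, something like this (iostat comes from the sysstat package; the 5 is just the refresh interval in seconds):

iostat -x 5     # %iowait plus per-device utilisation and await
vmstat 5        # the "wa" column is iowait

When those numbers spike, that’s when the nodes start misbehaving.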

I suppose if a node was filewalking and the container times out and stalls… then it restarts and restarts the filewalking, so it might never get done before other tasks on the system lessen the load.

Have you tried this suggestion?

Yeah, I moved the entire container root onto the PCIe-SSD-accelerated root drive.
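Concretely, that means Docker’s data directory now lives on that SSD-backed root. For anyone wanting to point it somewhere else explicitly, /etc/docker/daemon.json would look roughly like this (the mount path here is just an example, not my actual layout):

{
  "data-root": "/mnt/ssd/docker"
}

followed by a restart of the Docker daemon.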

It’s like there is some sort of latency-based timeout that didn’t use to be there; it just kills the node, and because that happens, the filewalker starts running, which knocks more nodes offline.

I could normally do 99.9% success rates; now I can’t even boot.
I have been going through most of my system over the last 24 hours fighting with this again, because I had to reboot again…

It doesn’t seem like anything is wrong aside from a bit of latency during boot.
I do have two resilvers running, which doesn’t help… but cancelling them doesn’t really fix the issue and just delays how long it will take me to add more raw IOPS.

The filewalker is a monster that should be controlled.
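I’ve seen mention of a config option to skip the piece scan at startup entirely; I haven’t verified it exists in the version I’m running, so treat the flag name below as an assumption rather than a fact:

# config.yaml - skip the startup piece scan (may not exist on older releases)
storage2.piece-scan-on-startup: false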

… or do not use RAID 😛


Well, they clearly changed something in the timeouts; my storagenodes basically just reboot randomly.

I know I don’t have any corrupt files or bad sectors, unless the storagenode created them corrupt in the first place.

Instead of jumping to conclusions, why not search the logs? Any intentional restart would certainly come with a log line that describes why.

Nobody is perfect and no system is perfect.

It could be a change to Docker or Linux for all I know…
What I was trying to point out is that I doubt the issue being solved here is related to corrupt files.

ZFS comes pretty close when it comes to data integrity, and it would certainly have told me if there was an issue with checksums on anything.

Have you actually checked the logs at the time of restart though? No point taking stabs in the dark without doing that.

It writes Go stack traces, not normal log lines…

Ahh, I think you are referring to the stack trace you posted elsewhere. That’s a sign of an unhandled exception. It’s the code doing something it isn’t supposed to do. Definitely confirms it’s not an intended restart. Storj has been really good about catching and handling errors. So this is often a sign of corruption, though not necessarily always. Could also be a memory issue.

It is certainly unrelated to this suggestion though, since that suggestion refers to a handled exception with a very specific error message. (The message was moved, so this part no longer applies.)

Also, I’m on ECC memory and running periodic scrubs on it… but I suppose I should find the memory scrub log outputs.

I haven’t really had any issues with the server’s memory, but I don’t really trust it, because it takes very little to make it drop some modules… Not sure why; it’s always been like that, and as long as I don’t touch the memory in the server, it seems stable.

So it’s possible… However, that wouldn’t explain why it’s directly related to iowait caused by the ZFS pool. If there is too much of it, so that CPU utilization hits 100% with, let’s say, 60–80% iowait, it will start happening… Lower the iowait and it goes away… raise it again and even stable nodes are not safe from random restarts.

It also seems to have gotten worse lately, but my system is rather busy at the moment: I’ve got two disk-replacement resilvers going. It was already a problem before that, though I think during the last reboot I was scrubbing; maybe I need to plan it better… But I haven’t had any issues with this in the past; nodes would just spin up, and in a couple of hours they would be running fine.

My current solution is just to slowly ease the nodes into operation; then they work fine.

It’s just that crazy filewalker always running at boot that makes problems for me.

Another interesting thing is that I don’t even need to restart the nodes; the disruption of course resets them, but if the iowait is low enough, they just keep running.

Of course, I’m continually upgrading and changing stuff on my server, so it can at times be difficult to figure out whether it was something I did, which is sort of what I’ve been trying to figure out for like a week. And of course nodes grow… so that also puts more load on the server.

There could be some correlation as IO wait causes the node to use much more RAM while waiting for data to be written to disk. I’m not saying that’s what it is, I just wouldn’t rule it out at this point.
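An easy way to sanity-check that is to watch per-container memory while the iowait is high, e.g.:

docker stats --no-stream

If the storagenode containers’ memory climbs along with iowait, that correlation is your answer.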

I do think that a lot of these issues are compounded because you run RAID. If you didn’t, the filewalker of each individual node would just hit its own disk. It would have less to do, too, and wouldn’t be interrupted by anything else going on on other disks. It wouldn’t be impacted by resilvers or parity checks either.
Now, I’m not one to talk, as I have 2 nodes running on spare space on arrays as well, but I don’t seem to have had the issues you have (not even while using the same array to plot for Chia). In the end, I think less really is more with Storj setups, and it seems to me that the added complexity of your setup causes more problems than it solves.