Now it hit me: 'Your node has been suspended'

Lol.
Just saw it: Your node has been suspended on 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE.

Now what shall I check first?

The disk has been under heavy use since yesterday evening. I guess as a result of that I see numerous “failed to add bandwidth usage” messages.

What I did so far:

docker logs storagenode 2>&1 | grep -E 'GET_AUDIT' | grep 'failed' | wc -l
0
docker logs storagenode 2>&1 | grep -E 'GET_REPAIR' | grep 'failed' | wc -l
16
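
A per-satellite breakdown of the same failures can help narrow things down. The following is only a sketch and assumes the default storagenode log format, where each line carries a "Satellite ID" field:

for action in GET_AUDIT GET_REPAIR; do
  echo "== $action =="
  docker logs storagenode 2>&1 \
    | grep "$action" | grep failed \
    | grep -oE '"Satellite ID": ?"[^"]+"' \
    | sort | uniq -c | sort -rn
done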

The log starts on 2/2/22, so sufficient time has passed for it to be meaningful.

Could 16 failed GET_REPAIRs cause a suspension? What else could I check?

Edit: Here is a weird thing: the dashboard shows the node’s online time as 4h 0m.

Did you have any sort of issues, and is the node still suspended on that satellite…

It’s not too difficult to hurt the suspension score, but it usually recovers very quickly…
I had my system stall out a few days ago and got a few suspensions, but when I noticed and checked it a few days later it was all fine again.

No, no issues, just the heavy disk usage. I am looking at the logs and the node is running just fine.

Are you kidding me?

(screenshot: outofsuspension)

5 minutes ago the suspension score was below 50%.


If you have a huge node already, it goes back up fast, yes. :v:t2:

Yeah, it surprised me too. Apparently there are two stages or kinds of suspensions… one can be really short and is caused by a lot of intermittent issues in a short time, and the score will jump right back.

The other is the one we are more familiar with and usually lasts a lot longer.

Don’t really know the exact details… I think it was @Alexey who was talking about it at one point.

Yes it could, I would look at what the errors say.

Big node, big satellite. The score can change very quickly with that combination.

It is getting really weird now:

docker logs storagenode 2>&1 | grep -E 'GET_REPAIR' | grep 'failed' | grep '1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE' | wc -l
0

So the failed repairs were not on the saltlake satellite the node got suspended on. How is that possible?

That is weird, but regardless you should look at them. Unlike normal transfers, repairs shouldn’t do long tail cancellation, so you shouldn’t see errors for those at all.
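
One way to look at them is to pull out the error field and group identical messages. A sketch, assuming the failed lines carry an "error" field as in the default log format:

docker logs storagenode 2>&1 \
  | grep GET_REPAIR | grep failed \
  | grep -oE '"error": ?"[^"]+"' \
  | sort | uniq -c | sort -rn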

Indeed.
Is there something other than ‘failed’ I could grep for?

You could look for ERROR, but I think you would get the same lines. It’s possible they get stuck somehow and just never finish.
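
For a broader sweep, something like this groups all ERROR lines by message and counts them; just a sketch, with ‘context canceled’ filtered out since those are usually normal transfer cancellations:

docker logs storagenode 2>&1 \
  | grep ERROR \
  | grep -v 'context canceled' \
  | sed 's/^.*ERROR//' | cut -c1-100 \
  | sort | uniq -c | sort -rn | head -20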

So, I know you know :slight_smile: but if the docker container restarted, your logs will be reset and the information prior to that lost.

docker ps -a

and look at the uptime for the container; if it’s about the same as the online time your dashboard shows, then your container restarted. The #1 suspect would be the out-of-memory (OOM) killer, probably due to a disk stall.
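
docker inspect can answer both questions in one go: whether Docker recorded an OOM kill and when the container last started. For example:

docker inspect -f 'OOMKilled={{.State.OOMKilled}}  StartedAt={{.State.StartedAt}}  Restarts={{.RestartCount}}' storagenode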

have a look at

dmesg | grep -i error

see if anything obvious jumps out.
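
Filtering for OOM kills and disk trouble specifically can be more telling than a plain ‘error’ grep; a sketch (-T prints human-readable timestamps, and the command may need sudo):

sudo dmesg -T | grep -iE 'out of memory|killed process|i/o error' | tail -20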

I can also add: last night at 00:00 UTC someone started a huge upload! My disks nearly took off :smiley: and the packets per second jumped from the normal 3 per second to 98 per second for my node.

It is probably unrelated, but if you had the blip last night, this traffic could have played a part.

Edit: also, what version of Docker, and which repo are you pulling from? If it happens to be Debian / Ubuntu, there are some breaking changes not backported yet \o/ so you would be better off updating your apt repos to point to the Docker repos and pulling the latest.

I’ve experienced a similar thing with my latest suspension: huge traffic, directly linked to auditing and repairs. The node hung up and got suspended. It took 3 days to recover and allow new ingress.

Yes. Nothing meaningful. It must have been stuck or offline somehow.

That was a good one: up only 7 hours for this node, which of course is not what it should be, as I have not restarted it for days. So something must have gone seriously wrong.

Ah yes, this one is good too: dmesg: read kernel buffer failed: Operation not permitted
This is probably it.
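
That particular message usually just means dmesg was run without root while kernel.dmesg_restrict is set, so the kernel buffer wasn’t readable at all; re-running with elevated privileges should show the real content:

sudo dmesg | grep -i error
# check whether unprivileged access to the kernel buffer is restricted on this box
sysctl kernel.dmesg_restrict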

20.10.7. I am on Debian but it seems I have pulled this from the original docker.com source.

Hmm, so… if you happen to be running in an LXC container on Proxmox, or on another KVM-based hypervisor, there is a regression in Docker I’m afraid. It has been patched in 20.10.12, so upgrade when you can (make sure storj is down first).

Oh interesting, I thought nothing of it, but the I/O packet graph shows the anomaly well :smiley: Would be interesting to know if others saw the same thing?

(graph: blip01-1200utc)


Yup.

A large spike at 0:00 UTC on Feb 11.

I’m also running about 40% more traffic overall this month so far.

That is a good idea. I will do that soon.

Not here… :frowning_face:

This is not correct. A restart of the container doesn’t wipe the logs; they are only lost when you re-create the container, i.e.

docker stop -t 300 storagenode
docker rm storagenode
docker run ...
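
If losing the logs on re-creation is a concern, one option is to redirect the node’s log to a file inside the mounted config directory so it survives a docker rm / docker run cycle. A sketch, assuming the usual setup where /app/config is mapped to a host directory holding config.yaml:

# host side: edit config.yaml in the node's config directory and set
#   log.output: "/app/config/node.log"
# then restart the container so it picks up the setting:
docker stop -t 300 storagenode
docker start storagenode
# the log now lives on the host and survives container re-creation:
tail -f /path/to/config/node.log   # placeholder path, use your own config dir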

Again?

Your node has been suspended on 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S.

us1

docker logs storagenode 2>&1 | grep -E 'GET_AUDIT' | grep 'failed' | wc -l
0

C’mon Storj, what’s wrong?

Did you check GET_REPAIR as well?
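
For reference, the same per-satellite one-liner as before, but filtered on the us1 satellite from this latest suspension:

docker logs storagenode 2>&1 | grep GET_REPAIR | grep 'failed' \
  | grep '12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S' | wc -l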