Now it hit me: 'Your node has been suspended'

Lol.
Just saw it: Your node has been suspended on 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE.

Now what shall I check first?

The disk has been under heavy use since yesterday evening. I guess as a result of that I see numerous “failed to add bandwidth usage” messages.

What I did so far:

docker logs storagenode 2>&1 | grep -E 'GET_AUDIT' | grep 'failed' | wc -l
0
docker logs storagenode 2>&1 | grep -E 'GET_REPAIR' | grep 'failed' | wc -l
16
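
A per-satellite breakdown of the same failures can help narrow things down. The following is only a sketch and assumes the default storagenode log format, where each line carries a "Satellite ID" field:

for action in GET_AUDIT GET_REPAIR; do
  echo "== $action =="
  docker logs storagenode 2>&1 \
    | grep "$action" | grep failed \
    | grep -oE '"Satellite ID": ?"[^"]+"' \
    | sort | uniq -c | sort -rn
done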

The log starts on 2/2/22, so sufficient time has passed for it to be meaningful.

Could 16 failed GET_REPAIRs cause a suspension? What else could I check?

Edit: Here is a weird thing: the dashboard shows the node’s online time as 4h 0m.

Did you have any sort of issues, and is the node still suspended on that satellite…

It’s not too difficult to hurt the suspension score, but it usually recovers very quickly…
I had my system stall out a few days ago and got a few suspensions, but when I noticed and checked it a few days later it was all fine again.

No, no issues, just the heavy disk usage. I am looking at the logs and the node is running just fine.

Are you kidding me?

(screenshot: outofsuspension)

5 minutes ago the suspension score was below 50%.


If you have a huge node already, it goes back up fast, yes. :v:t2:

Yeah, it surprised me too. Apparently there are two stages or kinds of suspensions… one can be really short and is caused by a lot of intermittent issues in a short time, and the score will jump right back.

The other is the one we are more familiar with and usually lasts a lot longer.

Don’t really know the exact details… I think it was @Alexey who was talking about it at one point.

Yes it could, I would look at what the errors say.

Big node, big satellite. The score can change very quickly with that combination.

It is getting really weird now:

docker logs storagenode 2>&1 | grep -E 'GET_REPAIR' | grep 'failed' | grep '1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE' | wc -l
0

So the failed repairs were not on the saltlake satellite the node got suspended on. How is that possible?

That is weird, but regardless you should look at them. Unlike normal transfers, repairs shouldn’t do long tail cancellation, so you shouldn’t see errors for those at all.
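
One way to look at them is to pull out the error field and group identical messages. A sketch, assuming the failed lines carry an "error" field as in the default log format:

docker logs storagenode 2>&1 \
  | grep GET_REPAIR | grep failed \
  | grep -oE '"error": ?"[^"]+"' \
  | sort | uniq -c | sort -rn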

Indeed.
Is there something other than ‘failed’ I could grep for?

You could look for ERROR, but I think you would get the same lines. It’s possible they get stuck somehow and just never finish.
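
For a broader sweep, something like this groups all ERROR lines by message and counts them; just a sketch, with ‘context canceled’ filtered out since those are usually normal transfer cancellations:

docker logs storagenode 2>&1 \
  | grep ERROR \
  | grep -v 'context canceled' \
  | sed 's/^.*ERROR//' | cut -c1-100 \
  | sort | uniq -c | sort -rn | head -20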

So, I know you know :slight_smile: but if the docker container restarted, your logs will be reset and the information prior to that lost.

docker ps -a

and look at the uptime for the container; if it’s about the same as the online time your dashboard shows, then your container restarted. The #1 suspect would be the out-of-memory (OOM) killer, probably due to a disk stall.
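
docker inspect can answer both questions in one go: whether Docker recorded an OOM kill and when the container last started. For example:

docker inspect -f 'OOMKilled={{.State.OOMKilled}}  StartedAt={{.State.StartedAt}}  Restarts={{.RestartCount}}' storagenode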

have a look at

dmesg | grep -i error

see if anything obvious jumps out.
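
Filtering for OOM kills and disk trouble specifically can be more telling than a plain ‘error’ grep; a sketch (-T prints human-readable timestamps, and the command may need sudo):

sudo dmesg -T | grep -iE 'out of memory|killed process|i/o error' | tail -20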

I can also add: last night at 00:00 UTC someone started a huge upload! My disks nearly took off :smiley: and the packets per second jumped from the normal 3 per second to 98 per second for my node.

It is probably unrelated, but if you had the blip last night, this traffic could have played a part.

Edit: also, what version of Docker, and which repo are you pulling from? If it happens to be Debian / Ubuntu, there are some breaking changes not backported yet \o/ so you would be better off updating your apt repos to point to the Docker repos and pulling the latest.

I’ve experienced a similar thing with my latest suspension: huge traffic, directly linked to auditing and repairs. The node hung up and got suspended. It took 3 days to recover and allow new ingress.

Yes. Nothing meaningful. It must have been stuck or offline somehow.

That was a good one: up only 7 hours for this node, which of course is not what it should be, as I have not restarted it for days. So something must have gone seriously wrong.

Ah yes, this one is good too: dmesg: read kernel buffer failed: Operation not permitted
This is probably it.
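
That particular message usually just means dmesg was run without root while kernel.dmesg_restrict is set, so the kernel buffer wasn’t readable at all; re-running with elevated privileges should show the real content:

sudo dmesg | grep -i error
# check whether unprivileged access to the kernel buffer is restricted on this box
sysctl kernel.dmesg_restrict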

20.10.7. I am on Debian but it seems I have pulled this from the original docker.com source.

Hmm, so… if you happen to be running in an LXC container on Proxmox, or on another KVM-based hypervisor, there is a regression in Docker I’m afraid. It has been patched in 20.10.12, so upgrade when you can (make sure storj is down first).

Oh interesting, I thought nothing of it, but the I/O packet graph shows the anomaly well :smiley: Would be interesting to know if others saw the same thing?

(graph: blip01-1200utc)


Yup.

A large spike at 0:00 UTC on Feb 11.

I’m also running about 40% more traffic overall this month so far.

That is a good idea. I will do that soon.

Not here… :frowning_face:

This is not correct. A restart of the container doesn’t wipe the logs; they are only lost when you re-create the container, i.e.

docker stop -t 300 storagenode
docker rm storagenode
docker run ...
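
If losing the logs on re-creation is a concern, one option is to redirect the node’s log to a file inside the mounted config directory so it survives a docker rm / docker run cycle. A sketch, assuming the usual setup where /app/config is mapped to a host directory holding config.yaml:

# host side: edit config.yaml in the node's config directory and set
#   log.output: "/app/config/node.log"
# then restart the container so it picks up the setting:
docker stop -t 300 storagenode
docker start storagenode
# the log now lives on the host and survives container re-creation:
tail -f /path/to/config/node.log   # placeholder path, use your own config dir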

Again?

Your node has been suspended on 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S.

us1

docker logs storagenode 2>&1 | grep -E 'GET_AUDIT' | grep 'failed' | wc -l
0

C’mon Storj, what’s wrong?

Did you check GET_REPAIR as well?
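
For reference, the same per-satellite one-liner as before, but filtered on the us1 satellite from this latest suspension:

docker logs storagenode 2>&1 | grep GET_REPAIR | grep 'failed' \
  | grep '12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S' | wc -l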