Email telling me node suspended, but everything 100% in dashboard

s.thompson · April 24, 2022, 8:22pm

I received an email just now telling me that my node has been suspended by one of the US satellites because it ‘produced too many errors during audits’. However, when I look on my node dashboard I see all my audits and uptime at 100%.

Is there something wrong either with the satellite to cause this, or is there a condition in which there could be unreported audit failures in my dashboard?

All the storage for my node is on my fileserver, which uses a redundant ZFS filesystem (so any read errors should be transparently found and repaired). I had a drive fail last week, but that was automatically replaced by a hot spare with no data loss/corruption, so that shouldn’t be the cause.

Are there any troubleshooting steps I can do from my end to figure out where the issue actually lies here?

Edit: it seems Asia East has also now suspended me, but still nothing in my dashboard…

baker · April 24, 2022, 8:29pm

Sometimes the score on the dashboard lags behind, because this value is reported to the node by the satellites. It is only updated when the node checks in for stats.

You can check your logs for lines with GET_AUDIT and error or GET_REPAIR and error. There have been some problems with nodes entering suspension related to this error:

The node will usually recover quickly from suspension though.
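For anyone who wants to script the log check described above, here is a minimal Python sketch. It assumes the log is plain text with one entry per line, and that a failed transfer shows up as a line containing both the action name and the word "error". The sample lines are made up for illustration and are not real storagenode output; the function name is hypothetical too.

```python
from typing import Iterable, List

# Actions named in the post above. Matching "error" is a plain
# case-insensitive substring test, which is an assumption about the log text.
ACTIONS = ("GET_AUDIT", "GET_REPAIR")


def find_failed_transfers(lines: Iterable[str]) -> List[str]:
    """Return the lines that mention an audit/repair action and an error."""
    hits = []
    for line in lines:
        if any(action in line for action in ACTIONS) and "error" in line.lower():
            hits.append(line.rstrip("\n"))
    return hits


# Made-up sample lines, only to show the filter at work.
sample = [
    '2022-04-24T20:01:00Z INFO piecestore download started {"Action": "GET_AUDIT"}',
    '2022-04-24T20:01:02Z ERROR piecestore download failed {"Action": "GET_AUDIT", "error": "example failure"}',
    '2022-04-24T20:02:10Z ERROR piecestore download failed {"Action": "GET", "error": "example failure"}',
    '2022-04-24T20:03:30Z ERROR piecestore download failed {"Action": "GET_REPAIR", "error": "example failure"}',
]

failed = find_failed_transfers(sample)
for entry in failed:
    print(entry)
```

To run it against a real log, pass an open file object instead of `sample` (for example `find_failed_transfers(open(path))`, where `path` is wherever your node writes its log).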

s.thompson · April 26, 2022, 9:52am

Can confirm, no errors in my local log. Also, two days later, still nothing in the dashboard - everything still at 100%.

Still seeing normal traffic from the satellites which my emails told me had suspended me (which I don’t think would be expected if they really had suspended me).

So I’m concluding that there must be some gremlins somewhere in the system which have incorrectly sent me suspension emails (which now cover three satellites, apparently).

s.thompson · April 26, 2022, 3:44pm

Update: I’ve finally seen something in my dashboard: on two satellites my audit percentage has dropped to 99.98% (which itself seems odd to me). But this is a far cry from ‘audits have dropped to the point that I’m being suspended’…

Edit: typo

Alexey · April 26, 2022, 6:21pm

The suspension can happen only for two reasons:

online score below 60%;
suspension score below 60%.

A drop in the audit score does not cause suspension; when the audit score drops below 60%, the node is disqualified.
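The rules above can be written down as a small Python sketch. This is only a simplification of what is stated in this post, not the satellite's actual code; the function name and the use of fractions (0.60 for 60%) are my own assumptions.

```python
# Threshold as stated in the post: 60% for all three scores.
THRESHOLD = 0.60


def node_status(online: float, suspension: float, audit: float) -> str:
    """Map the three scores (as fractions, 1.0 == 100%) to a status."""
    # An audit score below the threshold means disqualification, not suspension.
    if audit < THRESHOLD:
        return "disqualified"
    # Suspension happens only for a low online score or a low suspension score.
    if online < THRESHOLD or suspension < THRESHOLD:
        return "suspended"
    return "ok"


# Online and audit scores look fine, but the suspension score has dropped.
print(node_status(online=1.0, suspension=0.55, audit=0.9998))
```

With those example numbers the result is "suspended", which matches the situation in this thread: a suspension email while the audit and online scores still look healthy.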

So, if the online score is 100%, then the suspension score must have dropped below 60%. But if the problem that prevented the node from passing audits/GET_REPAIR is solved, and the node has started to pass audits and provide the requested pieces without errors, the suspension score will quickly recover (with each successful audit/repair transfer).

When you check for errors, you need to check for both GET_AUDIT and GET_REPAIR, since they both affect either the suspension score (in case of unknown errors during an audit or repair transfer) or the audit score (in case of known errors like “file not found” or ongoing timeouts).
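As a rough illustration of that split, the Python sketch below sorts an error into the score it would affect. The substring tests for "file not found" and "timeout" are assumptions made for the example; the real error text in your log may be worded differently, and the function name is hypothetical.

```python
from typing import Optional

# Only these two actions affect the scores discussed above.
SCORED_ACTIONS = ("GET_AUDIT", "GET_REPAIR")

# Assumed markers for "known" errors, taken from the examples in the post.
KNOWN_ERROR_MARKERS = ("file not found", "timeout")


def affected_score(action: str, error_text: str) -> Optional[str]:
    """Return "audit", "suspension", or None if the action is not scored."""
    if action not in SCORED_ACTIONS:
        return None
    text = error_text.lower()
    # Known errors count against the audit score.
    if any(marker in text for marker in KNOWN_ERROR_MARKERS):
        return "audit"
    # Anything else is treated as an unknown error (suspension score).
    return "suspension"


print(affected_score("GET_AUDIT", "file not found"))
print(affected_score("GET_REPAIR", "some unexpected failure"))
```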