Failing suspension audits because of node restart

I got notified that a new version is available and restarted my node to update.
Then I noticed that I do not have tcp fastopen enabled, put the required setting in sysctl.conf and restarted the node again.
Then I noticed that fastopen is still disabled, modified the docker run command and restarted the node again.
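For anyone checking the same thing: a minimal sketch for verifying whether TCP Fast Open is actually enabled, assuming a Linux host where the setting is exposed as `net.ipv4.tcp_fastopen` (bit 0x1 enables the client side, bit 0x2 the server side, so `3` enables both). The helper names are made up for illustration. Since a container can have its own network namespace, the value inside the container may differ from the host's, which is presumably why the docker run command needed changing too.

```python
from pathlib import Path

# Bit flags of the Linux net.ipv4.tcp_fastopen sysctl.
TFO_CLIENT = 0x1  # TFO allowed on outgoing connections
TFO_SERVER = 0x2  # TFO allowed on listening sockets


def describe_tfo(value: int) -> dict:
    """Decode the sysctl value into client/server booleans."""
    return {
        "client": bool(value & TFO_CLIENT),
        "server": bool(value & TFO_SERVER),
    }


def read_tfo(path: str = "/proc/sys/net/ipv4/tcp_fastopen") -> int:
    """Read the current value (hypothetical helper; Linux only)."""
    return int(Path(path).read_text().strip())
```

Calling `describe_tfo(read_tfo())` on the host and again inside the container shows whether both sides see the setting.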

Node was first stopped at 2023-04-14 14:01:03
Second time at 2023-04-14 14:01:39
And third time at 2023-04-14 14:05:43

This got my suspension audit score on us1 down to 0.965.

According to the logs, up until the first time my node was stopped, it got 3655 audits from that satellite, all successful.
After the first restart, it got 0 audits (it was running for 20 seconds or so, so not surprising).
After the second restart, it got 4 audits, all successful.
After the third restart (until the time I am writing this), it got 17 audits, all successful.
I assume the audits were successful since the count of “download started” and “downloaded” lines match.
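The counting can be sketched roughly like this. It assumes that each log line contains the message text ("download started", "downloaded", or "download failed"), the satellite ID, and an action of `GET_AUDIT` for audits; that layout and the function name are assumptions, so adjust the matching to what your log actually looks like.

```python
from collections import Counter
from typing import Iterable


def count_audits(lines: Iterable[str], satellite_id: str) -> Counter:
    """Count audit download events for one satellite.

    Assumes every relevant line mentions GET_AUDIT and the satellite ID.
    """
    counts: Counter = Counter()
    for line in lines:
        if "GET_AUDIT" not in line or satellite_id not in line:
            continue
        # Check "download started" and "download failed" before the
        # bare "downloaded" so each line is counted exactly once.
        if "download started" in line:
            counts["started"] += 1
        elif "download failed" in line:
            counts["failed"] += 1
        elif "downloaded" in line:
            counts["downloaded"] += 1
    return counts
```

Usage would be something like `count_audits(open("node.log"), "<satellite id>")`. If "started" is higher than "downloaded" plus "failed", some audits were probably cut off, for example by a restart in the middle of a transfer.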

So, I got nearly suspended for restarting my node three times in quick succession? Really? Isn’t it a bit too harsh?

You probably wouldn’t be suspended even if your suspension score went below 0.96. I was there with a similar case: nothing happened, and hours later it was back at 1.

There are different cases which will affect your score but not necessarily lead to suspension. If your node responds with gibberish rather than success/failure/offline, it’s “case 5”.
For example, your node is online and can respond, but the data is not accessible. It can respond, but with neither correct nor incorrect data.
See case 5:

This will lead to containment mode, and after two more attempts to audit the same piece, with a 5-minute timeout each, the audit is considered failed (and it will affect the audit score, not the suspension score).
So, you described case 4.

The suspension score is affected when none of cases 1-4 applies, for example when the database is corrupted at the time the audit is requested, or your node forcibly closes the connection during the audit.
In those cases the audit gets an almost immediate response with an unexpected error.

Are you sure about this? I once had inaccessible data, but the audit score wasn’t affected; only the suspension score went down.
Case 4 reads like there is data transferred but connection lost while doing so.

On the other hand

As I understand it, exactly this would be case 4.

Yes. The suspension score is affected when the node answers an audit request and returns an unknown error. If the node eventually provides the piece within the 3 attempts, the audit is considered passed.
Please note that the suspension score may be affected independently of the audit score, and both can be affected at once.
Known errors are: file not found, piece is corrupted, or the node answers but cannot provide the piece after 3 attempts with a 5-minute timeout each.
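To summarize the rules described so far, here is a rough sketch of which score each audit outcome touches. It is only a restatement of this thread, not the satellite's actual code: the names are invented for illustration, and the "offline" branch (online score) is an assumption that is not spelled out above.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FILE_NOT_FOUND = "file not found"
    CORRUPTED = "piece is corrupted"
    TIMEOUT = "timeout"
    OFFLINE = "offline"
    UNKNOWN_ERROR = "unknown error"


MAX_ATTEMPTS = 3  # per the thread: 3 attempts, 5-minute timeout each


def affected_score(outcome: Outcome, attempt: int = 1) -> str:
    """Return which score an audit outcome affects, as described in the thread."""
    if outcome is Outcome.SUCCESS:
        return "none"  # audit passed
    if outcome in (Outcome.FILE_NOT_FOUND, Outcome.CORRUPTED):
        return "audit"  # known error: counts as a failed audit
    if outcome is Outcome.TIMEOUT:
        # The node is contained and asked for the same piece again;
        # only after the last attempt does the audit count as failed.
        return "audit" if attempt >= MAX_ATTEMPTS else "containment"
    if outcome is Outcome.OFFLINE:
        return "online"  # assumption, not stated in the thread
    return "suspension"  # unknown error
```

So a "file not found" hits the audit score right away, while an error the auditor does not recognize only hits the suspension score.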

This will lead to a timeout error on the checker’s side, and your node will be placed into containment mode to be asked for the same piece two more times.

No, the checker will receive a “context canceled, connection forcible closed by the remote host” error; this will affect the suspension score, and your node will be placed into containment mode.


So it would be better to cut the connection than to have an inaccessible file? For example, if the drive disconnects while the node is still running, are those all audit failures? Or is there a difference between failing satellite audits and responding “file not found” to client requests?

If that’s the case I would also like to know what could have impacted the suspension score and not audit, as asked in this thread.

It depends on what your OS reports: if it is “file not found”, then yes; otherwise it will likely affect the suspension score and place your node into containment mode. But then the writability/readability checkers will crash your node, even if your OS partially hangs because of the unavailable disk.
We recently added timeouts to these checkers to help avoid disqualification in exactly the cases where the OS partially hangs instead of throwing an error. Previously these checkers had no timeout and hung together with the OS, but since the node continued to respond to audit requests while being unable to provide a piece even after 3 attempts, each such audit failed with some delay, and after a few hours the node was disqualified.
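The idea behind a checker with a timeout can be illustrated roughly as follows. This is not the storagenode implementation (which is written in Go); it is just a sketch of the principle in Python, and all function names are hypothetical. The probe runs in a separate thread, so a hung disk cannot block the caller forever.

```python
import os
import tempfile
import threading
import time
from typing import Callable


def write_probe(directory: str) -> None:
    """Write and remove a small temp file; raises OSError if the disk is unusable."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"probe")
        os.fsync(fd)
    finally:
        os.close(fd)
        os.remove(path)


def simulated_hang(seconds: float) -> None:
    """Stand-in for an OS call that hangs on a half-dead disk."""
    time.sleep(seconds)


def run_with_timeout(check: Callable[[], None], timeout_s: float) -> str:
    """Run check() in a daemon thread; return 'ok', 'error' or 'timeout'."""
    outcome = {}

    def worker() -> None:
        try:
            check()
            outcome["status"] = "ok"
        except Exception:
            outcome["status"] = "error"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # The check is still stuck: report it instead of waiting forever.
        return "timeout"
    return outcome["status"]
```

Without the timeout, the caller would hang together with the probe, which matches the old behavior described above; with it, both "error" and "timeout" can be treated as a reason to stop the node before audits start failing.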

Anything that returns an unknown error on an audit request.

I’ve seen it in the changelogs and it helps but it’s not very reliable.
Out of 4 drive disconnects the node crashed on only 3; the fourth time it kept going, with constant log entries like “drive not found / not accessible / can’t read / write”. I don’t remember the exact wording. This was on 1.75; I didn’t test on 1.76. The result was a suspension score below 0.96 for a couple of hours but 1 on audit. That’s why I thought it was case 5.

Since only the suspension score went down, is the answer to what happened, also for OP, “unknown”?

The timeout has been in place since 1.75.2.

it’s case 5, because these errors are not known to auditors:

They are unknown to auditors, but not to humans :)