Failing suspension audits because of node restart

Pentium100 · April 14, 2023, 2:22pm

I got notified that a new version is available and restarted my node to update.
Then I noticed that I do not have tcp fastopen enabled, put the required setting in sysctl.conf and restarted the node again.
Then I noticed that fastopen is still disabled, modified the docker run command and restarted the node again.

Node was first stopped at 2023-04-14 14:01:03
Second time at 2023-04-14 14:01:39
And third time at 2023-04-14 14:05:43

This got my suspension audit score on us1 down to 0.965.

According to the logs, until the first time my node was stopped, it got 3655 audits from that satellite, all successful.
After the first restart, it got 0 audits (it was running for 20 seconds or so, not surprising).
After the second restart, it got 4 audits, all successful.
After the third restart (until the time I am writing this), it got 17 audits, all successful.
I assume the audits were successful since the count of “download started” and “downloaded” lines match.
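A minimal sketch of that log check, for anyone who wants to repeat it. It assumes that audit requests carry a `GET_AUDIT` action in the log line and that the messages are worded "download started", "downloaded" and "download failed"; the function name, the sample lines and the `SAT_A`/`SAT_B` IDs are hypothetical, so adjust them to whatever your own log actually contains.

```python
from collections import Counter

# Assumption: audit requests are tagged with this action name in the log line.
AUDIT_MARKER = "GET_AUDIT"


def count_audit_events(lines, satellite_id=None):
    """Count started / downloaded / failed audit lines, optionally per satellite."""
    counts = Counter()
    for line in lines:
        if AUDIT_MARKER not in line:
            continue  # not an audit request
        if satellite_id is not None and satellite_id not in line:
            continue  # audit from a different satellite
        if "download started" in line:
            counts["started"] += 1
        elif "download failed" in line:
            counts["failed"] += 1
        elif "downloaded" in line:
            counts["downloaded"] += 1
    return counts


# Hypothetical sample lines shaped like the assumed format.
sample = [
    'INFO piecestore download started {"Satellite ID": "SAT_A", "Action": "GET_AUDIT"}',
    'INFO piecestore downloaded {"Satellite ID": "SAT_A", "Action": "GET_AUDIT"}',
    'INFO piecestore download started {"Satellite ID": "SAT_A", "Action": "GET"}',
    'INFO piecestore download started {"Satellite ID": "SAT_B", "Action": "GET_AUDIT"}',
]

counts = count_audit_events(sample, satellite_id="SAT_A")
print(counts["started"], counts["downloaded"], counts["failed"])
```

If "started" and "downloaded" match and "failed" is zero, the audits that reached the node completed. Matching counts say nothing about audits that never reached the node while it was restarting.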

So, I got nearly suspended for restarting my node three times in quick succession? Really? Isn’t it a bit too harsh?

revyte · April 14, 2023, 4:32pm

You probably wouldn’t be suspended even if your suspension score went below 0.96. I was there, had a similar case, nothing happened, and hours later it was back at 1.

There are different cases which will affect your score but not necessarily lead to suspension. If your node responds with gibberish rather than success/failure/offline, it’s “case 5”.
For example, your node is online and can respond, but the data is not accessible. It can respond, but with neither correct nor incorrect data.
See case 5:

https://github.com/storj/storj/blob/20580cadd574b157e6c3eab32c1b0bdd0be44507/docs/blueprints/audit-suspend.md

# Storagenode "Suspension" State Blueprint

## Introduction

Currently, when a storagenode is audited for an erasure share, there are five possible outcomes:

1. Success: The node responds with the correct data
2. Failure: The node responds with incorrect data
3. Offline: The node cannot be contacted
4. Contained: The node can be contacted, but the connection times out before all the data can be received by the satellite
5. Unknown: The node responds with any other error

Only cases 1 and 2 directly affect a node's audit reputation, which can cause disqualification.

When the [downtime tracking service](./storage-node-downtime-tracking.md) is fully implemented, case 3 can indirectly cause a disqualification.

Case 4 can also indirectly cause disqualification, since a node placed in containment mode will be re-audited at some point with the same 5 potential outcomes.

Case 5 is the only situation where there is currently no potential penalty for responding to an audit with some type of error. Fortunately, having this case has allowed us to find, diagnose, and fix several problems with storagenodes, increasing network durability. Unfortunately, it allows us to perceive nodes that consistently respond to audits with unknown errors as "healthy", giving us an inflated view of durability.


Alexey · April 15, 2023, 4:24am

This will lead to containment mode, and after two more attempts to audit the same piece, with a 5-minute timeout each, the audit is considered failed (and it will affect the audit score, not the suspension score).
So you described case 4.

The suspension score is affected when none of cases 1-4 applies, for example a corrupted database when the audit was requested, or your node forcibly closing the connection during an audit.
In those cases the node responds to the audit almost immediately, with an unexpected error.

revyte · April 15, 2023, 7:10pm

Are you sure about this? I once had inaccessible data, but the audit score wasn’t affected; only the suspension score went down.
Case 4 reads like data is being transferred but the connection is lost while doing so.

On the other hand

As I understand it, exactly this would be a case 4.

Alexey · April 16, 2023, 3:43am

Yes. The suspension score is affected when the node answers an audit request and returns an unknown error. If the node finally provides the piece for audit within 3 attempts, the audit is considered passed.
Please note that the suspension score may be affected independently of the audit score, and both can be affected.
Known errors are: file not found, piece is corrupted, and the node answers but cannot provide the piece after 3 attempts with a 5-minute timeout each.

This will lead to a timeout error on the checker’s side, and your node will be placed into containment mode to be asked for the same piece two more times.

No, the checker will receive a “context canceled, connection forcible closed by the remote host”; this will affect the suspension score, and your node will be placed into containment mode.
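To keep the five cases apart, here is a rough sketch of the mapping as described in this thread and in the blueprint quoted above. It is not the satellite's actual code: the function names, parameters and error strings are hypothetical, and it simplifies by returning a single outcome per audit attempt, so it does not model a response that both lowers the suspension score and puts the node into containment.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = 1    # node responds with the correct data
    FAILURE = 2    # node responds with incorrect data or a known error
    OFFLINE = 3    # node cannot be contacted
    CONTAINED = 4  # connection times out; the same piece is re-audited later
    UNKNOWN = 5    # node responds with any other error


# Errors the thread lists as known to the auditor (labels are illustrative).
KNOWN_ERRORS = {"file not found", "piece is corrupted"}


def classify(reachable, timed_out=False, error=None, data_correct=True):
    """Map one audit attempt to one of the five outcomes (simplified)."""
    if not reachable:
        return Outcome.OFFLINE
    if timed_out:
        return Outcome.CONTAINED
    if error is not None:
        return Outcome.FAILURE if error in KNOWN_ERRORS else Outcome.UNKNOWN
    return Outcome.SUCCESS if data_correct else Outcome.FAILURE


def affected_score(outcome):
    """Which score the outcome touches directly, or None if neither."""
    if outcome in (Outcome.SUCCESS, Outcome.FAILURE):
        return "audit"
    if outcome is Outcome.UNKNOWN:
        return "suspension"
    return None  # offline and contained have no direct effect on these two scores


print(affected_score(classify(reachable=True, error="file not found")))
print(affected_score(classify(reachable=True, error="drive not accessible")))
```

Under these assumptions, "file not found" counts against the audit score, while an error the auditor does not recognise counts against the suspension score only.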

revyte · April 16, 2023, 3:54am

So it should be better to cut the connection than to have an inaccessible file. For example, if the drive disconnects while the node is still running, are those all audit failures? Or is there a difference between failing satellite audits and responding “file not found” to client requests?

If that’s the case I would also like to know what could have impacted the suspension score and not audit, as asked in this thread.

Alexey · April 16, 2023, 4:56am

It depends on what your OS reports: if “file not found”, then yes; otherwise it will likely affect the suspension score and place your node into containment mode. But then the writeability/readability checkers will crash your node, even if your OS partially hangs because the disk is not available.
We recently added timeouts to these checkers to help avoid disqualification in exactly those cases where the OS partially hangs instead of throwing an error. Previously these checkers had no timeout and hung together with the OS, but since the node continued to respond to audit requests yet could not provide a piece even after 3 attempts, each such audit failed after some delay, and after a few hours the node became disqualified.

Anything that returns an unknown error on an audit request.

revyte · April 16, 2023, 5:11am

I’ve seen it in the changelogs, and it helps, but it’s not very reliable.
Out of 4 drive disconnects, the node crashed on only 3; the fourth time it kept going with constant log entries like “drive not found / not accessible / can’t read / write”. I don’t remember the exact wording. This was on 1.75; I didn’t test on 1.76. The result was a suspension score below 0.96 for a couple of hours, but 1 on audit. That’s why I thought it was case 5.

Since only the suspension score went down, is the answer to what happened, for the OP as well, “unknown”?

Alexey · April 16, 2023, 5:40am

The timeout has been implemented since 1.75.2.

It’s case 5, because these errors are not known to auditors:

They are unknown to auditors, but not to humans.