We should disqualify the unreliable node, or somehow remove it from the node selection process, and deal with it later, so that customers are not affected. All pieces on unreliable nodes must be marked as unhealthy to trigger repair early, before the number of available pieces falls below the threshold.
There is no alternative: losing customers would cause the network to fail. No customers, no nodes, it’s as simple as that. So a failing node must be isolated ASAP. Hence the short time interval.
At the moment, suspension is implemented only for unknown errors, meaning errors that are not “file not found”, not timeouts, and not piece corruption.
This suspension for unknown errors was implemented to figure out what those errors could be, so that their detection can be added to the pre-flight check and the storage monitor function. Once detection has been added for all remaining classes of unknown errors, this suspension should rarely be triggered in the future.
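To make the distinction concrete, here is a minimal sketch of how audit outcomes could be split between the two consequences described above. It is not the satellite's actual code: the names `AuditOutcome` and `counts_toward` are hypothetical, and the mapping only restates what is said in this post (unknown errors lead to suspension, the known failure classes count toward disqualification).

```python
from enum import Enum


class AuditOutcome(Enum):
    """Hypothetical audit result classes, named after the ones in this post."""
    SUCCESS = "success"
    FILE_NOT_FOUND = "file not found"
    TIMEOUT = "timeout"
    CORRUPTED = "corrupted"
    UNKNOWN = "unknown"


def counts_toward(outcome: AuditOutcome) -> str:
    """Return which penalty an audit outcome counts toward (assumed mapping)."""
    if outcome is AuditOutcome.SUCCESS:
        return "nothing"
    if outcome is AuditOutcome.UNKNOWN:
        # Only unknown errors are covered by suspension at the moment.
        return "suspension"
    # "file not found", timeouts and corruption count as failed audits.
    return "disqualification"
```

Under this sketch, moving a newly understood error out of `UNKNOWN` and into its own class is what would make suspension trigger less often over time.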
The timeout error is still an issue. When the hardware or OS becomes unresponsive for any reason (usually out of RAM, out of space on the system drive, a dying HDD, RAM corruption, etc.), the node responds to the audit request (so it’s online) but cannot provide the piece for the audit (because the underlying OS functions run too slowly or hang). This shows that the node is not reliable, so it should be disqualified ASAP if too many audits fail, so that customers and data are not affected. If you let it survive too long without marking its pieces as unhealthy, you can easily end up in a situation where there are not enough pieces left for recovery.
Suspension can protect such a node from disqualification and gives it a grace period to fix the issue before the actual disqualification. The node is also treated as unhealthy by the repair service. So almost all the downsides of disqualification are included, except for using the held amount to pay for recovery if the number of healthy pieces falls below the threshold.
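As a rough illustration of why a suspended node still hurts the repair budget, here is a small sketch of the health check described above. Everything in it is an assumption for illustration: the `Node` class, the function names and the threshold value are hypothetical and are not taken from the satellite code.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a node holding one piece of a segment."""
    node_id: str
    suspended: bool = False
    disqualified: bool = False


def healthy_piece_count(nodes: list[Node]) -> int:
    """Count pieces held by nodes that are neither suspended nor disqualified."""
    return sum(1 for n in nodes if not (n.suspended or n.disqualified))


def needs_repair(nodes: list[Node], repair_threshold: int) -> bool:
    """Trigger repair once healthy pieces fall below the (assumed) threshold."""
    return healthy_piece_count(nodes) < repair_threshold


# Five pieces, repair threshold of 4 (illustrative numbers only).
segment = [Node(f"node-{i}") for i in range(5)]
segment[0].suspended = True     # treated as unhealthy, same as disqualified
segment[1].disqualified = True
repair_now = needs_repair(segment, repair_threshold=4)
```

In this sketch the suspended node and the disqualified node reduce the healthy count in exactly the same way, which is the point made above: the repair cost is the same, only the source of the money differs.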
Unknown errors are rare, so the loss of money for the satellite operator is relatively small. Timeout errors are a much more frequent issue, followed by “file not found” and “corrupted” issues.
Perhaps, to enable suspension for audit failures due to timeouts, we should implement use of the held amount. The price of recovery is high, and the node’s held amount might not be enough to cover the costs. As a result, your node would be suspended, lose all of its held amount, have its reputation zeroed (to force it to start collecting the held amount again from the 75% level), and have its data slowly deleted, but while the data is still there it would consume valuable space.
Do you still think that’s better than disqualification, after which you can start from scratch?
If so, please make a feature request so that the Team can take it into consideration. Please put yourself in the satellite operator’s and the customers’ shoes.
Or even better, make a pull request on GitHub.