Yes, I could definitely see that working. It would slow down the disqualification without giving abusers a way to get out of an audit and also give an earlier warning due to disqualification. It checks all the boxes I think.
What I don’t like about it is the added complexity and having to track different statuses. Like: is this the first, second, third or a later piece that is timing out, with different rules for each scenario? And when do we decide to reset that counter?
I think I may have a suggestion that builds on what you just described, serves the same purpose but comes with a lot less complexity, by reusing systems already in place and generalizing the rules.
Basically it would be a tweak to how containment works.
- When an audit times out, enter containment mode.
- Each timed-out audit in containment mode is considered an unknown failure and reduces the unknown score.
- Keep auditing the same stripe up to 10 times. The first 9 timeouts each reduce the unknown score, but at the 10th, give up on that piece, count it as an audit failure and hit the audit score. Then exit containment.
This process is very similar to the current containment implementation. The differences are that it allows more tries and that it doesn’t just ignore timeouts a few times, but counts them as unknown failures instead. The rest of the systems could be kept exactly as is. It’s a small tweak, but it basically gets the same result as what you suggested.
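To make the tweak concrete, here is a minimal Python sketch of the rules above. It is not the satellite’s actual code: `NodeAuditState`, `record_audit` and the outcome strings are names I made up for illustration, and I’m assuming that any definitive answer (success or failure) ends containment, which the rules above don’t spell out.

```python
from dataclasses import dataclass

# From the proposal: 9 unknown failures, then the 10th timeout is an audit failure.
MAX_CONTAINMENT_TRIES = 10


@dataclass
class NodeAuditState:
    """Hypothetical per-node bookkeeping, not the real data model."""
    contained: bool = False
    tries: int = 0             # timed-out attempts on the contained stripe
    unknown_failures: int = 0  # hits against the unknown score
    audit_failures: int = 0    # hits against the audit score


def record_audit(state: NodeAuditState, outcome: str) -> str:
    """Apply one audit outcome and return which score was hit.

    outcome is "success", "failure" or "timeout".
    The return value is "none", "unknown" or "audit".
    """
    if outcome not in ("success", "failure", "timeout"):
        raise ValueError(f"unexpected outcome: {outcome!r}")

    if outcome == "timeout":
        # A timeout enters (or stays in) containment on the same stripe.
        state.contained = True
        state.tries += 1
        if state.tries < MAX_CONTAINMENT_TRIES:
            state.unknown_failures += 1
            return "unknown"
        # 10th timeout: give up on the piece and exit containment.
        state.audit_failures += 1
        state.contained = False
        state.tries = 0
        return "audit"

    # Assumption: a definitive answer ends containment either way.
    state.contained = False
    state.tries = 0
    if outcome == "failure":
        state.audit_failures += 1
        return "audit"
    return "none"
```

Feeding it ten timeouts in a row gives nine hits on the unknown score followed by one hit on the audit score, after which the node is out of containment again.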
At the current settings, this would allow you to run into this issue with a single piece and not really run into any suspension or disqualification (I believe you currently need 10 unknown failures to be suspended). But as soon as a second one starts timing out, you will hit suspension and get alerted. Probably fair, right? Since if it is rare corruption, the chances of a second piece failing right after the first should be basically 0.
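A rough way to check that claim is to count timeouts until suspension. This sketch treats the unknown score as a plain counter with a threshold of 10, which is only my guess from above and a simplification of how the score really works. The function name and parameters are made up for illustration, and it assumes every audit times out with no recovery in between.

```python
def timeouts_until_suspension(threshold: int = 10, unknown_hits_per_piece: int = 9):
    """Return (piece_number, total_timeouts) at which suspension would trigger.

    Simplified model: each piece is retried unknown_hits_per_piece + 1 times.
    All but the last timeout count as unknown failures, the last one counts
    as an audit failure. Suspension triggers once the number of unknown
    failures reaches the threshold.
    """
    if threshold < 1 or unknown_hits_per_piece < 1:
        raise ValueError("threshold and unknown_hits_per_piece must be >= 1")

    unknown = 0
    timeouts = 0
    piece = 0
    while True:
        piece += 1
        for attempt in range(1, unknown_hits_per_piece + 2):
            timeouts += 1
            if attempt <= unknown_hits_per_piece:
                unknown += 1
                if unknown >= threshold:
                    return piece, timeouts
            # The final attempt on a piece hits the audit score instead.


piece, timeouts = timeouts_until_suspension()
print(piece, timeouts)
```

Under those assumptions a single stuck piece tops out at 9 unknown failures, so suspension only triggers on the first timeout of a second piece.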
The result would be that you get a suspension (and the warning email connected to that) in a third of the time it now takes to disqualify, while at the same time allowing for over 3x as much time to resolve the issue, without giving anyone the chance to avoid having to cough up the data for an audit.
If my other suggestion is implemented, the dynamics change a little. You could have about 3 consecutive failures before you get suspended. I think that would still be fine, especially for larger nodes. By making it dynamic based on node size, this could be tuned further.
So then what’s left is to decide whether the consequences of suspension should change from what they are now. I believe that currently egress still happens on suspended nodes. In these scenarios where nodes are stuck, all that does is guarantee a failed transfer among the bunch, which hurts customer experience if it happens too often. So that’s probably not a great idea. I would say this probably goes for other scenarios of suspension as well. So maybe just get rid of egress for suspended nodes to begin with. Why keep giving them good egress income if they are not living up to the reliability requirements? They’ll get egress again when they’ve resolved the issue and have recovered their scores. So I would suggest audits only at that point, and marking their pieces as unhealthy so they get picked up for repair. This gives an added incentive to fix things quickly, because you are losing data to repair and future income as a result.
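Spelled out as a tiny Python sketch, the suspension behaviour I’m proposing would look something like this. `TrafficPolicy` and `proposed_policy` are made-up names, and this only models my suggestion, not what the satellite does today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficPolicy:
    """Hypothetical summary of how a node is treated."""
    serve_egress: bool             # may the node serve customer downloads?
    receive_audits: bool           # does the node keep getting audited?
    pieces_count_as_healthy: bool  # False means pieces are eligible for repair


def proposed_policy(suspended: bool) -> TrafficPolicy:
    """Proposed rule: suspended nodes get audits only."""
    if suspended:
        return TrafficPolicy(
            serve_egress=False,
            receive_audits=True,
            pieces_count_as_healthy=False,
        )
    return TrafficPolicy(
        serve_egress=True,
        receive_audits=True,
        pieces_count_as_healthy=True,
    )
```

Keeping audits on is what lets a suspended node recover its scores and get egress back.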
What do you think?