Got Disqualified from saltlake

Yeah, I actually remembered your post while typing one of my previous messages here. However, I believe it solves a different problem, not the one discussed here. Please consider that your simulations assume that subsequent audit failures are independent. That nicely models a hard drive with bad sectors. But in the case of transient system-level problems, like heavy swapping because another task hosted on the same box took all the memory and then some, the node may fail every audit until the box is restarted.
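
To illustrate, here is a minimal sketch of a forgetting-factor ("beta reputation") audit score, similar in spirit to what the audit scoring discussions describe, but with purely illustrative parameters (the lambda, weight, failure rates and thresholds below are my assumptions, not the proposal's actual values). The same number of failures barely moves the score when they are spread out independently, but drags it down quickly when they come in one burst:

```python
# Minimal sketch, not Storj's actual implementation.
import random

LAMBDA = 0.999   # forgetting factor (illustrative)
WEIGHT = 1.0     # audit weight (illustrative)

def update(alpha, beta, success):
    """One audit-score update: v = +1 for a passed audit, -1 for a failed one."""
    v = 1.0 if success else -1.0
    alpha = LAMBDA * alpha + WEIGHT * (1 + v) / 2
    beta = LAMBDA * beta + WEIGHT * (1 - v) / 2
    return alpha, beta

def run(audits, fail_prob=0.0, burst=0):
    """Start from a 'perfect' steady state, then apply audits.
    fail_prob: independent per-audit failure probability (bad-sector model).
    burst: number of consecutive failures at the start (sick-box model)."""
    alpha, beta = WEIGHT / (1 - LAMBDA), 0.0   # steady state after many successes
    for i in range(audits):
        if burst:
            ok = i >= burst                    # fail everything until "restart"
        else:
            ok = random.random() >= fail_prob  # independent failures
        alpha, beta = update(alpha, beta, ok)
    return alpha / (alpha + beta)

random.seed(1)
print("~40 failures spread over 2000 audits:", round(run(2000, fail_prob=0.02), 4))
print("40 failures in a row:               ", round(run(40, burst=40), 4))
```

With these illustrative numbers, roughly 40 failures spread over 2000 audits leave the score near 0.98, while the same 40 failures in one burst pull it down to about 0.96.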

You state there that your experimental choice of parameters would allow 40 audits in a row to be failed. From my point of view, this is too much and too little at the same time. It's too little when you consider that Storj now tells operators they can go on vacation, because if their nodes are offline for two weeks, that's not a big problem. Saying that they'd then have only about 40 hours to fix a misbehaving node makes that statement much weaker. It's too much because the node is still considered a candidate for fresh uploads and downloads, which will obviously fail the same way the audits do, making the customer experience worse.
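
For intuition on where a figure like 40 might come from: in the same sketch as above, a node that starts at a perfect score and then fails every audit ends up with a score of roughly LAMBDA**n after n failures, so the number of consecutive failures tolerated is about log(threshold)/log(LAMBDA). The threshold and the audit rate below are again my assumptions, only meant to show the scale of the mismatch:

```python
import math

LAMBDA = 0.999    # illustrative forgetting factor (same as the sketch above)
THRESHOLD = 0.96  # illustrative disqualification threshold

# From a perfect steady state, n consecutive failures leave the score at ~LAMBDA**n,
# so the threshold is crossed once LAMBDA**n < THRESHOLD.
n = math.log(THRESHOLD) / math.log(LAMBDA)
print(f"consecutive failed audits tolerated: ~{n:.0f}")  # ~41 with these values

# At roughly one audit per hour (which is what "40 audits" ~ "40 hours" implies),
# that is under two days to notice and fix a sick node, versus the ~336 hours
# (two weeks) of downtime that operators are told is acceptable.
print(f"hours to react at ~1 audit/hour: ~{n:.0f}, vs. 336 for a two-week vacation")
```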

That's why I like the idea from the other thread: it explicitly puts a node that is failing audits into a special state, not trusted enough to handle traffic, but still with hope of recovery.
