Refining Audit Containment to Prevent Node Disqualification for Being Temporarily Unresponsive

BrightSilence · August 25, 2021, 3:26pm

Hey all, as a follow-up to recent discussions around nodes being disqualified for being temporarily unresponsive, I decided to collect info from discussions spread out over several topics and suggest a solution that would help. You can find the PR with the blueprint here: docs/blueprints: Refining Audit Containment to Prevent Node Disqualification for Being Temporarily Unresponsive by

Here’s the blueprint itself: storj/audit-containment-v2.md at 3d8008f6f77afb916f84afc9e452d92532847ffc · storj/storj · GitHub

Thanks @pietro for kicking off this discussion with your suggestion and @Alexey for the extensive back and forth to refine this idea. And of course everyone else who contributed or reported issues with their nodes. Please vote if you want this blueprint to be implemented.

nerdatwork · August 25, 2021, 3:58pm

First of all, it’s impressive that you went to the length of taking into account the issues SNOs are facing and coming up with a way to resolve them.

Let me be the first to thank you for the hard work.

Now could you please elaborate on what you meant by “crash the node”? Crashing in general signals an abrupt end to normal processing. If you meant shutting down the node altogether to overcome the issue, then that would sound less abrupt.

BrightSilence · August 25, 2021, 4:00pm

Yeah, I meant throwing a fatal error and shutting down.

Btw, that part of the blueprint is not well defined yet. The further considerations category is mostly just some more thoughts on the same subject, but not required for the suggested change.

Pac · August 25, 2021, 9:02pm

I still feel like the node should simply start rejecting ingress pieces (and egress too, probably) when the disk (or the system) is starting to stall and before it becomes completely unresponsive, so it has time to recover, while still being able to fulfil audit requests.

But I guess that wouldn’t solve all issues, especially if the node becomes fully unresponsive because of something else…
That’s tricky

jammerdan · August 26, 2021, 8:11am

I hope this will get some votes and attention to make sure there will be at least a discussion about how the system can be changed so that nodes don’t get disqualified within a few hours, without the chance for the SNO to fix an underlying issue.
I have just skimmed through the text and I don’t know if the suggestions will solve the issues entirely, but I noticed there is already a discussion going on at GitHub. So I hope changes will be made for the better.
I think quick suspension instead of quick disqualification is the way to go. I also think a node in suspension does not need to receive or send traffic or get paid, so there is no incentive for a bad actor to game the system. And if satellites keep track of the pieces requested for audits and keep requesting them from a node until they get a final response, then it should be hard for a bad actor to delete masses of data and escape auditing by intentionally timing out the responses to the satellite’s requests.

Maybe some more is required, but this could give honest and long-term node operators a chance to fix issues and recover their nodes from a temporary failure. This current situation definitely needs a change as I don’t believe SNOs who have invested a lot of time into running their nodes reliably should be punished for temporary issues in such a harsh way.
Furthermore, I believe SNOs whose only remaining option after disqualification is to start a new node are rather likely to quit the network entirely. Maybe they play this game 2 or 3 times, but after that it is probably no longer worth it for them to invest any more time into running a node.
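
The tracking idea in this post (remember which piece timed out and keep asking for it until the node gives a final answer) could be sketched roughly as follows. This is only an illustration, not the satellite's actual code: the class name `PendingAudits`, its methods, and the outcome labels `"success"`, `"failure"` and `"timeout"` are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PendingAudits:
    """Hypothetical bookkeeping of audits that timed out without a final answer."""

    # node_id -> set of piece IDs still awaiting a final response
    pending: dict = field(default_factory=dict)

    def record(self, node_id, piece_id, outcome):
        """Store the result of one audit attempt.

        outcome is one of "success", "failure" or "timeout" (assumed labels).
        """
        pieces = self.pending.setdefault(node_id, set())
        if outcome == "timeout":
            # No final answer yet: keep asking for this exact piece.
            pieces.add(piece_id)
        else:
            # Success or failure is a final answer, so the piece is cleared.
            pieces.discard(piece_id)

    def is_suspended(self, node_id):
        """A node stays suspended while any audited piece is unanswered."""
        return bool(self.pending.get(node_id))

    def next_piece(self, node_id, random_piece):
        """Re-request a pending piece first; otherwise audit a random one."""
        pieces = self.pending.get(node_id)
        if pieces:
            return min(pieces)  # deterministic choice, just for the sketch
        return random_piece
```

Because a timeout never clears the pending piece, a node that deleted data cannot dodge the audit by stalling: it stays suspended until it either returns the piece or fails the audit.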

Alexey · August 26, 2021, 8:32am

I assure you, the team is working on it. And you can actually see that on GitHub.
@BrightSilence thank you for the blueprint!

BrightSilence · August 26, 2021, 8:58am

Yeah, I’m getting some pushback from @littleskunk, which I appreciate. If an idea can’t stand up to such scrutiny it doesn’t deserve to be implemented. But I think in this case it can stand up to it, so I’m happy to defend it further and clarify.

If this doesn’t get enough people within Storj on board, then that’s a shame, but it’s fine. It’s a lose the battle, win the war kind of thing if it brings more attention to the fact that this is a serious concern of the SNO community. I’m sure there will be a good solution in the end, as goals seem to be mostly aligned on this. The debate is more about the implementation than the goal, which is a good place to be in.

Pentium100 · August 26, 2021, 2:26pm

The blueprint looks OK to me.

In general, if the goal is to allow regular people to run nodes, reaction time has to be taken into account. So, in my opinion, Storj should pick one (and only one) option:

1. Require very fast reaction times (a few hours), advertise that nodes should preferably be run only by datacenters that can provide good monitoring and fast reaction times.
2. Advertise that nodes preferably should be run by regular people and relax the reaction time requirements to at least a few days, since a regular person does not have a datacenter and employees who can monitor the node and fix any problems within a few hours.

Advertising #2 and then punishing node operators for not providing the quality of service that would be expected of #1 is, in my opinion, a bit dishonest.

I am OK with whatever decision Storj makes, as long as it is clearly communicated and people are not deceived into running nodes on hardware and other capabilities that are below what Storj expects of them.

jammerdan · August 27, 2021, 12:59pm

The discussion on GitHub seems to be taking an interesting route now:

docs/blueprints: Refining Audit Containment to Prevent Node Disqualification for Being Temporarily Unresponsive by ReneSmeekes · Pull Request #4179 · storj/storj · GitHub

If my math is correct, 10 reverifies would give a bigger node about 9 hours. This brings up an interesting question: how about tracking time and not a count? This would make it fair for a node of any size. Timing out a few audits in a row = suspension. The satellite continues to request the same pieceID for the next 7 days (just to take some extreme values). Small nodes might get 7 audits in total and bigger nodes maybe 7000 audits for the same pieceID. No change in behavior = disqualification.

Time is definitely the better alternative. Final duration can either be fixed or even depend on the node age, so that older nodes could get some bonus days before the ban hammer.
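
To make the time-based idea concrete, here is a minimal sketch of how a satellite might decide between suspension and disqualification. The 7-day window comes from the quote above (where it is only an example value); the bonus of one day per year of node age, the 7-day cap on that bonus, and the function names are purely hypothetical numbers and names picked for illustration.

```python
from datetime import datetime, timedelta

BASE_WINDOW = timedelta(days=7)     # example value from the quote above
BONUS_PER_YEAR = timedelta(days=1)  # hypothetical bonus per full year of node age
MAX_BONUS = timedelta(days=7)       # hypothetical cap on the bonus


def grace_window(node_age):
    """Time a node gets to answer the pending audit before disqualification."""
    full_years = node_age.days // 365
    return BASE_WINDOW + min(BONUS_PER_YEAR * full_years, MAX_BONUS)


def node_state(first_timeout, now, node_age, answered):
    """Return "ok", "suspended" or "disqualified" for a contained node."""
    if answered:
        # The node finally returned a verdict for the pending piece.
        return "ok"
    if now - first_timeout >= grace_window(node_age):
        return "disqualified"
    return "suspended"
```

With these numbers a new node is disqualified 7 days after its first unanswered audit, while a node that is a bit over two years old gets 9 days. The outcome depends only on elapsed time, not on how many audits the node happened to receive.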

Pentium100 · August 27, 2021, 3:07pm

I agree. As long as there’s one last audit right before disqualification. This would avoid a problem where a node is fixed, but does not get any more audits and is disqualified.

jammerdan · August 27, 2021, 3:16pm

Good point. Yes it would be annoying if the node has been fixed and it gets disqualified without any more audit happening before the time is up.

Maybe audit frequency should increase towards the end.
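
One way to get both properties (more frequent audits towards the end, plus a guaranteed last audit right before disqualification) would be to schedule each retry halfway through the remaining time. This is just a sketch of that idea; the function name `reverify_offsets` and the `min_gap` parameter are made up for this example.

```python
from datetime import timedelta


def reverify_offsets(window, min_gap=timedelta(minutes=10)):
    """Offsets from the start of suspension at which to re-request the piece.

    Each retry lands halfway through the time that is left, so retries get
    closer together as the deadline approaches. The final retry is always
    placed at the deadline itself.
    """
    offsets = []
    elapsed = timedelta(0)
    while window - elapsed > min_gap:
        elapsed += (window - elapsed) / 2
        offsets.append(elapsed)
    offsets.append(window)  # one last audit right before disqualification
    return offsets
```

For an 8-hour window with a 1-hour minimum gap this yields retries at 4, 6, 7 and 8 hours, so a node that was fixed late still gets a chance to pass before the window closes.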

thepaul · August 27, 2021, 4:21pm

It’s not very visible yet, but we’ve been discussing this proposal a lot internally and working out how it fits into plans we were previously formulating to solve the same issue. It’s a really good proposal, and I think it will end up being adopted, albeit possibly with some slight changes.

BrightSilence · August 27, 2021, 9:19pm

Thanks for getting back to us @thepaul. Good to hear it’s being discussed. Let me know if there is more I can do from my end and it would be great if you could keep us posted on the progress every now and then!

BrightSilence · August 27, 2021, 9:23pm

This will most likely be the case, since that just so happens to be the more economical way to implement that timeout anyway. That is, if a time-based approach is chosen for the final implementation.

I’m very happy with the direction this has taken. After clearing up the initial confusion, the back and forth is leading to good improvement suggestions. And that’s even without knowing what exactly has been discussed internally. I’m certain we’ll see something good come out of all the efforts to tackle this problem.

Pac · August 27, 2021, 9:50pm

Me too. Great job you guys!