Tuning audit scoring

Back to this. There is a clear proposal here to change the beta parameters, and a lot of ideas about how the overall reputation system could be improved, but those are bigger steps / tasks.

So I think the proposal here is to update the numbers first, and improve the other parts of the reputation system in a next step (if that makes sense…).

1 Like

Going by conversations on the forum it seemed that this was not yet in place. But maybe I’m remembering this wrong. If it is already in place, I see no reason to deviate from what already exists.

+1 from me on that. :slight_smile:

2 Likes

Yeah, it sounds reasonable to update the numbers first and then move on to larger changes. I think I addressed all relevant comments from my end in my initial response here.

In summary, in my opinion the suggested numbers would be an improvement, but there are better settings to go with. @thepaul mentioned some issues with transitioning to those settings, and I’m not certain those have been resolved.

So in short, I would prefer lambda = 0.999 and threshold = 0.95 or 0.96 to smooth out the scores. But the proposed changes by @thepaul would already be an improvement, so if that’s the best we can get, it’s better than what we have now even if it doesn’t get rid of the volatility.

But I would like to reiterate that this volatility was one of the major concerns from the SNO community, and the solution seems to shift more towards protecting the network from bad nodes. I understand that has priority, but I don’t want it to overshadow the SNO concerns raised here. Even with the proposed changes, the remaining volatility would still mislead node operators who are trying to fix issues, by allowing scores to recover significantly even when nothing has actually been resolved.

1 Like

Sorry for the dead air from my end. I still want to find out what rate of data loss our model can tolerate before we make a decision on the ideal lambda and disqualification threshold. Give us just a little more time to make that determination; I need to loop in some additional people.

5 Likes

It’s very understandable that it takes time; better safe than sorry in this case, as it could bring down the entire network…
Maybe some sort of simulation or actual tests on the test net might be the way to go…
Math quickly takes on a life of its own, especially when it interacts with the real world.

There is no rush to get this fixed; it’s been running like this for ages, so it’s fine. But a solution would be better ofc :smiley: once it’s theoretically sound and verifiably good.

2 Likes

I’m back!

After re-reviewing this whole thread, I’ve re-run the simulations with two significant changes and one added evaluation criterion:

  1. It’s fine to start with an alpha >= 1/(1-lambda) (see the note after this list).
  2. It’s fine to reset all node reputations as a one-time change.
  3. An added point of evaluation is “how long does it take to get from a perfect score to a DQ when (apparent) data loss is suddenly 100%?” We want this to be larger than 10 audits, but probably less than hundreds of audits.
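
A quick note on point 1, under the same assumption used throughout this thread (the plain alpha/beta update with weight 1 per audit; the production code may weight audits differently): for a node that passes every audit, alpha converges to the fixed point of the success update, which is why 1/(1-lambda) is the natural “full reputation” starting value:

```
\alpha^{*} = \lambda\,\alpha^{*} + 1
\quad\Longrightarrow\quad
\alpha^{*} = \frac{1}{1-\lambda} = \frac{1}{1-0.999} = 1000
```

That is where the initial alpha = 1000 below comes from.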

Given those, I don’t see any set of parameters that does significantly better than @BrightSilence’s:

  • lambda=0.999
  • DQ threshold=0.96
  • initial alpha=1000

With those parameters, it takes around 40 consecutive failed audits to go from a perfect score to a DQ. On a busy node, that still is not a very significant amount of time, but it is at least several times larger than what it was. If we have the writeability check timeout added as well, this seems like it could be an acceptable situation.
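
For anyone who wants to sanity-check that number, here is a minimal simulation sketch of the alpha/beta update as described in this thread. It assumes weight 1 per audit (the production code may differ in details), and the “current-style” values in the second call are just the ones commonly quoted on the forum, not something pulled from production config:

```python
# Minimal sketch of the audit reputation update discussed in this thread.
# Assumes weight 1 per audit; the production implementation may differ.

def failures_to_dq(lam, alpha, beta, threshold):
    """Count consecutive failed audits until the score drops below the DQ threshold."""
    failures = 0
    while alpha / (alpha + beta) >= threshold:
        alpha = lam * alpha        # failed audit: no success mass added
        beta = lam * beta + 1.0    # failed audit: one unit of failure mass added
        failures += 1
    return failures

# Proposed values: lambda=0.999, initial alpha=1000, threshold=0.96  ->  41
print(failures_to_dq(0.999, 1000.0, 0.0, 0.96))

# Current-style values as quoted on the forum: lambda=0.95, alpha=20, threshold=0.6  ->  10
print(failures_to_dq(0.95, 20.0, 0.0, 0.6))
```

Starting from alpha = 1/(1-lambda) and beta = 0, the score after n consecutive failures works out to exactly lambda^n, so the count is simply the smallest n with lambda^n below the threshold (about 41 for the proposed values).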

The grace period I was suggesting (no disqualifications before 50 audits have been completed) no longer makes a significant difference with this high initial alpha, so I’m taking that out.

So the new change plan is:

  1. Change lambda to 0.999
  2. Use alpha=1000, beta=0 for new nodes
  3. Reset all node reputations to alpha=1000, beta=0
  4. Change the DQ threshold to 0.96

I think we can even do all of these at the same time. What does everyone think?

3 Likes

With the new numbers, how quickly (in hours) can a node be disqualified with no real loss of data, but with apparent loss of data (frozen IO subsystem, overload leading to timeouts etc)?

1 Like

This change isn’t really a fix for the issue of nodes getting DQ’d due to system problems.
It’s meant to ensure that in the case of minor data loss the audit score fluctuation is greatly reduced, to avoid near-random DQs in such cases.

I think the points are best described by Bright in the post below.

I still think we should have some sort of local fuse-type trigger which would shut down a node that starts rapidly failing audits, but that is really a local thing rather than a network thing.

Ofc with a too-effective fuse, nodes will never get DQ’d, which is partly why I never bothered making a node fuse script, as it would be a bad feature to introduce, even though it’s very possible to script something like that.

I think it would be great if there was a fuse in the node itself… so that one got like 3 tries to fix the issue before the node died…

Say the node fuse feature would take a node offline after a rapid, unexpected 10% drop in audit score.

But it will change the situation for people affected by the problems described by @Pentium100. I believe it would be worth checking as well.

Also, it would be nice to know, in case of real problems, how soon the SNO will get feedback that their fix was correct after they fix the problem.

The change now suggested is, as far as I can tell, almost exactly what @BrightSilence suggested and gave his reasons for in the post I just linked.

It improves the audit score behavior in case of failed audits. If you have experienced data loss on a node, you will know that the audit score jumps around… not a little but a lot, and if it ever touches 60% the node is DQ’d instantly.

This change makes the audit score less volatile, so it won’t just go back to 100%, then drop to 90%, only to go back to 100% maybe an hour later.
It also, as thepaul says, increases the number of failed audits required for DQ,
which in the current system can be something like 9 audits.
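
(For what it’s worth, if the current values really are around lambda = 0.95 with a 60% DQ threshold, as has been quoted on the forum, then starting from a steady 100% score the consecutive-failure counts work out to roughly:

```
0.95^{\,n} < 0.6 \;\Longrightarrow\; n > \frac{\ln 0.6}{\ln 0.95} \approx 10
\qquad\text{vs.}\qquad
0.999^{\,n} < 0.96 \;\Longrightarrow\; n > \frac{\ln 0.96}{\ln 0.999} \approx 41
```

so roughly a 4x increase in the number of consecutive failed audits needed for a DQ.)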

So really, no matter which viewpoint one has, this change is a big improvement in all cases…

The only nodes that will suffer from this change are the ones that have been lucky enough to survive with more than 4-5% data loss.

I have also had a lot of weird things happen with my storage over the few years I’ve been running storage nodes, and even with loss of contact with the storage, as long as the node doesn’t shut down it doesn’t seem to affect the audit score at all…

I’ve had nodes run for 4-6 hours without any contact with the storage media, because it was stalled out due to overload, and barely even saw a drop in suspension score from it.
So I don’t know how real the problem of unfair DQs actually is, outside of the ones that happen due to audit volatility.

@thepaul
I think it looks great; couldn’t have hoped for better.
Also pleased that this hasn’t been rushed…
It will be interesting to see what Bright says… but as it’s basically his suggestion, I doubt he will disagree with this change.
But I might be missing something; I haven’t exactly done the deep dive into this that he did.

Yeah, damaged or corrupted files aren’t really forgiven… it took like 18 months before a node I lost a few files on stopped randomly dropping to a 95% audit score.
I ran the node for 2 minutes and then did an rsync without removing the --delete parameter, so the newly uploaded files were removed…

But it has now returned to 100% afaik… I rarely check on it, so it might drop for short periods… however, it’s been a long time since I noticed it.

Yeah, this is true, but then we get into some of the subjects discussed earlier in this thread; the current algorithm isn’t one that can easily be changed, though I forget the exact reasons.

So although the parameters can be changed, the method remains the same, and sadly that method does mean larger nodes will be DQ’d faster, because they see more audits in less time.

Yet, like I said, I’ve had plenty of system issues and have also dared to experiment at times, letting my 17TB node sit stalled for hours to see if it would actually recover… but it never actually failed an audit because of it.

So I’m not convinced it’s a real problem.
It could simply be that those who actually get DQ’d had bad storage and just think they were unfairly DQ’d.

If there is one thing I’ve found while running larger storage setups, it’s how unreliable and completely random HDDs can be.
So in cases of unreliable storage behavior it’s possible that people might see a DQ without the disk actually being broken, due to, let’s say, a bad cable corrupting writes…

Which is why I always recommend that larger, multi-year-old nodes run with redundancy, but don’t recommend it for new nodes…

Again, this change to the audit score algorithm parameters should make the score more stable, and thus it will be easier to see when it starts to drop, because it won’t jump straight back to 100%.

So yeah… a node fuse type feature might be good, but I’m not sure it’s really needed…
I haven’t seen any signs that it is, and I’m running quite a number of nodes.
But it’s something I worried about, which is why I have been trying to determine whether it’s something I should be worried about, and these days I don’t really worry about that part…

Thanks @thepaul for getting back to us with this. I’m happy to see you liked my suggested parameters.

It won’t surprise you to hear I think this is great. I think it ticks all the boxes, with the exception of still DQ’ing nodes a little too fast when temporary issues make them fail all audits. But to be honest, I think it’s impossible to fix that just by changing the parameters without making DQ too slow for legitimately bad nodes.

I haven’t redone the testing on this now, since I already tested these exact parameters when I posted my previous suggestion, and it showed they would match the intention of allowing 2% loss to survive, but not 4% or higher. They also fix the issues I listed when I posted my initial suggestion. Yeah, this sounds like a home run to me.
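
For anyone curious what that kind of testing boils down to, here is a rough Monte Carlo sketch in the same spirit. To be clear, this is not my original test code; it just assumes the plain alpha/beta update with weight 1 per audit and the proposed parameters, and models data loss as each audit independently hitting a lost piece with the given probability:

```python
# Rough Monte Carlo sketch: how long does a node with a fixed data-loss rate last?
# Assumes weight 1 per audit and the proposed parameters; the production code may differ.
import random

def audits_until_dq(loss_rate, lam=0.999, alpha=1000.0, beta=0.0,
                    threshold=0.96, max_audits=1_000_000):
    """Return the audit count at which the score first drops below the DQ threshold,
    or None if the node survives the whole run."""
    for n in range(1, max_audits + 1):
        failed = random.random() < loss_rate           # this audit hits a lost piece
        alpha = lam * alpha + (0.0 if failed else 1.0)
        beta = lam * beta + (1.0 if failed else 0.0)
        if alpha / (alpha + beta) < threshold:
            return n
    return None

random.seed(1)
for p in (0.02, 0.04, 0.10):
    print(p, audits_until_dq(p))
# Typical outcome: 2% loss survives the whole run, 4% hovers at the threshold and
# eventually crosses it, and 10% is DQ'd within a few hundred audits.
```

The 0.96 threshold sits right at the long-run score of a node with 4% loss (the score settles around 1 minus the loss rate), which is why 2% clears it comfortably and 4% or more does not.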

I agree that there probably need to be other things in place to prevent temporary issues like this from DQ’ing the node. Implementing the timeout check would help with that. I think I’ve also seen this happen when permission issues prevented the node from reading the files. I’m not sure why the readability check didn’t catch those in some examples posted around the forum, but that can be solved by other means.

It will happen about 4x slower. On large nodes this isn’t a fix, but it helps a little; it can still happen within an hour on the largest nodes, for the largest satellites on those nodes. But these changes weren’t meant to fix that; that they help at all is just a small bonus.

I think if the timeout implementation for the readability check is in place, those scenarios will be solved as well.

With the proposed changes a 10% loss would be a guaranteed DQ. The current formula is indeed too forgiving. I’m sorry if that would cost you this specific node. But I don’t think that’s unfair.

3 Likes

At least it’s not faster. IMO there should not be a situation where a node is irreversibly disqualified in less than a couple of days. Reversible suspension etc can happen quickly, but not the irreversible DQ.

5 Likes

I’d add: not for ‘older’ nodes. The node’s age and size should help indicate whether DQ is really the right option.

1 Like

Maybe this would also help to find a better DQ approach:

Because, as I read it today, there is a limit on how many audits can be performed.

Maybe in the future, when more audits can be performed, a better DQ process will be possible. For example:
If a node gets disqualified but the issue was something the node operator could not see, or the disqualification was too fast, the operator could apply for requalification. This could mean that the node gets no ingress for something like a month and gets hammered with audits around the clock to re-establish its reliability. I cannot make exact proposals here, but the idea should be clear.

1 Like

OR, instead of irreversible DQ, the node gets suspended after failing X audits. The operator can apply for un-suspension, but then the satellite checks for those same files that the node failed the first time.

1 Like

However, then the node operator knows which files are relevant, and a malicious actor might use this information to play games. I don’t know, but maybe.
So I think an approach where the node operator does not know what will get audited is required.

Something else should be audited as well, but if the claim is that “all the files are there, just the USB cable fell out” or similar, then the node should be able to produce the files it failed before.
If those same files were not rechecked, then a node that has actually lost them could still pass the other audits.
I don’t have a problem with nodes being disqualified for actually losing data. My problem is nodes that are disqualified for timeouts or some system problems without giving reasonable time for the operator to fix those problems. If Storj wants the node operators to be regular people instead of datacenters, then there should be no expectation of a datacenter-like reaction time. I do not have staff on-call when I go to sleep or go on vacation, even though my setup is probably one of the more datacenter-like otherwise.

What I am saying is that it might not be enough to recheck only those pieces, because a malicious actor could simply pull them from a log and would know exactly what is going to be audited in order to get reinstated.
To get a node that lost its reputation back into the network, there probably needs to be more than that.
Not because of the good participants, where indeed a loose USB cable got fixed and the node is in perfect shape, but because of the bad actors who might tamper with the data.

1 Like

Sure, but rechecking the same files would prevent someone who actually lost them from being reinstated. Additional checks of previously-unchecked files may be needed as well.