Tuning audit scoring

I’m back!

After re-reviewing this whole thread, I’ve re-run the simulations with two changed assumptions and one added evaluation criterion:

  1. It’s fine to start with an alpha >= 1/(1-lambda).
  2. It’s fine to reset all node reputations as a one-time change.
  3. An added evaluation criterion: how long does it take to get from a perfect score to a DQ when (apparent) data loss is suddenly 100%? We want this to be more than 10 audits, but probably less than hundreds of audits.

Given those, I don’t see any set of parameters that does significantly better than @BrightSilence’s:

  • lambda=0.999
  • DQ threshold=0.96
  • initial alpha=1000

With those parameters, it takes around 40 consecutive failed audits to go from a perfect score to a DQ. On a busy node that is still not a very long time, but it is at least several times longer than before. If the writeability check timeout is added as well, this seems like it could be an acceptable situation.
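For anyone who wants to check the arithmetic, here is a minimal sketch of that simulation, assuming the standard beta-reputation update (each audit decays both counters by lambda and adds 1 to alpha on success or to beta on failure, with score = alpha / (alpha + beta)):

```python
LAMBDA = 0.999
DQ_THRESHOLD = 0.96

# Perfect-score steady state under the proposed parameters:
# with all-successful audits, alpha converges to 1 / (1 - lambda) = 1000.
alpha, beta = 1000.0, 0.0

failures = 0
while alpha / (alpha + beta) >= DQ_THRESHOLD:
    alpha = LAMBDA * alpha        # failed audit: no success weight added
    beta = LAMBDA * beta + 1.0    # failure weight accumulates
    failures += 1

print(failures)  # -> 41
```

Because alpha + beta sits exactly at the 1/(1 - lambda) fixed point, each consecutive failure simply multiplies the score by lambda, so the count is ceil(log(0.96) / log(0.999)) = 41, matching the "around 40" figure above.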

The grace period I was suggesting (no disqualifications before 50 audits have been completed) no longer makes a significant difference with this high initial alpha, so I’m taking that out.

So the new change plan is:

  1. Change lambda to 0.999
  2. Use alpha=1000, beta=0 for new nodes
  3. Reset all node reputations to alpha=1000, beta=0
  4. Change the DQ threshold to 0.96

I think we can even do all of these at the same time. What does everyone think?


With the new numbers, how quickly (in hours) can a node be disqualified with no real loss of data, but with apparent loss of data (frozen IO subsystem, overload leading to timeouts etc)?


This change isn’t really a fix for nodes getting DQ’d due to system issues.
Rather, it ensures that in the case of minor data loss the audit score fluctuation is greatly reduced, to avoid near-random DQs in such cases.

I think the points are best described by Bright in the post below.

I still think we should have some sort of local fuse-type trigger that shuts down a node whose audit score starts to drop rapidly, but that is really a local thing rather than a network thing.

Of course, with a too-effective fuse, nodes would never get DQ’d, which is partly why I never bothered making a node fuse script; it would be a bad feature to introduce, even though it’s very possible to script something like that.

I think it would be great if there were a fuse in the node itself, so that one got, say, 3 tries to fix the issue before the node died.

For example, the node fuse feature could take a node offline after a rapid, unexpected 10% drop in audit score.

But it will change the situation for people affected by the problems described by @Pentium100. I believe it would be worth checking that as well.

Also, it would be nice to know how soon, in the case of a real problem, the SNO will get feedback that their fix was correct after they fix it.

The change now suggested is, as far as I can tell, almost exactly what @BrightSilence suggested and gave his reasons for in the post I just linked.

It improves the node’s audit behavior in the case of failed audits. If you have experienced data loss on a node you will know that the audit score jumps around, not a little but a lot, and if it ever touches 60% the node is DQ’d instantly.

This change makes the audit score less volatile, so it won’t just bounce back to 100%, drop to 90%, and return to 100% maybe an hour later.
It also, as thepaul says, increases the number of consecutive failed audits required for a DQ,
which in the current system can be as few as roughly 9.
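That old-vs-new gap can be sketched with a closed form. Starting from the perfect-score steady state (alpha = 1/(1 - lambda), beta = 0), each consecutive failure multiplies the score by lambda. Taking the old parameters to be roughly lambda = 0.95 with a 0.6 threshold (an assumption, not confirmed in this thread):

```python
import math

def failures_to_dq(lam: float, threshold: float) -> int:
    """Consecutive failed audits needed to fall below the DQ threshold,
    starting from the perfect-score steady state (alpha = 1/(1 - lam),
    beta = 0), where each failure multiplies the score by lam."""
    return math.ceil(math.log(threshold) / math.log(lam))

print(failures_to_dq(0.999, 0.96))  # proposed parameters -> 41
print(failures_to_dq(0.95, 0.60))   # assumed old parameters -> 10
```

That lines up with the roughly 9-10 audits in the current system versus ~40 after the change.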

So really, no matter which viewpoint one has, this change is a big improvement in all cases.

The only nodes that will suffer from this change are the ones that have been lucky enough to survive with more than 4-5% data loss.

I have also had a lot of weird things happen with my storage over the few years I’ve been running storage nodes; even with loss of contact with the storage, and even when the node doesn’t shut down, it doesn’t seem to affect the audit score at all.

I’ve had nodes run for 4-6 hours without any contact with the storage media, because it was stalled out due to overload, and barely even seen a drop in suspension score from it.
So I don’t know how real the problem of unfair DQ actually is, outside of the ones that happen due to audit volatility.

I think it looks great; couldn’t have hoped for better.
Also pleased that this hasn’t been rushed.
It will be interesting to see what Bright says, but as it’s basically his suggestion, I doubt he will disagree with this change.
But I might be missing something; I haven’t exactly done the deep dive into this that he did.

Yeah, damaged or corrupted files aren’t really forgiven. It took about 18 months before a node on which I lost a few files stopped randomly dropping to a 95% audit score.
I ran the node for 2 minutes and then did an rsync without removing the --delete parameter, so the newly uploaded files were removed.

But it has now returned to 100%, AFAIK; I rarely check on it, so it might drop for short periods, but it’s been a long time since I noticed it.

Yeah, this is true, but then we get into some of the subjects discussed earlier in this thread; the current algorithm isn’t one that can be easily changed, though I forget the exact reasons.

So though the parameters can be changed, the method remains the same, and sadly that method means larger nodes will be DQ’d faster because they receive more audits in less time.

Yet like I said, I’ve had plenty of system issues and have also dared to experiment at times, letting my 17TB node sit stalled for hours to see if it would actually recover, but it never actually failed an audit because of it.

So I’m not convinced it’s a real problem.
It could simply be that those who actually get DQ’d had bad storage and just think they were unfairly DQ’d.

If there is one thing I’ve found while running larger storage setups, it is how unreliable and completely random HDDs can be.
So in cases of unreliable storage behavior, it’s possible that people might see a DQ without the disk actually being broken, due to, let’s say, a bad cable corrupting writes.

Which is why I always recommend that larger, multi-year-old nodes run with redundancy, but I don’t recommend it for new nodes.

Again, this change of the audit score algorithm parameters should make the score more stable, and thus it will be easier to see when it starts to drop, because it won’t jump straight back to 100%.

So yeah, a node fuse-type feature might be good, but I’m not sure it’s really needed.
I haven’t seen any signs that it is, and I’m running quite a number of nodes.
But it’s something I have worried about, which is why I tried to determine whether it was something I should be worried about, and these days I don’t really worry about that part.

Thanks @thepaul for getting back to us with this. I’m happy to see you liked my suggested parameters.

It won’t surprise you to hear I think this is great. I think it ticks all the boxes with the exception of still DQ’ing nodes a little too fast when temporary issues make them fail all audits. But to be honest, I think it’s impossible to fix that without making DQ too slow for legitimately bad nodes, by just changing the parameters.

I haven’t redone the testing on this now, since I already tested these exact parameters when I posted my previous suggestion, and it showed they would match the intention of allowing 2% loss to survive, but not 4% or higher. It also fixes the issues I listed when I posted my initial suggestion. Yeah, this sounds like a home run to me.
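A quick way to sanity-check that 2%-vs-4% intention is to simulate audits failing at a fixed random rate (a sketch, assuming the standard alpha/beta update with the proposed lambda = 0.999, 0.96 threshold, and initial alpha = 1000):

```python
import random

LAMBDA = 0.999
DQ_THRESHOLD = 0.96

def survives(loss_rate: float, audits: int = 50_000, seed: int = 1) -> bool:
    """Simulate audits that each fail with probability loss_rate and
    report whether the score stays above the DQ threshold throughout."""
    rng = random.Random(seed)
    alpha, beta = 1000.0, 0.0  # proposed initial values
    for _ in range(audits):
        fail = rng.random() < loss_rate
        alpha = LAMBDA * alpha + (0.0 if fail else 1.0)
        beta = LAMBDA * beta + (1.0 if fail else 0.0)
        if alpha / (alpha + beta) < DQ_THRESHOLD:
            return False
    return True

print(survives(0.02))  # score settles near 0.98 -> True (survives)
print(survives(0.05))  # score settles near 0.95 -> False (DQ)
```

The long-run score is roughly the exponentially weighted success rate, so it settles near 1 - loss_rate; with a 0.96 threshold, anything much above 4% loss is an eventual DQ.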

I agree that there probably need to be other things in place to prevent temporary issues like this from DQ’ing the node. Implementing the time out check would help with that. I think I’ve also seen this happen when permission issues prevented the node from reading the files. I’m not sure why the readability check didn’t catch those in some examples posted around the forum. But that can be solved by other means.

It will happen about 4x slower. On large nodes this isn’t a fix. But it helps a little. It can still happen in an hour on the largest nodes for the largest satellites on that node. But these changes weren’t meant to fix that. That they help is just a small bonus.

I think if the timeout implementation for the readability check is in place, those scenarios will be solved as well.

With the proposed changes a 10% loss would be a guaranteed DQ. The current formula is indeed too forgiving. I’m sorry if that would cost you this specific node. But I don’t think that’s unfair.


At least it’s not faster. IMO there should not be a situation where a node is irreversibly disqualified in less than a couple of days. Reversible suspension etc can happen quickly, but not the irreversible DQ.


I’d add: not for ‘older’ nodes. The node’s age and size should help indicate whether DQ is really the right option.


Maybe this would also help to find a better DQ approach:

Because as I read it today, there is a limit on how many audits can be performed.

Maybe in the future, when more audits can be performed, a better DQ process becomes possible. For example:
If a node gets disqualified but the issue was something the node operator could not see, or if disqualification was too fast, the node operator could apply for requalification. This could mean that the node gets no ingress for, say, a month and gets hammered with audits around the clock to re-establish its reliability. I cannot make exact proposals here, but the idea should be clear.


OR, instead of irreversible DQ, the node gets suspended after failing X audits. The operator can apply for un-suspension, but then the satellite checks for those same files that the node failed the first time.


However, then the node operator knows which files are relevant, and a malicious actor might use this information to play games. I don’t know, but maybe.
So I think an approach where the node operator does not know what will get audited is required.

Something else should be audited as well, but if the claim is that “all the files are there, just the USB cable fell out” or similar, then the node should be able to produce the files it failed before.
If those same files were not rechecked, then a node that has actually lost them could still pass the other audits.
I don’t have a problem with nodes being disqualified for actually losing data. My problem is nodes that are disqualified for timeouts or some system problems without giving reasonable time for the operator to fix those problems. If Storj wants the node operators to be regular people instead of datacenters, then there should be no expectation of a datacenter-like reaction time. I do not have staff on-call when I go to sleep or go on vacation, even though my setup is probably one of the more datacenter-like otherwise.

What I am saying is that it might not be enough to recheck only those pieces, since a malicious actor can simply pull them from a log and know exactly what is going to be audited in order to get reinstated.
To get a node that lost its reputation back into the network, there probably needs to be more.
Not because of the good participants, where indeed a loose USB cable got fixed and the node is in perfect shape, but because of the bad actors who might tamper with the data.


Sure, but rechecking the same files would prevent someone who actually lost them from being reinstated. Additional checks of previously-unchecked files may be needed as well.

Well, shutting down a node before it gets DQ’d is completely within the scope of what node operators can do, using logs and scripts.

Something as simple as: if 3 audits fail within 1 hour (or whatever), shut the node down.
I thought about doing this early on, because I was worried about immediate DQ if something went wrong.
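A minimal sketch of such a fuse, for illustration only: the log path, container name, and the pattern matched against log lines are assumptions, not the confirmed storagenode log format, so check your node's actual logs before relying on anything like this.

```python
import re
import subprocess
import time

# Hypothetical settings; adjust to your own setup.
LOG_PATH = "/var/log/storagenode.log"
CONTAINER = "storagenode"
FAIL_PATTERN = re.compile(r"GET_AUDIT.*failed")  # assumed log line shape

def fuse_tripped(fail_times, now, max_fails=3, window=3600):
    """True once at least max_fails audit failures fall inside the window."""
    return sum(1 for t in fail_times if t > now - window) >= max_fails

def watch(max_fails=3, window=3600):
    """Tail the log and stop the container after too many audit failures."""
    fail_times = []
    with open(LOG_PATH) as f:
        f.seek(0, 2)  # only react to lines written after we start
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            if FAIL_PATTERN.search(line):
                fail_times.append(time.time())
                if fuse_tripped(fail_times, time.time(), max_fails, window):
                    # Trip the fuse: take the node offline before the
                    # satellite has a chance to disqualify it.
                    subprocess.run(["docker", "stop", CONTAINER], check=False)
                    return
```

The windowing check in fuse_tripped is the only non-trivial part; everything else is a plain polling tail plus a docker stop, and watch() would be started from whatever wrapper or service the operator prefers.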

But lots of things have gone wrong, and I haven’t seen any signs of this being a real issue. I still think it’s mostly people being worried, or people being justifiably DQ’d without understanding that their storage solution or hardware was flawed.

But if we as SNOs introduce such mechanics into the network, it is a bad thing for the network, as nodes would become nearly impossible to DQ, which might have unforeseen effects.

However, it’s just a matter of time, I suppose, until somebody publishes a script for it.

I’m not worried about that. If there is a real issue, the node would either have to stay offline or continue to fail audits. It wouldn’t be able to avoid DQ anyway. And if the node is offline, the data will also be marked as unhealthy and repair will already kick in. So the network will be completely fine either way.

Before the readability and writability checks were in place, sudden DQ did happen. So I understand people being worried. By far the biggest cause has been fixed, but a stuck IO system can still cause issues. That is already in the process of being fixed too. I think after that we may have seen the last of this. And if not, we’ll keep an eye out for it and make sure Storj Labs knows if there are other issues to address.


It sounds like we have a consensus, at least as far as the present suggestion is concerned. I’ll get it implemented!


The question here is: should a node that has proven to lose files be allowed back into the network?
I think the idea of reinstating a node into the network should only cover cases where files have not been lost.

I don’t think it should. That’s why the recheck of the same files. If reinstatement only required passing a lot of audits of random files (not the same ones that previously failed) then a node that has actually lost files could be reinstated.
If the requirement is to pass the same audits that failed before, then only a node that has not actually lost data (USB cable fell out etc) could be recovered.

Unless, of course, the failed audits were because of a node or satellite software bug (satellite deletes a file and then tries to audit it etc).