New audit scoring is live

What does it mean at first glance?

  • the threshold is now 96% instead of 60%;
  • 40 consecutive audit failures are enough to go from a perfect score to disqualification (it was 10 audits before; see the sketch below);
  • only GET_AUDIT and GET_REPAIR affect the audit score.
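For anyone curious how those numbers fit together, here is a minimal sketch of an α/β-style reputation update with a forgetting factor, which is how I understand the audit score to work. The λ values of 0.95 (old) and 0.999 (new) are my assumptions, chosen because they reproduce the 10- and 40-audit figures above; they are not taken from the official config.

```python
# Hedged sketch: how many consecutive failed audits it takes to fall from a
# steady "perfect" score to below the disqualification threshold.
# The lambda values are assumptions that happen to reproduce the 10/40 figures.

def failures_until_dq(lam, threshold, weight=1.0):
    # After many successful audits, alpha settles at weight / (1 - lam), beta at 0.
    alpha, beta = weight / (1.0 - lam), 0.0
    failures = 0
    while alpha / (alpha + beta) >= threshold:
        alpha = lam * alpha            # failed audit: no success weight added
        beta = lam * beta + weight     # failure weight added instead
        failures += 1
    return failures

print("old settings:", failures_until_dq(lam=0.95, threshold=0.60))   # -> 10
print("new settings:", failures_until_dq(lam=0.999, threshold=0.96))  # -> 41, roughly the 40 quoted above
```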
3 Likes

Sorry, what? :face_with_monocle:

40 audits or audit runs? What is the unit here? Meaning, a single audited piece with an issue counts as one audit failure, right?

How many audit checks are done, and how often? Does it mean that large nodes can be disqualified almost immediately?

Is there a summary of the 107-reply post? :rofl:

GET_AUDIT and GET_REPAIR affect the audit score the same way. So you can count how many audits are happening on your node. 40 consecutive failed GET_AUDIT and GET_REPAIR requests will be enough to disqualify your node.

1 Like

That is awesome news…

and for those not keeping up… this means audit scores should be much more stable…
tho nodes with more than 4% corrupted data will be DQ…

The new audit score is to avoid random unjustified DQ’s of nodes.
something like that. Should be good for everyone… except if one has a node that has been lucky enough to survive with more than 4% corrupted data… if so, then it's bad news :smiley:

2 Likes

Never imagined that would or could be possible: RANDOM + UNJUSTIFIED. Sounds like Russian roulette. Wait, that term is a bit loaded these days… :crazy_face: It's like, nah, I don't want to go to work today and you're guilty. :joy:

Thank you both.

Technically, behind the scenes? Or as displayed on the dashboard too?

If the dashboard keeps displaying the “real” value, it’s a problem IMHO: it was already weird and difficult to explain to newcomers that a node gets DQed when dropping below the arbitrary 60% (instead of 0% as anyone would assume). If the new value is 96%, this score is hardly understandable anymore on the dashboard, is it? :thinking:

1 Like

So far, it does not appear that most people assume that the audit reputation can go down to 0% before a disqualification event. Think of the audit reputation as a measure of what percentage of data the satellite thinks your node has stored correctly.

But if you have more concrete recommendations for how to improve the UI, this forum would be a good place for that!

I can give that a go.

First off, DQ can indeed happen after 40 consecutive audit failures. This is up from only 10 before the change. But this wasn’t necessarily the intended goal of the changes. (@Alexey it might be good to mention that it is an increase from only 10 before in the top post)

From the original topic, these are the issues I outlined:

Bottom line, the scores were extremely erratic even with minor data loss. Even if that data loss never changed, scores were jumping all over the place, giving node operators the impression that things got worse or better, while in fact the situation remained the same. And some nodes with significant data loss could survive, even though they shouldn’t have been allowed to.

We went through a lot of back and forth, but the new approach we landed on fixes all of these issues. The score now has a longer memory of old audits, making it change less rapidly and appear much more stable. A score of 96 also corresponds much more closely to the allowed data loss, making it a more meaningful number as well.
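As a rough illustration of that longer memory (my own sketch, not the official simulation; the λ values of 0.95 and 0.999 and the weight of 1 are assumptions consistent with the 10 vs. 40 consecutive-failure figures), here is how far the score wanders for a node with a fixed 2% data loss under the short-memory and long-memory settings:

```python
# Hedged sketch: the same fixed data loss, scored with a short-memory
# forgetting factor vs. a long-memory one, to show why the new score is
# far more stable even though the underlying loss never changes.
import random

def score_range(lam, loss_fraction, audits=10_000, weight=1.0, seed=1):
    rng = random.Random(seed)
    alpha, beta = weight / (1.0 - lam), 0.0   # start from a steady perfect score
    scores = []
    for _ in range(audits):
        if rng.random() < loss_fraction:      # this audit hits a lost piece
            alpha, beta = lam * alpha, lam * beta + weight
        else:                                 # this audit succeeds
            alpha, beta = lam * alpha + weight, lam * beta
        scores.append(alpha / (alpha + beta))
    return min(scores), max(scores)

for lam in (0.95, 0.999):
    lo, hi = score_range(lam, loss_fraction=0.02)
    print(f"lambda={lam}: score stayed between {lo:.3f} and {hi:.3f}")
```

Under the same 2% loss, the short-memory score covers a much wider range than the long-memory one, which is exactly the erratic behavior described above.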

The adjustments we ended up with were aimed at fixing the problems listed above, while also following these guidelines set out by @thepaul:

Basically the idea was to let all nodes with up to 2% data loss survive, but DQ all nodes with 4% loss or more. In between those numbers it’s kind of luck of the draw.
A node with 3% loss could survive for a long time, but may still eventually be disqualified if it runs into some bad luck causing more consecutive audits of lost pieces than you would normally expect.
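To illustrate that luck-of-the-draw zone, here is a small Monte Carlo sketch under the same assumptions as before (λ = 0.999 and a 96% threshold, values I am assuming because they match the numbers in this thread; this is not @thepaul's script):

```python
# Hedged sketch: fraction of simulated nodes disqualified within a window of
# audits, at different fixed data-loss levels.
import random

def gets_disqualified(loss_fraction, audits=20_000, lam=0.999,
                      threshold=0.96, weight=1.0, seed=0):
    rng = random.Random(seed)
    alpha, beta = weight / (1.0 - lam), 0.0   # start from a steady perfect score
    for _ in range(audits):
        if rng.random() < loss_fraction:      # audit lands on a lost piece
            alpha, beta = lam * alpha, lam * beta + weight
        else:
            alpha, beta = lam * alpha + weight, lam * beta
        if alpha / (alpha + beta) < threshold:
            return True
    return False

for loss in (0.02, 0.03, 0.04):
    runs = 100
    dq = sum(gets_disqualified(loss, seed=i) for i in range(runs))
    print(f"{loss:.0%} loss: {dq}/{runs} simulated nodes disqualified")
```

With these assumptions you should see essentially no disqualifications at 2% loss, a small but nonzero fraction at 3%, and nearly all nodes gone at 4%; the longer the simulated audit history, the more of the 3% nodes eventually get caught, which matches the description above.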

So in summary:

  • The new scores more accurately reflect the actual percentage of data loss on your node
  • The scores are much more stable, meaning that a recovery of more than 1% reflects an actual improvement, and a drop of more than 1% means something has actually gotten worse
  • There is no longer a lot of luck involved in whether your node will survive or not; DQ now closely depends on actual data loss
  • Nodes with more than the acceptable data loss are no longer allowed to survive
  • As a small bonus, nodes with temporary issues causing audit failures take 4x longer to be disqualified (this is not a complete solution for temporary failures, but it helps nonetheless)

Hope that clears things up!

And thanks again @thepaul (and everyone else working on this behind the scenes) for putting so much effort into this and working closely with the node operator community to arrive at these changes. I’m excited to see this be implemented now! I think it will be great for node operators as well as for the network.

5 Likes

Some more highlights.

This is what scores would do over time with 15% data loss, before the implemented changes.

This is a graph of scores with 4% data loss with the new changes.

Note: the threshold line is at 95 here. This was bumped to 96 to better match the intended allowed data loss.

Here are some tests run with the new settings in a simulation script built by @thepaul.

That script can be found in this post by him: Tuning audit scoring - #16 by thepaul

And these are the changes we eventually landed on for those who want to dive into the math:

5 Likes

Amazing summary, thank you! @BrightSilence

Does the point above include the case where there was minor data loss in the past and its share of the total is shrinking due to new data uploaded since then?

Maybe here on the forum as this piece of info has been repeated many times. Most SNOs aren’t on the forum though.

We had extensive discussions on that matter 2 years ago. But the subject did not attract much attention from the community in the end (vote wise).

You might want to have a peek though :slight_smile: Here:

Or, about 10 minutes for my node. That’s … fast.

1 Like

but the odds of picking 40 pieces that are all bad are astronomically low on a system with less than 4% data loss…

the odds should be something like 4% of 4% of 4% … 40 times in sequence.
so 4% x 4% = 0.16%; that’s the 2nd, and the 3rd is 0.0064%

the 9th would be 0.0000000000262%
the 20th would be 1.09e-26%; basically just think of the e-26 as roughly how many zeros go in front.

now is a perfect time to introduce some scale.
let’s take something we can maybe loosely imagine:
the estimated number of sand grains on Earth.

9.6 x 10^13 x (8 x 10^12) ≈ 7.7 x 10^26 sand grains

dunno how accurate that is, but it seems around what I would have expected.

so from now on, think of it as a 4% chance of picking the same grain of sand twice in a row.
I think… or less, though this example doesn’t last for long, because in just a step or two even a single grain is way too much

and the 40th step takes us to 1.2e-54%
so the chance that your node with 4% data loss attempts an audit on lost data 40 times in a row would be

0.000000000000000000000000000000000000000000000000000000000012%
or something like that…
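A quick check of that figure, assuming each audit is an independent 4% chance of landing on a lost piece (a simplification, since real audits aren’t fully independent draws):

```python
# Probability of 40 consecutive audits all hitting lost pieces on a node
# with 4% data loss, under the independence assumption.
p = 0.04 ** 40
print(p)        # ~1.2e-56 as a fraction
print(p * 100)  # ~1.2e-54 expressed as a percentage
```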

it basically cannot happen …

We have at least one SNO affected by this new scoring system. They had more than 4% data loss in the past (they are also running dozens of nodes on one server, so dozens of nodes were affected); the disqualifications happened after 4.5 hours in their case. The reason was “file not found” during GET_REPAIR.

1 Like

I am not worried about permanent data loss disqualifying my node. In that case it would not matter to me if it was 10 minutes or 10 days, there would be nothing I could do about it.

On the other hand, there may be situations where the data is inaccessible temporarily (I/O frozen, controller failure, backplane failure, USB cable fell out, node started with the wrong data directory, etc.); in this case the data would still be there, but every attempt to access it would fail, so I would only have 10 minutes or so to notice it and react.

again, this new audit score is not supposed to fix such issues; there are other failsafes to take care of that…

even running with the old audit score system, allowing only 10 consecutive audit failures, I haven’t had any issues, even when my ZFS storage was stalled out for hours.

also, I don’t think inaccessible counts as failed exactly; if it did, I would have had nodes fail long ago.

I get that you are worried about such things, as am I… but thus far I haven’t seen any signs that I should be worried about that on my setup.

however, it is very worrying when the node keeps running while the storage is basically inaccessible…

Yeah, I am worried more about this than about losing 4% of data (which would be something like 900GB for my node). It is possible, but it is more likely that my entire pool would fall apart than that I would lose “just” 900GB of data.
900GB is quite a lot; I could even back up my node and my backups would be valid for something like a month.

On the other hand, it is possible to rsync a node to a different server, run the new one for a while and then start the old one for some reason (forgetting to turn off autostart) and not notice it for 10 minutes.
Or, having more than one node on a VM, start a node and give it the wrong directory and not notice that for 10 minutes.

yeah, I did start the same node twice one time… then they will share uptime… lol
basically it will switch between the two instances, because only one will be accepted by the network at a time…
so uptime will start to drop on both, afaik.

and ofc after a while the files will be slightly different, but again… the odds of those being audited would be slim to none in most cases except for new nodes.

if you give a node the wrong folder, it will refuse to start; same if it is write-protected or such…
or it will shut down almost immediately because it fails its write check.

Imagine I have two nodes A and B on the same VM and manage to start node A with the directory of node B.

In any case, the permanent disqualification (apparently very fast) for a temporary problem is what I am worried about the most. If I actually lose data, then DQ is the correct response, and I would probably be more upset about losing the other data in the pool than about the node.