and for those not keeping up… this means audit scores should be much more stable…
though nodes with more than 4% corrupted data will be DQed…
The new audit score is meant to avoid random, unjustified DQs of nodes.
something like that. It should be good for everyone… except if one has a node that has been lucky enough to survive with more than 4% corrupted data; if so, then it's bad news.
Never imagined that would or could be possible: RANDOM + UNJUSTIFIED. Sounds like Russian roulette. Wait, that's an equivocal thing to say these days… It's like: nah, I don't want to go to work today, and you're the one who's guilty.
Technically behind the scene? Or as displayed on the dashboard too?
If the dashboard keeps displaying the “real” value, it’s a problem IMHO: it was already weird and difficult to explain to newcomers that a node gets DQed when dropping below the arbitrary 60% (instead of 0%, as anyone would assume). If the new threshold is 96%, this score is hardly understandable on the dashboard anymore, is it?
So far, it does not appear that most people assume that the audit reputation can go down to 0% before a disqualification event. Think of the audit reputation as a measure of what percentage of data the satellite thinks your node has stored correctly.
But if you have more concrete recommendations for how to improve the UI, this forum would be a good place for that!
First off, DQ can indeed happen after 40 consecutive audit failures. This is up from only 10 before the change, but that wasn’t necessarily the intended goal of the changes. (@Alexey it might be good to mention in the top post that this is an increase from only 10 before.)
From the original topic I outlined these issues:
Bottom line, the scores were extremely erratic even with minor data loss. Even if that data loss never changed, scores were jumping all over the place, giving node operators the impression that things got worse or better, while in fact the situation remained the same. And some nodes with significant data loss could survive, even though they shouldn’t have been allowed to.
We went through a lot of back and forth, but the new approach we landed on fixes all of these issues. The score now has a longer memory of past audits, making it change less rapidly and appear much more stable. The threshold of 96% also much more closely reflects the allowed data loss, making it a more meaningful number as well.
The adjustments we ended up arriving at were aimed both at fixing the problems listed before and at meeting these guidelines set out by @thepaul:
Basically, the idea was to let all nodes with up to 2% data loss survive, but DQ all nodes with 4% loss or more. In between those numbers, it’s kind of the luck of the draw.
A node with 3% loss could survive for a long time, but may still eventually be disqualified if it runs into some bad luck causing more consecutive audits of lost pieces than you would normally expect.
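The “luck of the draw” band can be illustrated with a rough Monte Carlo sketch. This assumes a forgetting-factor (beta-reputation) style score, which is the general shape of the model discussed in this thread; the λ value, the warmed-up history, and the 96% threshold are illustrative assumptions here, not the satellites’ exact parameters.

```python
import random

LAMBDA = 0.999    # forgetting factor (assumed): long memory => stable score
THRESHOLD = 0.96  # assumed disqualification threshold

def survives(loss_fraction, n_audits=20_000, rng=None):
    """Audit one node n_audits times; each audit fails with
    probability loss_fraction. Returns False if the score ever
    drops below THRESHOLD (i.e. the node would be DQed)."""
    rng = rng or random.Random()
    # Start with a warmed-up, perfect history (alpha + beta ~ 1/(1-LAMBDA)).
    alpha, beta = 1000.0, 0.0
    for _ in range(n_audits):
        v = 1.0 if rng.random() > loss_fraction else -1.0
        alpha = LAMBDA * alpha + (1.0 + v) / 2.0
        beta = LAMBDA * beta + (1.0 - v) / 2.0
        if alpha / (alpha + beta) < THRESHOLD:
            return False
    return True

rng = random.Random(42)
for loss in (0.02, 0.03, 0.04, 0.05):
    alive = sum(survives(loss, rng=rng) for _ in range(20))
    print(f"{loss:.0%} loss: {alive}/20 simulated nodes survive 20k audits")
```

In runs like this, nodes with 2% loss essentially always survive, nodes at 4–5% get disqualified, and 3% is genuinely down to luck, matching the behavior described above.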
So in summary:
The new scores more accurately show the actual percentage of data loss on your node
The scores are much more stable, meaning that a recovery of more than 1% means there has been an actual improvement. A drop of more than 1% means something has gotten worse.
There is no longer a lot of luck involved in whether your node will survive or not, DQ now closely depends on actual data loss
Nodes with more than acceptable data loss are no longer allowed to survive
As a small bonus, nodes with temporary issues causing audit failures take 4x longer to be disqualified (this is not a complete solution to the temporary problem issue, but it helps nonetheless)
Hope that clears things up!
And thanks again @thepaul (and everyone else working on this behind the scenes) for putting so much effort into this and working closely with the node operator community to arrive at these changes. I’m excited to see this be implemented now! I think it will be great for node operators as well as for the network.
but the odds of picking 40 pieces that are all bad are astronomically low on a system with less than 4% data loss…
the odds should be something like 4% of 4% of 4%… 40 times in sequence.
so 0.04 × 0.04 = 0.16%; that’s the 2nd in a row, and the 3rd is 0.0064%.
the 9th would be about 0.0000000000262% (2.62 × 10^-11 %).
the 20th would be about 1.1 × 10^-26 %; basically, think of the e-26 as how many zeros go in front.
now is a perfect time to introduce some scale.
let’s take something we can maybe loosely imagine:
the estimated number of sand grains on Earth.
9.6 × 10^13 × 8 × 10^12 ≈ 7.7 × 10^26 sand grains.
dunno how accurate that is, but it seems around what I would have expected.
so from now on, imagine a 4% chance of picking the same grain of sand twice in a row…
or less, though this example doesn’t last long, because within a step or two a single grain is already way too much.
the 40th step takes us to about 1.2 × 10^-54 %.
so the chance that your node with 4% data loss hits lost data on 40 audits in a row would be about that…
or something like that…
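The arithmetic above is just pⁿ with p = 0.04. Here’s a quick Python check (assuming, as a simplification, that each audit independently has a 4% chance of hitting lost data):

```python
# Chance that a node with 4% data loss fails n audits in a row,
# assuming each audit independently hits a lost piece with p = 0.04.
p = 0.04

for n in (2, 3, 9, 20, 40):
    print(f"{n:2d} in a row: probability {p**n:.3e}  ({p**n * 100:.2e} %)")
```

The last line works out to a probability of about 1.2 × 10⁻⁵⁶, i.e. roughly 1.2 × 10⁻⁵⁴ %.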
We have at least one SNO affected by this new scoring system. They had more than 4% data loss in the past (they also run dozens of nodes on one server, so dozens of nodes were affected); the disqualifications happened after 4.5 hours in their case. The reason was “file not found” during GET_REPAIR.
I am not worried about permanent data loss disqualifying my node. In that case it would not matter to me if it was 10 minutes or 10 days, there would be nothing I could do about it.
On the other hand, there may be situations where the data is inaccessible only temporarily (I/O frozen, controller failed, backplane failed, USB cable fell out, node started with the wrong data directory, etc.). In this case the data would still be there, but every attempt to access it would fail, so I would only have 10 minutes or so to notice it and react.
Yeah, I am worried more about this than about losing 4% of data (which would be something like 900 GB for my node). It is possible, but it is more likely that my entire pool would fall apart than that I would lose “just” 900 GB of data.
900GB is quite a lot, I could even back up my node and my backups would be valid for something like a month.
On the other hand, it is possible to rsync a node to a different server, run the new one for a while, and then start the old one again for some reason (forgetting to turn off autostart) and not notice it for 10 minutes.
Or, having more than one node on a VM, start a node and give it the wrong directory and not notice that for 10 minutes.
yeah, I did start the same node twice one time… then the two instances share uptime… lol
basically it will alternate between them, because only one will be accepted by the network at a time…
so uptime will start to drop on both instances, afaik.
and of course after a while the files will diverge slightly, but again… the odds of those being audited would be slim to none in most cases, except for new nodes.
if you give a node the wrong folder, it will refuse to start; same if it is write-protected or such…
or it will shut down almost immediately because it fails its write check.
Imagine I have two nodes A and B on the same VM and manage to start node A with the directory of node B.
In any case, the (apparently very fast) permanent disqualification for a temporary problem is what I am worried about the most. If I actually lose data, then DQ is the correct response, and I would probably be more upset about losing the other data in the pool than about losing the node.