GET_AUDIT and GET_REPAIR affect audit scores the same way, so you can count both when tracking how many audits are happening on your nodes. 40 consecutive failed GET_AUDIT or GET_REPAIR requests are enough to disqualify your node.
That is awesome news…
and for those not keeping up… this means audit scores should be much more stable…
tho nodes with more than 4% corrupted data will be DQ'd…
The new audit score is to avoid random unjustified DQ’s of nodes.
something like that should be good for everyone… except if one has a node that has been lucky enough to survive with more than 4% corrupted data… if so, then it's bad news
Never imagined that would or could be possible: RANDOM + UNJUSTIFIED. Sounds like Russian roulette. Wait, this is equivocal these times… It’s like, nah, i don’t want to go to work today and you’re guilty.
Thank you both.
Technically behind the scene? Or as displayed on the dashboard too?
If the dashboard keeps displaying the “real” value, it’s a problem IMHO: it was already weird and difficult to explain to newcomers that a node gets DQed when dropping below the arbitrary 60% (instead of 0%, as anyone would assume). If the new threshold is 96%, this score is hardly understandable on the dashboard anymore, is it?
So far, it does not appear that most people assume that the audit reputation can go down to 0% before a disqualification event. Think of the audit reputation as a measure of what percentage of data the satellite thinks your node has stored correctly.
But if you have more concrete recommendations for how to improve the UI, this forum would be a good place for that!
I can give that a go.
First off, DQ can indeed happen after 40 consecutive audit failures. This is up from only 10 before the change. But this wasn’t necessarily the intended goal of the changes. (@Alexey it might be good to mention that it is an increase from only 10 before in the top post)
From the original topic I outlined these issues:
Bottom line, the scores were extremely erratic even with minor data loss. Even if that data loss never changed, scores were jumping all over the place, giving node operators the impression that things got worse or better, while in fact the situation remained the same. And some nodes with significant data loss could survive, even though they shouldn’t have been allowed to.
We went through a lot of back and forth, but the new approach we landed on fixes all of these issues. The score now has a longer memory of past audits, making it change less rapidly and appear much more stable. The 96% threshold also much more closely reflects the allowed data loss, making the score a more meaningful number as well.
The adjustments we ended up arriving at were aimed both at fixing the problems listed before and at meeting these guidelines set out by @thepaul
Basically the idea was to let all nodes with up to 2% data loss survive, but DQ all nodes with 4% loss or more. In between those numbers it’s kind of the luck of the draw.
A node with 3% loss could survive for a long time, but may still eventually be disqualified if it runs into some bad luck causing more consecutive audits of lost pieces than you would normally expect.
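For intuition, here is a rough simulation of why that grey zone exists. It uses the alpha/beta reputation update discussed in the linked Tuning audit scoring thread, with a lambda of 0.999 and the 96% threshold; treat the formula and constants as my reading of that thread, not the authoritative satellite code. A node missing a fraction p of its pieces fails roughly that fraction of audits, so its score drifts toward 1 - p and then wobbles around it:

```python
import random

# Sketch of the audit reputation update (not the real satellite code):
# lambda = 0.999 forgetting factor, DQ threshold 0.96, audit weight 1.0.
LAM, WEIGHT, DQ = 0.999, 1.0, 0.96

def simulate(loss_fraction: float, audits: int, seed: int = 1) -> float:
    """Run `audits` random audits against a node that lost `loss_fraction`
    of its pieces and return the final reputation score."""
    rng = random.Random(seed)
    # A long-lived clean node sits near steady state: alpha = w/(1-lambda).
    alpha, beta = WEIGHT / (1 - LAM), 0.0
    score = 1.0
    for _ in range(audits):
        failed = rng.random() < loss_fraction
        alpha = LAM * alpha + (0.0 if failed else WEIGHT)
        beta = LAM * beta + (WEIGHT if failed else 0.0)
        score = alpha / (alpha + beta)
    return score

print(round(simulate(0.03, 50_000), 3))
```

With 3% loss the score hovers around 0.97, only one point above the DQ line, so an unlucky streak of audits landing on lost pieces can still push it under 0.96; with 2% loss the cushion is twice as large, which is why those nodes essentially always survive.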
So in summary:
- The new scores more accurately show the actual percentage of data loss on your node
- The scores are much more stable, meaning that a recovery of more than 1% means there has been an actual improvement. A drop of more than 1% means something has gotten worse.
- There is no longer a lot of luck involved in whether your node will survive or not, DQ now closely depends on actual data loss
- Nodes with more than acceptable data loss are no longer allowed to survive
- As a small bonus, nodes with temporary issues causing audit failures take 4x longer to be disqualified (this is not a complete solution to the temporary problem issue, but it helps nonetheless)
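The "40 consecutive failures" and "4x longer" figures can be sanity-checked with the same reputation formula. Assuming the old settings were lambda = 0.95 with a 60% threshold and the new ones are lambda = 0.999 with a 96% threshold (my reading of the linked thread; the real values live in the satellite configuration), a back-of-the-envelope calculation:

```python
def failures_until_dq(lam: float, threshold: float, weight: float = 1.0) -> int:
    """Consecutive failed audits needed to drag a perfect score below
    the DQ threshold, using the alpha/beta reputation update."""
    # A long-lived node with a clean record sits near the steady state
    # alpha = weight / (1 - lambda), beta = 0, i.e. score = 1.0.
    alpha, beta = weight / (1.0 - lam), 0.0
    failures = 0
    while alpha / (alpha + beta) >= threshold:
        alpha = lam * alpha           # failed audit: no credit for alpha
        beta = lam * beta + weight    # full weight lands on beta
        failures += 1
    return failures

print(failures_until_dq(0.95, 0.60))   # old settings: 10 failures
print(failures_until_dq(0.999, 0.96))  # new settings: 41 (the ~40 above)
```

Under these assumptions the old settings DQ a perfect node after 10 straight failures and the new ones after about 41, which lines up with the roughly 4x improvement mentioned above.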
Hope that clears things up!
And thanks again @thepaul (and everyone else working on this behind the scenes) for putting so much effort into this and working closely with the node operator community to arrive at these changes. I’m excited to see this be implemented now! I think it will be great for node operators as well as for the network.
Some more highlights.
This is what scores would do over time with 15% data loss, before the implemented changes.
This is a graph of scores with 4% data loss with the new changes.
Note: the threshold line is at 95 here. This was bumped to 96 to better match the intended allowed data loss.
Here are some tests run with the new settings in a simulation script built by @thepaul.
That script can be found in this post by him: Tuning audit scoring - #16 by thepaul
And these are the changes we eventually landed on for those who want to dive into the math:
Amazing summary, thank you! @BrightSilence
Does the point above include the case when there was minor data loss in the past and its share is shrinking due to new data uploaded since then?
Maybe here on the forum as this piece of info has been repeated many times. Most SNOs aren’t on the forum though.
We had extensive discussions on that matter 2 years ago. But the subject did not attract much attention from the community in the end (vote wise).
You might want to have a peek though, here:
Or, about 10 minutes for my node. That’s … fast.
but the odds of picking 40 pieces that are all bad is astronomically low on a system with less than 4% data loss…
the odds should be something like 4% of 4% of 4% … 40 times in sequence.
so 4% x 0.04 = 0.16%, that's the 2nd, and the 3rd is 0.0064%
the 9th would be 0.0000000000262%
the 20th would be 1.1e-26%… basically just think of the e-26 as how many zeros go in front.
now is a perfect time to introduce some scale.
lets take something we can maybe loosely imagine.
the estimated number of sand grains on earth.
9.6 x 10^13 x (8 x 10^12) ≈ 7.7 x 10^26 sand grains
duno how accurate that is, but it seems around what i would have expected.
so from now on, think of it as a 4% chance of picking the same grain of sand twice in a row.
i think… or less, tho this example doesn’t last for long, because in just a step or two a single grain is already way too much
and the 40th step takes us to about 1.2e-54%
so the chance that your node with 4% data loss hits an audit on lost data 40 times in a row would be about 1.2e-54%
or something like that…
it basically cannot happen…
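The arithmetic above is easy to verify: assuming each audit picks a piece independently, the chance of n audits in a row all landing on lost data on a node with 4% loss is simply 0.04^n:

```python
# chance that n consecutive audits all hit lost pieces on a node
# with 4% data loss, printed as a percentage
for n in (2, 3, 9, 20, 40):
    print(f"{n} in a row: {0.04 ** n * 100:.3g}%")
```

This reproduces the 0.16%, 0.0064%, and ~1.2e-54% figures above.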
We have at least one SNO affected by this new score system. They had more than 4% data loss in the past (they also run dozens of nodes on one server, so dozens of nodes were affected); the disqualifications happened after 4.5 hours in their case. The reason is “file not found” during GET_REPAIR.
I am not worried about permanent data loss disqualifying my node. In that case it would not matter to me if it was 10 minutes or 10 days, there would be nothing I could do about it.
On the other hand, there may be situations where the data is inaccessible temporarily (io frozen, controller failed, backplane failed, usb cable fell out, node started with the wrong data directory etc), in this case the data would still be there, but every attempt to access it would fail, so I would only have 10 minutes or so to notice it and react.
again, this new audit score is not supposed to fix such issues; there are other failsafes to take care of that…
even running with the old audit score system allowing only 10 consecutive audit failures, i haven’t had any issues, even when my zfs storage was stalled out for hours.
also i don’t think inaccessible counts as failed exactly; if it did, then i would have had nodes that failed long ago.
i get that you are worried about such things, as am i… but thus far i haven’t seen any signs that i should be worried about that on my setup.
however it is very worrying when the node keeps running while the storage is basically inaccessible…
Yeah, I am worried more about this than about losing 4% of data (which would be something like 900GB for my node). It is possible, but it is more likely that my entire pool would fall apart than that I would lose “just” 900GB of data.
900GB is quite a lot, I could even back up my node and my backups would be valid for something like a month.
On the other hand, it is possible to rsync a node to a different server, run the new one for a while and then start the old one for some reason (forgetting to turn off autostart) and not notice it for 10 minutes.
Or, having more than one node on a VM, start a node and give it the wrong directory and not notice that for 10 minutes.
yeah i did start the same node twice one time… then they will share uptime… lol
basically it will switch between each node because only one will be accepted by the network…
so it will start to drop in uptime on both nodes, afaik.
and ofc after a while the files will be slightly different, but again… the odds of those being audited would be slim to none in most cases except for new nodes.
if you give a node the wrong folder, it will refuse to start, same if it is write protected or such…
or shutdown near immediately because it fails its write check.
Imagine I have two nodes A and B on the same VM and manage to start node A with the directory of node B.
In any case, the permanent disqualification (apparently very fast) for a temporary problem is what I am worried about the most. If I actually lose data, then DQ is the correct response, and I would probably be more upset about losing the other data in the pool than the node.
You can’t. It checks for matching node ID between identity and data. The node won’t start that way.
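For illustration, the guard being described works roughly like this (a hypothetical sketch: the function and variable names are mine, and the real check lives in the storagenode code):

```python
def check_storage_dir(identity_node_id: str, stored_node_id: str) -> None:
    """Refuse to start if the storage directory belongs to another node.

    Hypothetical sketch of the ID check described above: the node ID
    derived from the identity must match the one recorded in the data dir.
    """
    if stored_node_id and stored_node_id != identity_node_id:
        raise SystemExit(
            f"storage dir belongs to node {stored_node_id}, "
            f"identity is {identity_node_id}: refusing to start"
        )

check_storage_dir("nodeA", "nodeA")  # matching IDs: node starts fine
# check_storage_dir("nodeA", "nodeB") would abort before serving any data
```

So starting node A against node B's directory fails at startup, before any audits can be failed.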
Understandable, but a lot of checks are already in place, like the one I mentioned above + readability/writeability checks. And I know dealing with IO timeouts was also in the works.
That said, this change wasn’t about that. However it does bring other big improvements we can all celebrate.
@thepaul are changes to the dashboard to incorporate the new scoring planned already?
Screenshots posted here: Your Storage Node on the us-central-1 satellite has been disqualified and can no longer host data on the network - #2 by faga
They show that red highlighting isn’t happening when a node is below 96% (not even a yellow warning), and the info text still describes 60% as the threshold. I don’t see any changes on GitHub for this yet.
Perhaps, given the relatively small allowed range (96-100%), the yellow warning should appear below 99.5%, or at least 99%.
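To make that suggestion concrete, here is a hypothetical severity mapping with the yellow band starting at 99.5% (the function and band edges are mine, purely a sketch of the proposal, not actual dashboard code):

```python
def audit_status(score: float) -> str:
    """Map an audit score (0.0 to 1.0) to a hypothetical dashboard severity."""
    if score < 0.96:
        return "red"     # at or past the DQ threshold
    if score < 0.995:
        return "yellow"  # safety margin is shrinking: warn the operator early
    return "green"       # healthy

print(audit_status(0.97))  # under these bands: yellow, not silently green
```

The point is that with only four percentage points between perfect and disqualified, a score like 97% already deserves a visible warning.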
Just checking that these adjustments are in scope.
Good call. @heunland noted this problem too. We’re getting it prioritized and assigned now.