GET_AUDIT and GET_REPAIR affect audit scores the same way, so you can count both when tracking how many audits are happening on your nodes. 40 consecutive failed GET_AUDIT or GET_REPAIR requests are enough to disqualify your node.
That is awesome news…
and for those not keeping up… this means audit scores should be much more stable…
tho nodes with more than 4% corrupted data will be DQ'd…
The new audit score is to avoid random unjustified DQ’s of nodes.
something like that should be good for everyone… except if one has a node that has been lucky enough to survive with more than 4% corrupted data… if so, then it's bad news
Never imagined that would or could be possible: RANDOM + UNJUSTIFIED. Sounds like Russian roulette. Wait, this is equivocal these times… It’s like, nah, i don’t want to go to work today and you’re guilty.
Thank you both.
Technically behind the scene? Or as displayed on the dashboard too?
If the dashboard keeps displaying the “real” value, it’s a problem IMHO: it was already weird and difficult to explain to newcomers that a node gets DQed when dropping below the arbitrary 60% (instead of 0%, as anyone would assume). If the new threshold is 96%, this score is hardly understandable on the dashboard anymore, is it?
So far, it does not appear that most people assume that the audit reputation can go down to 0% before a disqualification event. Think of the audit reputation as a measure of what percentage of data the satellite thinks your node has stored correctly.
But if you have more concrete recommendations for how to improve the UI, this forum would be a good place for that!
I can give that a go.
First off, DQ can indeed happen after 40 consecutive audit failures. This is up from only 10 before the change. But this wasn’t necessarily the intended goal of the changes. (@Alexey it might be good to mention that it is an increase from only 10 before in the top post)
From the original topic I outlined these issues:
Bottom line, the scores were extremely erratic even with minor data loss. Even if that data loss never changed, scores were jumping all over the place, giving node operators the impression that things got worse or better, while in fact the situation remained the same. And some nodes with significant data loss could survive, even though they shouldn’t have been allowed to.
We went through a lot of back and forth, but the new approach we landed on fixes all of these issues. The score now has a longer memory of past audits, making it change less rapidly and appear much more stable. The 96% threshold also much more closely reflects the allowed data loss, making the score a more meaningful number as well.
The adjustments we ended up arriving at were aimed both at fixing the problems listed before and at meeting these guidelines set out by @thepaul
Basically the idea was to let all nodes with up to 2% data loss survive, but DQ all nodes with 4% loss or more. In between those numbers it’s kind of the luck of the draw.
A node with 3% loss could survive for a long time, but may still eventually be disqualified if it runs into some bad luck causing more consecutive audits of lost pieces than you would normally expect.
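For intuition, here is a rough simulation of why that grey zone exists. It uses the alpha/beta reputation update discussed in the linked Tuning audit scoring thread, with a lambda of 0.999 and the 96% threshold; treat the formula and constants as my reading of that thread, not the authoritative satellite code. A node missing a fraction p of its pieces fails roughly that fraction of audits, so its score drifts toward 1 - p and then wobbles around it:

```python
import random

# Sketch of the audit reputation update (not the real satellite code):
# lambda = 0.999 forgetting factor, DQ threshold 0.96, audit weight 1.0.
LAM, WEIGHT, DQ = 0.999, 1.0, 0.96

def simulate(loss_fraction: float, audits: int, seed: int = 1) -> float:
    """Run `audits` random audits against a node that lost `loss_fraction`
    of its pieces and return the final reputation score."""
    rng = random.Random(seed)
    # A long-lived clean node sits near steady state: alpha = w/(1-lambda).
    alpha, beta = WEIGHT / (1 - LAM), 0.0
    score = 1.0
    for _ in range(audits):
        failed = rng.random() < loss_fraction
        alpha = LAM * alpha + (0.0 if failed else WEIGHT)
        beta = LAM * beta + (WEIGHT if failed else 0.0)
        score = alpha / (alpha + beta)
    return score

print(round(simulate(0.03, 50_000), 3))
```

With 3% loss the score hovers around 0.97, only one point above the DQ line, so an unlucky streak of audits landing on lost pieces can still push it under 0.96; with 2% loss the cushion is twice as large, which is why those nodes essentially always survive.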
So in summary:
- The new scores more accurately show the actual percentage of data loss on your node
- The scores are much more stable, meaning that a recovery of more than 1% means there has been an actual improvement. A drop of more than 1% means something has gotten worse.
- There is no longer a lot of luck involved in whether your node will survive or not, DQ now closely depends on actual data loss
- Nodes with more than acceptable data loss are no longer allowed to survive
- As a small bonus, nodes with temporary issues causing audit failures take 4x longer to be disqualified (this is not a complete solution to the temporary problem issue, but it helps nonetheless)
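The "40 consecutive failures" and "4x longer" figures can be sanity-checked with the same reputation formula. Assuming the old settings were lambda = 0.95 with a 60% threshold and the new ones are lambda = 0.999 with a 96% threshold (my reading of the linked thread; the real values live in the satellite configuration), a back-of-the-envelope calculation:

```python
def failures_until_dq(lam: float, threshold: float, weight: float = 1.0) -> int:
    """Consecutive failed audits needed to drag a perfect score below
    the DQ threshold, using the alpha/beta reputation update."""
    # A long-lived node with a clean record sits near the steady state
    # alpha = weight / (1 - lambda), beta = 0, i.e. score = 1.0.
    alpha, beta = weight / (1.0 - lam), 0.0
    failures = 0
    while alpha / (alpha + beta) >= threshold:
        alpha = lam * alpha           # failed audit: no credit for alpha
        beta = lam * beta + weight    # full weight lands on beta
        failures += 1
    return failures

print(failures_until_dq(0.95, 0.60))   # old settings: 10 failures
print(failures_until_dq(0.999, 0.96))  # new settings: 41 (the ~40 above)
```

Under these assumptions the old settings DQ a perfect node after 10 straight failures and the new ones after about 41, which lines up with the roughly 4x improvement mentioned above.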
Hope that clears things up!
And thanks again @thepaul (and everyone else working on this behind the scenes) for putting so much effort into this and working closely with the node operator community to arrive at these changes. I’m excited to see this be implemented now! I think it will be great for node operators as well as for the network.
Some more highlights.
This is what scores would do over time with 15% data loss, before the implemented changes.
This is a graph of scores with 4% data loss with the new changes.
Note: the threshold line is at 95 here. This was bumped to 96 to better match the intended allowed data loss.
Here are some tests run with the new settings in a simulation script built by @thepaul.
That script can be found in this post by him: Tuning audit scoring - #16 by thepaul
And these are the changes we eventually landed on for those who want to dive into the math:
Amazing summary, thank you! @BrightSilence
Does the point above include the case when there was minor data loss in the past and its share is shrinking due to new data uploaded since then?
Maybe here on the forum as this piece of info has been repeated many times. Most SNOs aren’t on the forum though.
We had extensive discussions on that matter 2 years ago. But the subject did not attract much attention from the community in the end (vote wise).
You might want to have a peek though, here:
Or, about 10 minutes for my node. That’s … fast.
but the odds of picking 40 pieces that are all bad is astronomically low on a system with less than 4% data loss…
the odds should be something like 4% of 4% of 4% … 40 times in sequence.
so 4% x 0.04 = 0.16%, that's the 2nd, and the 3rd is 0.0064%
the 9th would be 0.0000000000262%
the 20th would be 1.1e-26%… basically just think of the e-26 as how many zeros go in front.
now is a perfect time to introduce some scale.
lets take something we can maybe loosely imagine.
the estimated number of sand grains on earth.
9.6 x 10^13 x (8 x 10^12) ≈ 7.7 x 10^26 sand grains
duno how accurate that is, but it seems around what i would have expected.
so from now on, think of it as a 4% chance of picking the same grain of sand twice in a row.
i think… or less, tho this example doesn’t last for long, because in just a step or two a single grain is already way too much
and the 40th step takes us to about 1.2e-54%
so the chance that your node with 4% data loss hits an audit on lost data 40 times in a row would be about 1.2e-54%
or something like that…
it basically cannot happen…
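The arithmetic above is easy to verify: assuming each audit picks a piece independently, the chance of n audits in a row all landing on lost data on a node with 4% loss is simply 0.04^n:

```python
# chance that n consecutive audits all hit lost pieces on a node
# with 4% data loss, printed as a percentage
for n in (2, 3, 9, 20, 40):
    print(f"{n} in a row: {0.04 ** n * 100:.3g}%")
```

This reproduces the 0.16%, 0.0064%, and ~1.2e-54% figures above.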
We have at least one SNO affected by this new score system. They had more than 4% data loss in the past (they also run dozens of nodes on one server, so dozens of nodes were affected); the disqualifications happened after 4.5 hours in their case. The reason is “file not found” during GET_REPAIR.
I am not worried about permanent data loss disqualifying my node. In that case it would not matter to me if it was 10 minutes or 10 days, there would be nothing I could do about it.
On the other hand, there may be situations where the data is inaccessible temporarily (io frozen, controller failed, backplane failed, usb cable fell out, node started with the wrong data directory etc), in this case the data would still be there, but every attempt to access it would fail, so I would only have 10 minutes or so to notice it and react.
again, this new audit score is not supposed to fix such issues; there are other failsafes to take care of that…
even running with the old audit score system allowing only 10 consecutive audit failures, i haven’t had any issues, even when my zfs storage was stalled out for hours.
also i don’t think inaccessible counts as failed exactly; if it did, then i would have had nodes that failed long ago.
i get that you are worried about such things, as am i… but thus far i haven’t seen any signs that i should be worried about that on my setup.
however it is very worrying when the node keeps running while the storage is basically inaccessible…
Yeah, I am worried more about this than about losing 4% of data (which would be something like 900GB for my node). It is possible, but it is more likely that my entire pool would fall apart than that I would lose “just” 900GB of data.
900GB is quite a lot, I could even back up my node and my backups would be valid for something like a month.
On the other hand, it is possible to rsync a node to a different server, run the new one for a while and then start the old one for some reason (forgetting to turn off autostart) and not notice it for 10 minutes.
Or, having more than one node on a VM, start a node and give it the wrong directory and not notice that for 10 minutes.
yeah i did start the same node twice one time… then they will share uptime… lol
basically it will switch between each node because only one will be accepted by the network…
so it will start to drop in uptime on both nodes, afaik.
and ofc after a while the files will be slightly different, but again… the odds of those being audited would be slim to none in most cases except for new nodes.
if you give a node the wrong folder, it will refuse to start, same if it is write protected or such…
or shutdown near immediately because it fails its write check.
Imagine I have two nodes A and B on the same VM and manage to start node A with the directory of node B.
In any case, the permanent disqualification (apparently very fast) for a temporary problem is what I am worried about the most. If I actually lose data, then DQ is the correct response, and I would probably be more upset about losing the other data in the pool than the node.
You can’t. It checks for matching node ID between identity and data. The node won’t start that way.
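For illustration, the guard being described works roughly like this (a hypothetical sketch: the function and variable names are mine, and the real check lives in the storagenode code):

```python
def check_storage_dir(identity_node_id: str, stored_node_id: str) -> None:
    """Refuse to start if the storage directory belongs to another node.

    Hypothetical sketch of the ID check described above: the node ID
    derived from the identity must match the one recorded in the data dir.
    """
    if stored_node_id and stored_node_id != identity_node_id:
        raise SystemExit(
            f"storage dir belongs to node {stored_node_id}, "
            f"identity is {identity_node_id}: refusing to start"
        )

check_storage_dir("nodeA", "nodeA")  # matching IDs: node starts fine
# check_storage_dir("nodeA", "nodeB") would abort before serving any data
```

So starting node A against node B's directory fails at startup, before any audits can be failed.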
Understandable, but a lot of checks are already in place, like the one I mentioned above + readability/writeability checks. And I know dealing with IO timeouts was also in the works.
That said, this change wasn’t about that. However it does bring other big improvements we can all celebrate.
@thepaul are changes to the dashboard to incorporate the new scoring planned already?
Screenshots posted here: Your Storage Node on the us-central-1 satellite has been disqualified and can no longer host data on the network - #2 by faga
They show that red highlighting isn’t happening when a node is below 96% (not even a yellow warning), and the info text still describes 60% as the threshold. I don’t see any changes on GitHub for this yet.
Perhaps, given the relatively small allowed range (96-100%), the yellow warning should appear below 99.5%, or at least 99%.
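To make that suggestion concrete, here is a hypothetical severity mapping with the yellow band starting at 99.5% (the function and band edges are mine, purely a sketch of the proposal, not actual dashboard code):

```python
def audit_status(score: float) -> str:
    """Map an audit score (0.0 to 1.0) to a hypothetical dashboard severity."""
    if score < 0.96:
        return "red"     # at or past the DQ threshold
    if score < 0.995:
        return "yellow"  # safety margin is shrinking: warn the operator early
    return "green"       # healthy

print(audit_status(0.97))  # under these bands: yellow, not silently green
```

The point is that with only four percentage points between perfect and disqualified, a score like 97% already deserves a visible warning.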
Just checking that these adjustments are in scope.
Good call. @heunland noted this problem too. We’re getting it prioritized and assigned now.