Tuning audit scoring

In a recent topic the subject of data loss and resulting audit scores and disqualification came up. It triggered me to do some research into how the current system works and what fluctuations in scores are to be expected.

In order to investigate, I created a simulator that models audits and the resulting scores based on an average data loss percentage. This is what 15% data loss looks like with the current formulas:

This approach has several issues:

  • The scores are extremely volatile and unpredictable
  • Disqualification is inconsistent and the timing is entirely determined by random “luck”
  • A node operator might look at the score and see it increasing and assume issues are solved
  • Nodes with over 10% of data lost are allowed to survive for very long periods of time (nearly indefinitely)
  • After failed audits, the score recovers just as fast, so failures never really stick

Now, since this system works based on random audits, some fluctuation in scores is to be expected. But the beta-distribution approach that is used actually allows you to tune for this. So I would suggest raising the lambda value (which represents the “memory” of this process, i.e. how heavily the previous score counts in the newly calculated score) significantly, to 0.999.
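To make that concrete, here is the update that happens on every audit, written out the way I use it in the Python script further down (the function name and defaults are mine). The score is tracked with two counters, alpha and beta, and equals alpha / (alpha + beta):

# one audit update; v is +1 for a success and the failure value
# (currently -1, suggested -20 below) for a failure
def update(alpha, beta, v, lam=0.999, w=1.0):
    alpha = lam * alpha + w * (1 + v) / 2
    beta = lam * beta + w * (1 - v) / 2
    return alpha, beta, alpha / (alpha + beta)

With lam = 0.999 the previous counters keep 99.9% of their weight, so a single audit barely moves the score: starting from a perfect long-run score (alpha = 1000, beta = 0), one failed audit with v = -1 only drops it to 999 / 1000 = 0.999.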


That fixes consistency, but now the node with 15% loss will definitely survive. To fix this we can alter the change in value when you fail an audit. I suggest changing that from -1 to -20. This counts failures much more heavily than successes. It does introduce a little more fluctuation back into this value, but not nearly as much as it used to be.
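To put my own numbers on that: starting again from a perfect score (alpha = 1000, beta = 0), a single failed audit with the new value of -20 gives alpha = 0.999 * 1000 + (1 - 20) / 2 = 989.5 and beta = (1 + 20) / 2 = 10.5, so the score drops to 989.5 / 1000 ≈ 0.99. That is about ten times the drop from the v = -1 example above, while a success still only adds w * (1 + 1) / 2 = 1 to alpha.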

uh oh… our node with 15% data loss isn’t doing so well. But… should it do well? It lost 15% of data. It seems pretty reasonable to disqualify that node. In fact the numbers I suggest are tuned to allow for about 2% data loss maximum.

That looks more like it. There is still some randomness to it, but it is controlled. A node on the verge of getting disqualified won’t have scores jumping from 60% to 100% and back all the time; instead, the score at any time actually gives a good representation of the node’s data quality.
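As a back-of-the-envelope check of that ~2% figure (my own derivation, using the same formulas as the script below): with w = 1 and lambda = 0.999 the sum alpha + beta settles at 1 / (1 - lambda) = 1000, and each audit adds on average (1 - p) * 1 + p * (1 - 20) / 2 = 1 - 10.5 * p to alpha for a node missing a fraction p of its data. The long-run average score is therefore about 1 - 10.5 * p: roughly 0.79 at 2% loss (which matches the 100-million-audit result further down) and it only reaches the 0.6 disqualification threshold at about 3.8% loss. Random fluctuation around that average is what eventually pulls nodes with a bit more than 2% loss below the line, while 2% itself tends to stay above it.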

A public copy of the sheet I used can be found here: Storj Audit Score Simulator PUBLIC - Google Sheets
Anyone can edit, so please be respectful and don’t destroy the sheet. If someone else is testing tweaks to the numbers, wait for them to finish or copy the sheet to your own Google account to play with. After updating the numbers at the top it takes a while for all the values and the graph to update. To run a new simulation with the same numbers, just click an empty cell and press delete.

Now, testing with 10,000 audits is nice, but it doesn’t give the whole picture. I wanted a better idea of what could be expected long term, so I wrote a small Python script to check with 100 million audits.

from random import random

# parameters of the scoring formula
l = 0.999    # lambda: the "memory" of the score
w = 1        # weight of each audit
sv = 1       # value of a successful audit
fv = -20     # value of a failed audit (suggested; currently -1)
dq = 0.6     # disqualification threshold

loss = 0.02  # fraction of data lost by the simulated node

# start alpha (a) and beta (b) at the steady state of a node with a perfect score
a = 1/(1-l)
b = 0

for n in range(100000000):
    # each audit hits a lost piece with probability `loss`
    if random() < loss:
        v = fv
    else:
        v = sv
    a = l*a + (w*(1+v))/2
    b = l*b + (w*(1-v))/2
    score = a / (a+b)
    if score < dq:
        break

print("Score = {} after {} audits".format(score, n+1))

With 2% data loss the node tends to survive.

Score = 0.7858662201182666 after 100000000 audits

With 2.1% it generally gets disqualified at some point

Score = 0.5999645458940869 after 12795072 audits

And of course the higher the data loss the faster the node gets disqualified.

RECAP OF SUGGESTED CHANGES
So I suggest changing lambda to 0.999 and the audit failure value from -1 to -20.

So let’s take a look at my original issues with the current formula and see how the new formula deals with them:

  • The scores are extremely volatile and unpredictable
    Scores are now more stable and approach actual node quality over time
  • Disqualification is inconsistent and the timing is entirely determined by random “luck”
    Disqualification now happens consistently based on the amount of data lost, and the timeframes are much more consistent
  • A node operator might look at the score and see it increasing and assume issues are solved
    The score at any time gives a good indication of how the node is doing now. Momentary fluctuations will still happen, but deviations from the mean are small.
  • Nodes with over 10% of data lost are allowed to survive for very long periods of time (nearly indefinitely)
    Nodes with 3% or more data loss will now get disqualified promptly
  • After failed audits, the score recovers just as fast, so failures never really stick
    Recovering from a bad score now takes 20x as long as dropping the score on a failed audit does. As a result, failed audits are more sticky and issues are reflected in the score for longer. Therefore, intermittent problems don’t go unnoticed.

Other considerations:

  • Raising the lambda value causes scores to not drop as fast as they did before, which could lead to really bad nodes not being disqualified as fast
    While this is true, it doesn’t take a lot longer, because this effect is largely compensated by the much higher impact of audit failures on the score. It takes about twice as long on average for nodes with 40%+ data loss to be disqualified. But the tradeoff is that nodes with 3%+ data loss do get disqualified. This should lead to a more reliable node base on aggregate.
  • Implementing a change like this could disqualify many nodes in a short period of time.
    To compensate for this I recommend first introducing this change with an audit fail value of -5 and incrementally lowering it to -20 over time while keeping an eye on repair impact.

Bonus suggestion
Now that there is a more consistent representation of node data quality, we can actually do something new. We could mark the pieces of a node with an audit score below a warning threshold (for example 95%) as unhealthy, and at the same time lower the incoming traffic for that node (these nodes could be part of the node selection process for unvetted nodes). This will result in data on that node slowly being repaired to other nodes, potentially reducing and fixing the troublesome data, while at the same time lowering the risk of new data stored on that node getting lost. This will rebalance good and bad data on the node and will allow nodes that have resolved the underlying issue to slowly recover, while nodes that still have issues will keep dropping in score anyway and fail soon enough. This also provides an additional incentive to keep your node at higher audit scores: you keep all ingress and avoid losing data to repair.

An interesting read and I’m sure the data scientists at storjlabs have run many simulations already and will be able to provide more insight into their results (if they can take the time to share their thoughts).

About your proposed changes: This might work fine for bigger nodes but what about very young nodes that might still be in the vetting period and just had bad luck e.g. with a file corruption early on?

Young nodes will actually take a little longer to fail with these settings. And since they get a lot of incoming data compared to stored data, they are in the best position to overcome (outgrow) those issues. But in general these settings would be a little more forgiving early on due to a slower drop in scores.

looks like really great work, i think maybe the reason why it was made the way it was… in the past the disconnection of an hdd while the node was running was a possible event…

it doesn’t really seem a common event anymore with all the new failsafes, but the quick DQ of nodes would also make it possible that some errors which could have been solved might DQ a node…

not sure if that needs to be taken into account today… but maybe it was why it was random and over long periods…

i think maybe there should be some kind of time limit on just how fast a node could in theory be DQ… i mean your example of 2.1% data loss would mean that if that happened to my node, then it would be DQ on a busy day in maybe a few hours…
since it will at times do like 2000 or 2500 audits in a day, which i think is the highest i’ve seen.

just taking the highest activity day this month, i got 1426 audits in 24 hours…
so that’s 8 hours if we assume the 60% mark is about 500 failed audits…

so in theory couldn’t my node have some kind of issue that looked like lost data, which then would get it … Promptly DQ :smiley:

in less time than it might take me to actually get to the node.

aside from that concern, which i’m sure is a concern for all of us… it would be bad if good nodes got DQ by mistake.

it does seem like a much better solution… but i don’t think it’s possible to know what kind of errors and issues to expect in the future… this would also make it possible for the entire network to get DQ in a day, wouldn’t it?

i mean somebody found a way to push a % of data to the network which would end up looking like bad data … or whatever…

what are the security implications of that…

i think removing the flux is good, however the fact that there is a certain randomness to the timing of DQ of bad nodes might really not be a bug but more of a feature imho

it might react a bit too immediately. but i dunno…
i will give this a vote because i want to see the volatility of the audit score removed, but i think there might be issues with this approach of yours.

and it should be taken into account that an ISP may have a connection failure.
Maybe when you hit the DQ value you could get a message (email or dashboard) and still have time X to solve the problem (if possible) before you get DQ.

I said this in my previous comment, but I’ll repeat it again: it actually disqualifies nodes a little slower on catastrophic failure. With 100% audit failures the new formula would take about 40 audits to disqualify a node, while the old one did it in about 10. This is because of the higher “memory” (lambda). That scenario should really only occur if the data is removed though, since there are now checks in place to kill the node, or refuse to start it, when the data location isn’t available or doesn’t belong to this node. My main concern here would not be disqualifying nodes too fast, but not fast enough: such a node is pretty much guaranteed to be unrecoverable and we should get rid of it quickly. I think, however, that implementing my bonus suggestion would prevent these situations from causing more harm than the current system does, while still providing some time to fix a possible unforeseen issue. Either way, my suggestion would be an improvement on the current implementation for this concern.
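For anyone who wants to check that 40 vs. 10 figure, here is a minimal version of the same loop with 100% audit failures. I’m assuming the current lambda is 0.95 here (the current failure value is -1); the suggested settings are 0.999 and -20:

# audits needed to reach the DQ threshold when every audit fails,
# starting from a perfect long-run score
def audits_to_dq(lam, fv, w=1.0, dq=0.6):
    a, b = 1 / (1 - lam), 0.0
    n = 0
    while a / (a + b) >= dq:
        a = lam * a + w * (1 + fv) / 2
        b = lam * b + w * (1 - fv) / 2
        n += 1
    return n

print(audits_to_dq(0.95, -1))    # current settings (lambda assumed to be 0.95): 10
print(audits_to_dq(0.999, -20))  # suggested settings: 39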

Referring back to what I already said, the new suggestions would smooth out the curve and drop scores a little slower. This basically implements that time limit, while at the same time not allowing significant data loss to exist on the network long term. Your example of 2.1% would take a very long time to lead to disqualification. Here’s one simulation I ran.

Score = 0.5999645458940869 after 12795072 audits

That’s nearly 13 million audits. But even 3% data loss would always take at least 1000 audits to fail, and usually tens of thousands. When the data loss isn’t that large, there is a timeframe in which to recover from it in some way or another (running fsck to fix file system issues, for example).
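To put that in perspective, using the 1426 audits per day mentioned above (actual audit rates vary per node and will grow over time, so treat this as an order-of-magnitude illustration): 12,795,072 audits / 1,426 audits per day ≈ 8,970 days, or roughly 24 years. Even the 1000-audit floor for 3% loss would be the better part of a day at that rate, not a few hours.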

In short, this change will do two things with regards to disqualification:

  • Slow down when nodes are disqualified
  • Lower the threshold of allowed data loss

So while more nodes will be disqualified, it actually takes a little longer for high data loss cases to be disqualified. Feel free to use the sheet I linked to run your own comparisons and verify this with different scenarios. And keep in mind that it simulates random data, so you may need to run a few simulations to get a good feel for certain settings. If you’re still concerned, please provide a specific example that worries you, because I think so far these concerns are based on wrong assumptions. It would be better to discuss specifics.

I have no idea what scenario you imagine here, so please be a little more specific. But given the above, it is highly likely that any scenario that could lead to that with these new settings, would lead to that faster with the old settings.

The new suggestion is more strict, but less aggressive. I think that’s the best way to sum it up.

This is about audit reputation only. Downtime doesn’t factor into it; audit reputation only takes into account audits while your node is online. Uptime scores are a separate system.

This feedback really belongs in this thread, so I’m answering here.

As a special treat for you I ran that simulation on the original settings.


So yeah, that fits. The erratic nature of these settings causes fairly deep dips from time to time. But they’re infrequent and your node would indeed survive, as you’ve seen.
To verify that, I also ran 100 million simulated audits in Python.

Score = 0.9678485255896826 after 100000000 audits

No argument there. In fact, the new publicly available stats show that none of the satellites have ever lost a single segment. But being more strict, yet fair, with nodes gives more room to, for example, reduce the expansion factor, which with the new pricing might help lower the negative impact on payouts. It’s best to finely tune all aspects to get the most out of the entire system. Additionally, my primary goal was to get rid of the erratic and unpredictable nature of this score, because right now it really doesn’t tell the node operator how well the node is doing at all.


Is there a level of data loss that is considered acceptable though?

Triggering a full repair on a node with 95% good data might not make economic sense

You’ve got to remember that it could be possible for none of the ‘good’ data to need to be repaired. There could be enough pieces still in the network.

It basically comes down to statistics then. And remember, I’m not arguing that the current setup isn’t reliable enough. It definitely is.

But before I get into the numbers: people are bad at statistics and get hung up on “so it CAN happen?”. Any data center deals with chances of losing a file and Storj is no different. Eliminating those chances is impossible. What you can do is know the risk and lower it. So with that in mind, let’s get into it.

Example at a 5% loss:
The minimum availability of any segment on the satellites is 48 pieces, of which 29 need to be good (thanks to the new stats we now know the 48 number!).
So if 20 of those 48 pieces are bad, the segment is lost. We can now calculate the worst-case scenario where every node is missing 5% of its data:

(0.05^20)*(48 ncr 20) = 1.60E-13

So that’s an industry-leading twelve 9’s of reliability. This is clearly plenty, and it isn’t a realistic scenario anyway: if you allow 5%, very few nodes will ever actually have 5% loss and the vast majority will have 0% or near-0% loss. It would be awesome if Storj Labs could publish aggregated audit success rates in deciles/percentiles so we could get some insight into the spread, but they will have much better numbers than I can work with.

But if nodes were more reliable in general say 2% loss, they could choose to only trigger repair at 42 pieces left and actually increase reliability. This would require repair to trigger more often and reduce expansion factor by up to a third. Both of which save a lot of money. The tradeoff is that some additional nodes may be disqualified. But these numbers are likely low and I’m pretty sure it would be a net positive. It also allows them to send more of the income from customers to node operators because they have to do less repair and have to store fewer bytes per piece uploaded.
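For anyone who wants to play with these numbers, here is the same single-term estimate as a small Python helper (the function name is mine). It assumes the worst case above where every node misses the same fraction of data, and as far as I can tell it errs on the pessimistic side because it ignores the probability that the remaining pieces are all good:

from math import comb

# a segment is unrecoverable once (pieces - needed + 1) of its pieces are bad
def worst_case_segment_loss(pieces, needed, loss):
    bad = pieces - needed + 1
    return comb(pieces, bad) * loss ** bad

print(worst_case_segment_loss(48, 29, 0.05))  # ~1.6e-13, the number above
print(worst_case_segment_loss(42, 29, 0.02))  # repair at 42 pieces, 2% loss: ~8.7e-14

So even with the lower repair threshold and 2% allowed loss, this worst-case estimate comes out slightly better than the 5% example above.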

All of this was not even the main reason I posted this, though. You could easily change the audit fail value from -20 to -15 or -10 in my suggestion and keep the stability and predictability of the audit score, while allowing nodes with higher data loss percentages to survive. But since that comes at the cost of efficiency in the network, we all pay for allowing less reliable nodes. So I would prefer to be a little more strict.

This is some really excellent analysis!

To be honest, we haven’t been putting a lot of effort into evaluating audit scoring in the past year or so. It’s been “reliable enough” that we haven’t needed to go back and see how the evolving network fits with what we came up with originally.

I’m going to try and replicate your findings briefly, but if my results match yours I expect we’ll go ahead and make the suggested changes. Thank you for your very in-depth work!


Thanks for the response, that’s great news!
I agree that the current system has been good enough, especially for reliability. There was never any risk and that’s priority one.

Let me know if there is anything else I can do to help with this!


As I mentioned over in (broken link), I believe we’re going to go ahead and make your suggested changes, or something very close to them. I’ll update again when that happens.


Your link though shows this

Oh dang, I guess that was a DM discussion. I thought it was a regular forum thread. I guess I’m not very good at this.

The gist of the message was simply:

I reproduced your results and have been modeling some scenarios to show how the change will affect different hypothetical SNOs. I still have some final tuning to do, but the change should be merged soon.
