Tuning audit scoring

In a recent topic the subject of data loss and resulting audit scores and disqualification came up. It triggered me to do some research into how the current system works and what fluctuations in scores are to be expected.

In order to investigate, I created a simulator that models audits and the resulting scores based on an average data loss percentage. This is what 15% data loss looks like with the current formulas:

This approach has several issues:

  • The scores are extremely volatile and unpredictable
  • Disqualification is inconsistent and the timing is entirely determined by random “luck”
  • A node operator might look at the score and see it increasing and assume issues are solved
  • Nodes with over 10% of data lost are allowed to survive for very long periods of time (nearly indefinitely)
  • After failed audits, the score recovers just as fast, so failures never really stick

Now, since this system works based on random audits, some fluctuation in scores is to be expected. But the beta-distribution approach that is used actually allows you to tune for this. So I would suggest raising the lambda value (which represents the “memory” of this process, i.e. how heavily the previous score counts in the newly calculated score) significantly, to 0.999.
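To make that concrete, here is the update that happens on every audit, written out the way I use it in the Python script further down (the function name and defaults are mine). The score is tracked with two counters, alpha and beta, and equals alpha / (alpha + beta):

# one audit update; v is +1 for a success and the failure value
# (currently -1, suggested -20 below) for a failure
def update(alpha, beta, v, lam=0.999, w=1.0):
    alpha = lam * alpha + w * (1 + v) / 2
    beta = lam * beta + w * (1 - v) / 2
    return alpha, beta, alpha / (alpha + beta)

With lam = 0.999 the previous counters keep 99.9% of their weight, so a single audit barely moves the score: starting from a perfect long-run score (alpha = 1000, beta = 0), one failed audit with v = -1 only drops it to 999 / 1000 = 0.999.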


That fixes consistency, but now the node with 15% loss will definitely survive. To fix this we can alter the change in value when you fail an audit. I suggest changing that from -1 to -20. This counts failures much more heavily than successes. It does introduce a little more fluctuation back into this value, but not nearly as much as it used to be.
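To put my own numbers on that: starting again from a perfect score (alpha = 1000, beta = 0), a single failed audit with the new value of -20 gives alpha = 0.999 * 1000 + (1 - 20) / 2 = 989.5 and beta = (1 + 20) / 2 = 10.5, so the score drops to 989.5 / 1000 ≈ 0.99. That is about ten times the drop from the v = -1 example above, while a success still only adds w * (1 + 1) / 2 = 1 to alpha.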

uh oh… our node with 15% data loss isn’t doing so well. But… should it do well? It lost 15% of data. It seems pretty reasonable to disqualify that node. In fact the numbers I suggest are tuned to allow for about 2% data loss maximum.

That looks more like it. There is still some randomness to it, but it is controlled. A node on the verge of getting disqualified won’t have scores jumping from 60% to 100% and back all the time; instead, the score at any time actually gives a good representation of the node’s data quality.
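As a back-of-the-envelope check of that ~2% figure (my own derivation, using the same formulas as the script below): with w = 1 and lambda = 0.999 the sum alpha + beta settles at 1 / (1 - lambda) = 1000, and each audit adds on average (1 - p) * 1 + p * (1 - 20) / 2 = 1 - 10.5 * p to alpha for a node missing a fraction p of its data. The long-run average score is therefore about 1 - 10.5 * p: roughly 0.79 at 2% loss (which matches the 100-million-audit result further down) and it only reaches the 0.6 disqualification threshold at about 3.8% loss. Random fluctuation around that average is what eventually pulls nodes with a bit more than 2% loss below the line, while 2% itself tends to stay above it.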

A public copy of the sheet I used can be found here: Storj Audit Score Simulator PUBLIC - Google Sheets
Anyone can edit, so please be respectful and don’t destroy the sheet. If someone else is testing tweaks to the numbers, wait for them to finish or copy the sheet to your own Google account to play with. After updating the numbers at the top it takes a while for all the values and the graph to update. To run a new simulation with the same numbers, just click an empty cell and press delete.

Now, testing with 10,000 audits is nice, but it doesn’t give the whole picture. I wanted a better idea of what could be expected long term, so I wrote a small Python script to check with 100 million audits.

from random import random

# parameters of the scoring formula
l = 0.999    # lambda: the "memory" of the score
w = 1        # weight of each audit
sv = 1       # value of a successful audit
fv = -20     # value of a failed audit (suggested; currently -1)
dq = 0.6     # disqualification threshold

loss = 0.02  # fraction of data lost by the simulated node

# start alpha (a) and beta (b) at the steady state of a node with a perfect score
a = 1/(1-l)
b = 0

for n in range(100000000):
    # each audit hits a lost piece with probability `loss`
    if random() < loss:
        v = fv
    else:
        v = sv
    a = l*a + (w*(1+v))/2
    b = l*b + (w*(1-v))/2
    score = a / (a+b)
    if score < dq:
        break

print("Score = {} after {} audits".format(score, n+1))

With 2% data loss the node tends to survive.

Score = 0.7858662201182666 after 100000000 audits

With 2.1% it generally gets disqualified at some point

Score = 0.5999645458940869 after 12795072 audits

And of course the higher the data loss the faster the node gets disqualified.

RECAP OF SUGGESTED CHANGES
So I suggest changing lambda to 0.999 and the audit failure value from -1 to -20.

So let’s take a look at my original issues with the current formula and see how the new formula deals with them:

  • The scores are extremely volatile and unpredictable
    Scores are now more stable and approach actual node quality over time
  • Disqualification is inconsistent and the timing is entirely determined by random “luck”
    Disqualification now happens consistently based on the amount of data lost, and the timeframes are much more consistent
  • A node operator might look at the score and see it increasing and assume issues are solved
    The score at any time gives a good indication of how the node is doing now. Momentary fluctuations will still happen, but deviations from the mean are small.
  • Nodes with over 10% of data lost are allowed to survive for very long periods of time (nearly indefinitely)
    Nodes with 3% or more data loss will now get disqualified promptly
  • After failed audits, the score recovers just as fast, so failures never really stick
    Recovering from a bad score now takes 20x as long as dropping the score on a failed audit does. As a result, failed audits are more sticky and issues are reflected in the score for longer. Therefore, intermittent problems don’t go unnoticed.

Other considerations:

  • Raising the lambda value causes scores to not drop as fast as they did before, which could lead to really bad nodes not being disqualified as fast
    While this is true, it doesn’t take a lot longer, because this effect is largely compensated by the much higher impact of audit failures on the score. It takes about twice as long on average for nodes with 40%+ data loss to be disqualified. But the tradeoff is that nodes with 3%+ data loss do get disqualified. This should lead to a more reliable node base on aggregate.
  • Implementing a change like this could disqualify many nodes in a short period of time.
    To compensate for this I recommend first introducing this change with an audit fail value of -5 and incrementally lowering it to -20 over time while keeping an eye on repair impact.

Bonus suggestion
Now that there is a more consistent representation of node data quality, we can actually do something new. We could mark the pieces of a node with an audit score below a warning threshold (for example 95%) as unhealthy, and at the same time lower the incoming traffic for that node (these nodes could be part of the node selection process for unvetted nodes). This will result in data on that node slowly being repaired to other nodes, potentially reducing and fixing the troublesome data, while at the same time lowering the risk of new data stored on that node getting lost. This will rebalance good and bad data on the node and will allow nodes that have resolved the underlying issue to slowly recover, while nodes that still have issues will keep dropping in score anyway and fail soon enough. This also provides an additional incentive to keep your node at higher audit scores: you keep all ingress and avoid losing data to repair.

An interesting read and I’m sure the data scientists at storjlabs have run many simulations already and will be able to provide more insight into their results (if they can take the time to share their thoughts).

About your proposed changes: This might work fine for bigger nodes but what about very young nodes that might still be in the vetting period and just had bad luck e.g. with a file corruption early on?

Young nodes will actually take a little longer to fail with these settings. And since they get a lot of incoming data compared to stored data, they are in the best position to overcome (outgrow) those issues. But in general these settings would be a little more forgiving early on due to a slower drop in scores.

looks like really great work, i think maybe the reason why it was made the way it was… in the past the disconnection of an hdd while the node was running was a possible event…

it doesn’t really seem a common event anymore with all the new failsafes, but the quick DQ of nodes would also make it possible that some errors which could have been solved might DQ a node…

not sure if that needs to be taken into account today… but maybe it was why it was random and over long periods…

i think maybe there should be some kind of time limit on just how fast a node could in theory be DQ… i mean your example of 2.1% data loss would mean that if that happened to my node, then it would be DQ on a busy day in maybe a few hours…
since it will at times do like 2000 or 2500 audits in a day, which i think is the highest i’ve seen.

just taking the highest activity day this month, i got 1426 audits in 24 hours…
so that’s 8 hours if we assume the 60% mark is about 500 failed audits…

so in theory couldn’t my node have some kind of issue that looked like lost data, which then would get it … Promptly DQ :smiley:

in less time than it might take me to actually get to the node.

aside from that concern, which i’m sure is a concern for all of us… it would be bad if good nodes got DQ by mistake.

it does seem like a much better solution… but i don’t think it’s possible to know what kind of errors and issues to expect in the future… this would also make it possible for the entire network to get DQ in a day, wouldn’t it?

i mean somebody found a way to push a % of data to the network which would end up looking like bad data … or whatever…

what are the security implications of that…

i think removing the flux is good, however the fact that there is a certain randomness to the timing of DQ of bad nodes might really not be a bug but more of a feature imho

it might react a bit too immediately. but i dunno…
i will give this a vote because i want to see the volatility of the audit score removed, but i think there might be issues with this approach of yours.

and it should be taken into account that an ISP may have a connection failure.
Maybe when you hit the DQ value you could get a message (email or dashboard) and still have time X to solve the problem (if possible) before you get DQ.

I said this in my previous comment, but I’ll repeat it again: it actually disqualifies nodes a little slower on catastrophic failure. With 100% audit failures the new formula would take about 40 audits to disqualify a node, while the old one did it in about 10. This is because of the higher “memory” (lambda). That scenario should really only occur if the data is removed though, since there are now checks in place to kill the node, or refuse to start it, when the data location isn’t available or doesn’t belong to this node. My main concern here would not be disqualifying nodes too fast, but not fast enough: such a node is pretty much guaranteed to be unrecoverable and we should get rid of it quickly. I think, however, that implementing my bonus suggestion would prevent these situations from causing more harm than the current system does, while still providing some time to fix a possible unforeseen issue. Either way, my suggestion would be an improvement on the current implementation for this concern.
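For anyone who wants to check that 40 vs. 10 figure, here is a minimal version of the same loop with 100% audit failures. I’m assuming the current lambda is 0.95 here (the current failure value is -1); the suggested settings are 0.999 and -20:

# audits needed to reach the DQ threshold when every audit fails,
# starting from a perfect long-run score
def audits_to_dq(lam, fv, w=1.0, dq=0.6):
    a, b = 1 / (1 - lam), 0.0
    n = 0
    while a / (a + b) >= dq:
        a = lam * a + w * (1 + fv) / 2
        b = lam * b + w * (1 - fv) / 2
        n += 1
    return n

print(audits_to_dq(0.95, -1))    # current settings (lambda assumed to be 0.95): 10
print(audits_to_dq(0.999, -20))  # suggested settings: 39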

Referring back to what I already said, the new suggestions would smooth out the curve and drop scores a little slower. This basically implements that time limit, while at the same time not allowing significant data loss to exist on the network long term. Your example of 2.1% would take a very long time to lead to disqualification. Here’s one simulation I ran.

Score = 0.5999645458940869 after 12795072 audits

That’s nearly 13 million audits. But even 3% data loss would always take at least 1000 audits to fail, and usually tens of thousands. When the data loss isn’t that large, there is a timeframe in which to recover from it in some way or another (running fsck to fix file system issues, for example).
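To put that in perspective, using the 1426 audits per day mentioned above (actual audit rates vary per node and will grow over time, so treat this as an order-of-magnitude illustration): 12,795,072 audits / 1,426 audits per day ≈ 8,970 days, or roughly 24 years. Even the 1000-audit floor for 3% loss would be the better part of a day at that rate, not a few hours.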

In short, this change will do two things with regards to disqualification:

  • Slow down when nodes are disqualified
  • Lower the threshold of allowed data loss

So while more nodes will be disqualified, it actually takes a little longer for high data loss cases to be disqualified. Feel free to use the sheet I linked to run your own comparisons and verify this with different scenarios. And keep in mind that it simulates random data, so you may need to run a few simulations to get a good feel for certain settings. If you’re still concerned, please provide a specific example that worries you, because I think so far these concerns are based on wrong assumptions. It would be better to discuss specifics.

I have no idea what scenario you imagine here, so please be a little more specific. But given the above, it is highly likely that any scenario that could lead to that with these new settings, would lead to that faster with the old settings.

The new suggestion is more strict, but less aggressive. I think that’s the best way to sum it up.

This is about audit reputation only. Downtime doesn’t factor into it; audit reputation only takes into account audits while your node is online. Uptime scores are a separate system.

This feedback really belongs in this thread, so I’m answering here.

As a special treat for you I ran that simulation on the original settings.


So yeah, that fits. The erratic nature of these settings causes fairly deep dips from time to time. But they’re infrequent and your node would indeed survive, as you’ve seen.
To verify that, I also ran 100 million simulated audits in Python.

Score = 0.9678485255896826 after 100000000 audits

No argument there. In fact, the new publicly available stats show that none of the satellites have ever lost a single segment. But being more strict, yet fair, with nodes gives more room to, for example, reduce the expansion factor, which with the new pricing might help lower the negative impact on payouts. It’s best to finely tune all aspects to get the most out of the entire system. Additionally, my primary goal was to get rid of the erratic and unpredictable nature of this score, because right now it really doesn’t tell the node operator how well the node is doing at all.


Is there a level of data loss that is considered acceptable though?

Triggering a full repair on a node with 95% good data might not make economic sense

You’ve got to remember that it could be possible for none of the ‘good’ data to need to be repaired. There could be enough pieces still in the network.

It basically comes down to statistics then. And remember, I’m not arguing that the current setup isn’t reliable enough. It definitely is.

But before I get into the numbers: people are bad at statistics and get hung up on “so it CAN happen?”. Any data center deals with chances of losing a file and Storj is no different. Eliminating those chances is impossible. What you can do is know the risk and lower it. So with that in mind, let’s get into it.

Example at a 5% loss:
The minimum availability of any segment on the satellites is 48 pieces, of which 29 need to be good (thanks to the new stats we now know the 48 number!).
So if 20 of those 48 pieces are bad, the segment is lost. We can now calculate the worst-case scenario where every node is missing 5% of its data:

(0.05^20)*(48 ncr 20) = 1.60E-13

So that’s an industry-leading twelve 9’s of reliability. This is clearly plenty, and it isn’t a realistic scenario anyway: if you allow 5%, very few nodes will ever actually have 5% loss and the vast majority will have 0% or near-0% loss. It would be awesome if Storj Labs could publish aggregated audit success rates in deciles/percentiles so we could get some insight into the spread, but they will have much better numbers than I can work with.

But if nodes were more reliable in general say 2% loss, they could choose to only trigger repair at 42 pieces left and actually increase reliability. This would require repair to trigger more often and reduce expansion factor by up to a third. Both of which save a lot of money. The tradeoff is that some additional nodes may be disqualified. But these numbers are likely low and I’m pretty sure it would be a net positive. It also allows them to send more of the income from customers to node operators because they have to do less repair and have to store fewer bytes per piece uploaded.
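For anyone who wants to play with these numbers, here is the same single-term estimate as a small Python helper (the function name is mine). It assumes the worst case above where every node misses the same fraction of data, and as far as I can tell it errs on the pessimistic side because it ignores the probability that the remaining pieces are all good:

from math import comb

# a segment is unrecoverable once (pieces - needed + 1) of its pieces are bad
def worst_case_segment_loss(pieces, needed, loss):
    bad = pieces - needed + 1
    return comb(pieces, bad) * loss ** bad

print(worst_case_segment_loss(48, 29, 0.05))  # ~1.6e-13, the number above
print(worst_case_segment_loss(42, 29, 0.02))  # repair at 42 pieces, 2% loss: ~8.7e-14

So even with the lower repair threshold and 2% allowed loss, this worst-case estimate comes out slightly better than the 5% example above.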

All of this was not even the main reason I posted this, though. You could easily change the audit fail value from -20 to -15 or -10 in my suggestion and keep the stability and predictability of the audit score, while allowing nodes with higher data loss percentages to survive. But since that comes at the cost of efficiency in the network, we all pay for allowing less reliable nodes. So I would prefer to be a little more strict.

This is some really excellent analysis!

To be honest, we haven’t been putting a lot of effort into evaluating audit scoring in the past year or so. It’s been “reliable enough” that we haven’t needed to go back and see how the evolving network fits with what we came up with originally.

I’m going to try and replicate your findings briefly, but if my results match yours I expect we’ll go ahead and make the suggested changes. Thank you for your very in-depth work!


Thanks for the response, that’s great news!
I agree that the current system has been good enough, especially for reliability. There was never any risk and that’s priority one.

Let me know if there is anything else I can do to help with this!


As I mentioned over in (broken link), I believe we’re going to go ahead and make your suggested changes, or something very close to them. I’ll update again when that happens.


Your link though shows this

Oh dang, I guess that was a DM discussion. I thought it was a regular forum thread. I guess I’m not very good at this.

The gist of the message was simply:

I reproduced your results and have been modeling some scenarios to show how the change will affect different hypothetical SNOs. I still have some final tuning to do, but the change should be merged soon.
