Tuning audit scoring

This is some really excellent analysis!

To be honest, we haven’t been putting a lot of effort into evaluating audit scoring in the past year or so. It’s been “reliable enough” that we haven’t needed to go back and see how the evolving network fits with what we came up with originally.

I’m going to try to replicate your findings briefly, and if my results match yours I expect we’ll go ahead and make the suggested changes. Thank you for your very in-depth work!

10 Likes

Thanks for the response, that’s great news!
I agree that the current system has been good enough, especially for reliability. There was never any risk and that’s priority one.

Let me know if there is anything else I can do to help with this!

1 Like

As I mentioned over in (broken link), I believe we’re going to go ahead and make your suggested changes, or something very close to it. I’ll update again when that happens.

3 Likes

Your link, though, shows this:

Oh dang, I guess that was a DM discussion. I thought it was a regular forum thread. I guess I’m not very good at this.

The gist of the message was simply:

I reproduced your results and have been modeling some scenarios to show how the change will affect different hypothetical SNOs. I still have some final tuning to do, but the change should be merged soon.

4 Likes

Coming back to this!

It’s been a long time and this has been through a lot of internal discussion. One of the first things I found was that it would be best for us to continue using the Beta Reputation Model as described here: https://www.cc.gatech.edu/fac/Charles.Isbell/classes/reading/papers/josang/JI2002-Bled.pdf . See also these papers describing how we are applying that model: Reputation Scoring Framework and Extending Ratios to Reputation. The benefit of sticking to this model, as explained to me by one of our data scientists, is that there is a solid mathematical underpinning to the reputations: a node’s reputation score is the chance that a random audit on that node will succeed, treating recent history as more relevant than distant history. This underpinning makes it easier to evaluate how well our parameters fit real life, and allows the evolution of reputation scores to be included more simply in larger mathematical models. Based on that, using a different adjustment value v for audit success versus failure was probably no longer an option.
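
For concreteness, the update rule we're talking about looks roughly like this (a rough sketch of the model as I understand it, not the satellite's actual code):

def update_reputation(alpha, beta, success, lam=0.95, v=1.0):
    # Decay the remembered history by lambda, then credit the weight v
    # (the same v for success and failure) to alpha or beta.
    alpha = lam * alpha + (v if success else 0.0)
    beta = lam * beta + (0.0 if success else v)
    return alpha, beta

def reputation_score(alpha, beta):
    # Estimated probability that the next audit succeeds, weighted
    # toward recent history.
    return alpha / (alpha + beta)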

Another feature of BrightSilence’s model was using a much larger initial value for alpha (1/(1-lambda) = 1000), rather than 1. This does have the effect of smoothing things out very considerably, but it would have been very difficult to apply to existing reputations.

I did some experimenting with other parameter sets, and decided that the main things we want to tune for are “how likely is disqualification for a node with acceptably low data loss” and “how likely is disqualification for a node with unacceptably high data loss”. To be clear, we want the former to be as low as feasible, and the latter to be as high as feasible. For the purposes of my calculations, I’ve used 2% data loss as “acceptably low” (this might be overly generous) and 4% data loss as “unacceptably high”.

I made a simulation tool to try out different sets of parameters within the Beta Reputation Model framework. You can find that as a Python script here, and some accumulated output from that script (as a CSV) here. Once we determined that we needed to keep the adjustments to alpha and beta as 1 and -1, I did some further investigation on the parameter sets that looked like the best fit, giving these final results: results of data loss versus DQ chance sim.csv. The last parameter set in that output is the one I like the best:

  1. grace period = 50 audits (nodes won’t be disqualified during the first 50 audits)
  2. lambda = 0.987 (the “forgetting factor”; raised from 0.95)
  3. DQ threshold = 0.89 (raised from 0.6)

With these parameters, a node with 2% data loss or less has a 0% chance of disqualification within 10,000 rounds. A node with 4% data loss has a 25% chance of DQ, and a node with 5% data loss has an 80% chance of DQ.

Compare that to the existing parameters, where (because of the overly large swings described by @BrightSilence above) a node with 2% data loss has a 1.6% chance of DQ, and (because of the overly low DQ threshold) a node with 5% data loss has a 4.3% chance of DQ.
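
If you want to reproduce numbers like these without digging through the full script, the Monte Carlo approach can be sketched like this (my illustration of the idea, not the actual simrep.py): each audit hits a lost piece with probability equal to the data-loss fraction, the reputation is updated, and we count how many runs cross the DQ threshold after the grace period.

import random

def dq_probability(loss, lam=0.987, threshold=0.89, grace=50,
                   rounds=10_000, runs=1_000, v=1.0):
    # Estimate the chance that a node with the given data-loss fraction
    # is disqualified within `rounds` audits under one parameter set.
    dq = 0
    for _ in range(runs):
        alpha, beta = 1.0, 0.0                 # initial reputation state
        for audit in range(rounds):
            ok = random.random() >= loss       # audit fails if it hits lost data
            alpha = lam * alpha + (v if ok else 0.0)
            beta = lam * beta + (0.0 if ok else v)
            if audit >= grace and alpha / (alpha + beta) < threshold:
                dq += 1
                break
    return dq / runs

# Expect dq_probability(0.02) to come out near 0 and dq_probability(0.05)
# to come out much higher, in line with the figures above.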

Therefore, what I propose now is making those changes (no DQ’s until 50 audits have completed, lambda = 0.987, and DQ threshold = 0.89). We’d make the lambda and grace period changes first, then wait for that to have an effect on scores before raising the DQ threshold.

Thoughts?

3 Likes

anything that reduces the random chance of DQ is good…
tho i think the whole concept of counting a certain number of audits for the grace period is a flawed perspective, because the more data a node has, the more audits it will get.

the proposed grace period would, in the case of my 16TB node, be equal to about 30 minutes.

i think the grace period would be best as a fixed time period, to ensure that the grace period has a similar effect for all node sizes.

will try to better understand the implications of the other suggested changes; i don't really feel i'm familiar enough with the concepts to judge them beyond what i already mentioned.

Thanks @thepaul for that great and extensive response and for sharing both your code and results!

I took some time to look through it all and play with the python script you provided. I’ll try to collect my initial thoughts here.

That is an absolutely fair criticism of my suggestion, and I agree that because of this, sticking with v=1 for both failure and success is not just preferable but critical, both for ongoing monitoring and for the resulting number to have any actual meaning. So, yes, no argument from me here. There are other, better ways to achieve our goals anyway.

I have to infer a little bit here why it is hard to adopt this. Is it because some of the scores may be quite low by chance under the current system, and it would take a really long time to recover from that with alpha at 1000? If so, I guess you could give node reputations a reset. The worst thing that could happen is that bad nodes stick around slightly longer, but the new parameters will still make quick work of them if they are actually bad.

This of course has to be priority 1 for the reputation system. Agreed. This needs to be ensured before we look at other criteria. But I don’t think we should lose sight of those. I’ll get back to that in a bit.

I really appreciate you sharing this! I'm already using your script to run more simulations.

I’m not entirely sure what this is for. By setting the initial score to 1, the beta formula already kind of has a grace period built in, because it assumes a prior history of good audits and slowly adjusts from there. I’m not against it, but it’s just not entirely clear to me why it is needed.

These are great results indeed!

So I think this is great and I’m pleased to see a lot of thought has gone into this subject. I want to briefly get back to the “issues” I listed in my original post and go through them.

  • The scores are extremely volatile and unpredictable
  • **Disqualification is inconsistent**
  • The timing of disqualification is entirely determined by random “luck”
  • A node operator might look at the score and see it increasing and assume issues are solved
  • **Nodes with over 10% of data lost are allowed to survive for very long periods of time (nearly indefinitely)**
  • **After failed audits, the score recovers just as fast, so failures never really stick**

I think the ones in bold (inconsistent disqualification, nodes with over 10% data loss surviving, and failures not sticking) are solved by the changes you propose. Though for the last one the changes don't necessarily give failures a much bigger impact on the score, the fact that the score dropped at all will now be visible for longer.

But I want to address the remaining ones as well.

  • The scores are extremely volatile and unpredictable
  • The timing of disqualification is entirely determined by random “luck”
  • A node operator might look at the score and see it increasing and assume issues are solved

I think I can address all three at once. I ran a quick simulation with your suggested parameters and 4% data loss.


I zoomed in on the relevant part of the graph, because the margin that tells the node operator whether there is an issue, and how bad that issue is, has now been reduced to 89-100%. As you can see, the score is unfortunately still all over the place within this margin, and there are moments (circled) when the node operator might think the problems are solved, even though 4% loss is enough that the node will almost certainly be disqualified eventually. The timing of that disqualification, however, is still very random.

So, what could be done about that? Well, the only way to stabilize the score is to increase lambda, giving the model more memory. In the end, what you really want to say is: if a node loses more than 4%, it's out! The only reason we're setting the threshold lower than 96% is to account for volatility. So I would suggest raising lambda to remove volatility, and raising the disqualification threshold even more because we don't need as much slack anymore. I ran the same simulation with lambda at 0.999 and the threshold at 95%.


Scores are more stable, disqualification is a little more predictable and a recovering score that shows more than 1% improvement almost certainly means the situation of the node has actually improved.
I used your script to run a few simulations with this:

./simrep.py -d 0.02 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.95 -a 1000 -b 0 -g 50
with 2.00% data loss, 0.00% of runs hit dq (after 0.00 rounds on avg)

./simrep.py -d 0.04 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.95 -a 1000 -b 0 -g 50
with 4.00% data loss, 36.40% of runs hit dq (after 6183.83 rounds on avg)

./simrep.py -d 0.05 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.95 -a 1000 -b 0 -g 50
with 5.00% data loss, 100.00% of runs hit dq (after 3035.49 rounds on avg)

Tune it even tighter by raising the threshold to 96% and you get exactly the result you wanted: no nodes disqualified at 2% data loss, all nodes disqualified at 4% data loss.

./simrep.py -d 0.02 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.96 -a 1000 -b 0 -g 50
with 2.00% data loss, 0.00% of runs hit dq (after 0.00 rounds on avg)

./simrep.py -d 0.03 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.96 -a 1000 -b 0 -g 50
with 3.00% data loss, 21.73% of runs hit dq (after 6275.94 rounds on avg)

./simrep.py -d 0.04 -r 10000 -x 3000 -l 0.999 -w 1 -q 0.96 -a 1000 -b 0 -g 50
with 4.00% data loss, 99.77% of runs hit dq (after 2906.48 rounds on avg)

Ok, I guess 99.77% isn’t all, but I hope you’ll forgive me that last 0.23%. :wink:

Of course, doing this would require addressing the issue you mentioned with implementing this on top of existing scores. Could you perhaps expand on that a little?
Additionally, it takes a few more audits to disqualify a node that lost all data, but as stated below… that may not be a bad thing.

Essentially, the balance between a high lambda and threshold versus a low lambda and threshold is one between consistency and speed to DQ. I don't know how important the speed to DQ is, but given that this adjustment would already ensure that far more of the less reliable nodes don't survive, I think it's probably OK to trade away a little speed to DQ.

There is one last concern I want to address.

What happens when temporary problems cause audit failures?

This may be especially relevant for the suspension score, which I assume would be aligned with the audit score system so as not to give malicious node operators an incentive to redirect known audit failures into unknown audit failures. Intermittent network issues or hanging systems can temporarily cause multiple audit failures in a row that don't reflect the durability of data on the node. Recently we've actually seen an increasing number of reports on the forum of nodes being suspended because of this.

With your suggested settings, it would take only 9 audits to be disqualified/suspended.


I took a quick look at my logs. That would give me just over 10 minutes to fix a temporary issue on my largest node for Saltlake.

2022-02-09T17:14:45.395Z        INFO    piecestore      download started        {"Piece ID": "YPQSHALIEQZEIRDCI2TAC4QFKAC4BR7X5NCAMYQ63AUZRVL5CPXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:14:45.576Z        INFO    piecestore      downloaded      {"Piece ID": "YPQSHALIEQZEIRDCI2TAC4QFKAC4BR7X5NCAMYQ63AUZRVL5CPXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:17:33.357Z        INFO    piecestore      download started        {"Piece ID": "4HGXC2QJBZZ4OF3XOZ2EOYOKJUF4X2WXW3DOVK3B5L2RQLM7O2KQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:17:33.524Z        INFO    piecestore      downloaded      {"Piece ID": "4HGXC2QJBZZ4OF3XOZ2EOYOKJUF4X2WXW3DOVK3B5L2RQLM7O2KQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:14.264Z        INFO    piecestore      download started        {"Piece ID": "NUP3WV4EHE2TKYRAHOMJNAYVJJLIAR4HDXY2DAXW4EGYO3FNB44A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:14.426Z        INFO    piecestore      downloaded      {"Piece ID": "NUP3WV4EHE2TKYRAHOMJNAYVJJLIAR4HDXY2DAXW4EGYO3FNB44A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:39.392Z        INFO    piecestore      download started        {"Piece ID": "3HKYXF5CHTH24JLBK6GDXE7C3Z6R6UBHLS6LMQ2E3XJW2E3RZFOQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:39.557Z        INFO    piecestore      downloaded      {"Piece ID": "3HKYXF5CHTH24JLBK6GDXE7C3Z6R6UBHLS6LMQ2E3XJW2E3RZFOQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:50.254Z        INFO    piecestore      download started        {"Piece ID": "QUTGWH4E7OLDF5WCYMJMCOVQS6QT5GRZEOKFHKAG4WZJLXPOOLKA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:18:50.433Z        INFO    piecestore      downloaded      {"Piece ID": "QUTGWH4E7OLDF5WCYMJMCOVQS6QT5GRZEOKFHKAG4WZJLXPOOLKA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:20:36.582Z        INFO    piecestore      download started        {"Piece ID": "MVK7W4PKFYJ7IGOKZ3MDY64I32TOGIF5CZO4IPKIWYLHX6YLHJFA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:20:36.762Z        INFO    piecestore      downloaded      {"Piece ID": "MVK7W4PKFYJ7IGOKZ3MDY64I32TOGIF5CZO4IPKIWYLHX6YLHJFA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:20:41.201Z        INFO    piecestore      download started        {"Piece ID": "3G5GZIVNQWRCYXS3ZOLGNNX5B5LP3RRG4BK75CLS64RKGSL2JUXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:20:41.357Z        INFO    piecestore      downloaded      {"Piece ID": "3G5GZIVNQWRCYXS3ZOLGNNX5B5LP3RRG4BK75CLS64RKGSL2JUXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:23:41.067Z        INFO    piecestore      download started        {"Piece ID": "4AHKUKGZ2OBELZRHQB6UPDF5K2OJZSAAZDPEJGRJHX22L57FKJKQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:23:41.260Z        INFO    piecestore      downloaded      {"Piece ID": "4AHKUKGZ2OBELZRHQB6UPDF5K2OJZSAAZDPEJGRJHX22L57FKJKQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:25:06.862Z        INFO    piecestore      download started        {"Piece ID": "MZ4VDL6NLGOVQ27ENNPHPUTNZAREL3OACRGGLTK3MZRM45NKEAYA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2022-02-09T17:25:07.240Z        INFO    piecestore      downloaded      {"Piece ID": "MZ4VDL6NLGOVQ27ENNPHPUTNZAREL3OACRGGLTK3MZRM45NKEAYA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}

Raising lambda would help with this, but I don't think it solves the core issue. Implementing 0.999 would raise the number of sequential failed audits needed to 52 with a threshold of 95%, or 41 with a threshold of 96%. On my node that would at best give me an hour to fix things. And depending on when such an issue occurs, I would most likely not be in time to fix it.
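
For reference, these counts can be derived with a quick back-of-the-envelope calculation (assuming a mature node at the perfect steady state, alpha = 1/(1-lambda) and beta = 0, in which case the score after n consecutive failures is simply lambda^n):

import math

def failures_to_dq(lam, threshold):
    # Smallest n such that lambda**n drops below the DQ threshold.
    return math.ceil(math.log(threshold) / math.log(lam))

print(failures_to_dq(0.95, 0.60))   # 10 -> current settings
print(failures_to_dq(0.987, 0.89))  # 9  -> @thepaul's proposed settings
print(failures_to_dq(0.999, 0.95))  # 52
print(failures_to_dq(0.999, 0.96))  # 41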

The problem is not the number of audits required to DQ, but that the limit isn't based on time instead. 52 audits is perhaps too long for new nodes, which can take weeks to get to that number, but way too short for larger nodes like mine. I don't yet have a suggestion on how to solve this, and perhaps it's a separate problem that requires a separate solution, but I wanted to mention it since it seems your suggested numbers actually make that slightly worse, rather than better.

Thanks again for picking this subject up though @thepaul, looking forward to your response!

5 Likes

as usual bright makes a very well considered argument, and one that i can only agree with.

the whole point of this was to smooth out the audit score, so that nodes don't get randomly DQ'd for no good reason, and increasing lambda to 0.999 seems to do just that.

there will ofc always be some flux in something that has an inherent random nature (the audits).
and even if it requires a reset of all audit scores / reputations, the damage from that should be minimal, especially considering that bright's previous simulations of the audit / DQ system showed it was possible for nodes with 12% data loss to survive.

implementing a new configuration of the Beta Reputation Model that rapidly DQ's nodes with more than 4% data loss would certainly make up for a reputation reset long term.

one thing that comes to mind tho…

what would be the effect if that many nodes are DQ'd in such a short time…?

i mean, there could be 1000 nodes with 12% data loss in the network, and DQ'ing all of those in a short time might lead to data loss for the network.

i would suggest easing into a changed lambda, or at least making sure the network can handle the possibly very big change in node numbers.

(thinking a bit more about that… maybe the proper way is to increase the DQ threshold over time to avoid a shock to the network.
then one could reset the reputations / adjust lambda, and raise the DQ threshold afterwards in steps)

so yeah long story short… i would also want to see lambda at 0.999 to decrease volatility.

1 Like

Great points! Thank you for the swift response.

This is a good point; however, this period would only apply to brand new nodes, not even vetted yet, which presumably don't have much data stored on them yet, so audits will be a lot slower in coming!

Yeah, current scores in terms of alpha and beta would be quite low. It’s been a few months since I really looked at this, so I hope you’ll forgive me not remembering exactly where the math was. I couldn’t figure out a way to adapt existing scores. It’s not exactly clear how reputations work when the initial values are that high. Normally, the sum of alpha and beta converges upward toward 1/(1-lambda), which is 20 right now, so the highest alphas are near 20. When alpha starts higher than 1/(1-lambda), I’m not sure what happens. Probably (alpha+beta) diverges instead, which seems like it would have some important connotations but I’m not sure what they are yet.

Resetting scores is an option, like you said, but I felt it was kind of unfair to people who have built up a strong reputation already. If the community feels otherwise, maybe that’s an option if we figure out how the math works.

I admit I don’t have a solid reason for it in terms of the mathematical model. Only what I’ve found through simulation: nodes are a lot more likely to be DQ’d unfairly during the first several rounds, while (alpha+beta) is still low compared to 1/(1-lambda). I think this is an intentional feature of the system; audit failures early on are a legitimately bad sign, but when the goal is to minimize the chance of “unfair” DQ’s, the grace period seems to be particularly helpful.

Regarding volatile reputation scores: I’m less convinced now that this is a problem. Downward spikes, of course, only happen when an audit has legitimately failed, and on a well-behaved node this should happen less than 1% of the time. Having the downward spike be large might well be required by the nature of the problem. The upward spikes are not as steep, but if you zoom out far enough then anything with lots of adjustments to it will look pretty volatile, unless it converges to the right answer so slowly that it’s not helpful to us.

In the same vein, I don’t know if we can avoid having the timing of DQ be dictated by random luck. When you have some data loss, the system can’t possibly know or react to that fact until it tries to audit a damaged piece, and the timing of that is all luck.

Your results look promising to me and I would support adopting them if we can figure out whether they still represent the beta reputation model.

9 audits to be disqualified certainly does sound harsh. On the other hand, in most temporary failure scenarios I can think of (disk not responding, disk too slow, configuration error pointing at wrong directory or filesystem mount, bad permissions, filesystem has gone readonly, etc) the node can detect the problem and shut itself down so that the node can be subject to suspension instead of disqualification. A node can recover from suspension, so having that be the penalty for being offline that long seems pretty fair. But if there’s another specific failure scenario that we should be considering, let’s definitely talk about that.

Lambda at 0.999 isn't quite the whole story: if I'm reading right, part of that smoothness comes from initializing alpha to 1000. I'm not sure whether that breaks the beta reputation model or not. I have some more research to do, I guess…

2 Likes

As far as I know, if the disk stops responding (or the kernel IO system freezes), the node will keep running, timing out all requests and, with the current system, be disqualified within 4 hours.
I think that 4 hours is way too short a time for this, and reducing it even more is wrong.
Unless the node CAN shut down when it gets multiple IO timeouts in a row; but then, if 8 failed audits lead to DQ, the node has to pretty much shut down after the first IO timeout (so that the node operator would have 7 attempts, though more likely 3, to restart it).

4 Likes

yeah, changing these kinds of things is / can be very dangerous if done without enough understanding of the future behavior.

you say it's unlikely to affect older nodes… but i have a node on which i lost a few minutes' worth of files because i accidentally ran rsync --delete instead of just rsync after a migration.
i know exactly how much ingress time was lost, and still the score was all over the place.

that node's audit score, tho stable, would until recently vary by about 5%…
it's been a while since i checked how much flux it has now… but i will…
and that is a node storing millions of files, and it's not like it's unheard of for nodes to get DQ'd by random chance… and that's where my problem lies with the current audit / DQ configuration.

it jumps around, and really that is just a math thing, there is no good reason it should do that…

there is nothing wrong with a node getting DQ'd fast, but seeing as DQ is a hard limit once the node hits it… having the score jump up and down is just unacceptable…

and for those of us with nodes that are years old, it is downright torture to watch…
granted, i don't have that problem, but i would like to know that some day in the future my node won't just end up being DQ'd because of a number that doesn't truly reflect the state of the data.

If this is the case, it can be fixed fairly easily. There’s a regular “writability check” that the node does to make sure that its data mount point is still present and writable. If the node fails that check, it will shut down. If that doesn’t already have a timeout, we can add one.

I'm not following this: are you responding to something I said? The only thing I said about older nodes was that the grace period would not apply to them.

I should really emphasize, though, that the reputation score is, by absolute necessity, probabilistic, and it has to swing both above and below the true data-loss measure. If it doesn't dip fairly quickly in response to failed audits, then bad nodes will affect the network for too long.

1 Like

I think I can respond to all of these by saying that I may have made the wrong assumption that alpha would be initialized at 1/(1-lambda). If that is not the case, there may indeed be some extra considerations. With alpha initially at 1 you do get some volatility early on, but it still stabilizes quite quickly. Here's an example with 2% data loss.

You can see that the first hundred audits or so result in quite volatile score changes, in this case leading to disqualification of a node with 2% loss. However, this would happen while the node is still in vetting, and perhaps some additional scrutiny is warranted during that time anyway.
That said, I would argue that this initialization at 1000 isn't essential; leaving it out just makes the first few hundred audits a little less stable. By the time nodes get vetted, the score will have stabilized enough to be a reliable indicator. And fluctuations during vetting can easily be explained by saying that the score doesn't have enough data to stabilize yet, which is exactly what the vetting period is for.

Any initialization of alpha is essentially giving the model a history that isn’t there yet. In this case even setting the value at 1 is akin to saying the node had 1 successful audit. Setting it to 1 is just a way of avoiding the initial division by 0 if you set it to 0.
Initializing both alpha and beta can be done with prior data if available as well. Say you know a node has had 100 successful audits and 2 failed ones. You could set alpha and beta to those values to seed the model with “perfect memory” of those stats. I don't believe doing something like that breaks the model in any way.
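
As a tiny illustration of that seeding (hypothetical numbers, just to make the point):

# Seed the model with a known prior history: 100 successes, 2 failures.
alpha, beta = 100.0, 2.0
score = alpha / (alpha + beta)   # ~0.98, consistent with that history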

I think this one was referring to the lambda change specifically. This change just reflects giving the model more memory. Instead of mostly counting recent audits, you look a bit further back: essentially you're using a larger set of audit results to determine reliability. The number still represents the same estimate, just based on a longer period of time. As far as I'm aware, that is an entirely valid way to tune the beta reputation model.

Having built up a strong reputation is relative. If you look at the current model, the strongest reputation you can have is an alpha of 20 and a beta of 0. Nodes that come out of vetting after just 100 audits already have an alpha of 19.88 and a beta of 0, as long as they haven't failed any audits. So really that reputation isn't all that meaningful. On good nodes, that would be a valid starting point to begin building an actually strong reputation with a higher lambda.
The problem comes in with nodes that are now looking pretty bad just because of bad luck. Here's an example of a worst-case-scenario node with 2% data loss, but due to bad luck having an alpha of 12 and a beta of 8 (so a score of 0.60, the current DQ threshold).

The score still fairly quickly adjusts to where it should be considering the 2% data loss. But in this example it takes about 200 audits to get above the threshold and about 400 total to get close to where the score should be. (rough estimates)

So, you could raise lambda, wait a few months and then raise the threshold. Or raise the threshold gradually over time. Or raise both gradually over time to compensate. This would all depend on how much of a risk you would be taking when raising just the lambda and not the threshold for a period of time.
If it were up to me, I would probably choose to raise lambda, use aggregate node stats to see when raising the threshold is reasonable, and slowly raise it over time. If you were to raise the threshold by 10 percentage points per month, there isn't a vetted node that would get into trouble because of that unless it had more data loss than is acceptable.

So, two remarks on this. The volatile scores are a problem for node operators who try to figure out what is wrong with their node and use the scores to gauge whether they fixed an underlying issue. At the moment, and to a lesser extent with your suggested changes, it is entirely possible that they see the score increase and think everything is now fixed, only to see it plummet afterwards; or worse, they stop paying attention and are surprised by disqualification later.
As for the spikes downward, they aren't caused by a single audit, but by several failed audits that are less spread out than they are on average. They are essentially noise: while the score should converge to the actual failure rate, the lack of memory means it's too “distracted” by recent audits to actually settle there.

I agree that those should be solved separately, though I'm not confident all of them have been accounted for, specifically a mount point disappearing while the node is running. I remember there is a periodic check for this, but as I previously mentioned, those 9 audits would happen in about 10 minutes on the largest satellite on my node. So unless that periodic check happens more frequently than that, it could be too late to prevent disqualification. As for suspension… if you can get into suspension quickly but also recover from it fairly quickly, I would personally be fine with that, though I think you would also have to factor in the cost, since it would still mark pieces as unhealthy and trigger repair.
I do think it’s vital to align how suspension and audit scores work, both to avoid abuse and to just make things simpler to understand for node operators.

Technically @thepaul's suggestion would make it worse… but only by one audit. Currently DQ takes at least 10 failures in a row from a perfect reputation; with the changes that would be 9. The chance that your node's survival hinges on that 10th one is pretty small. :slight_smile: It's just something I wanted to mention, not necessarily a new issue with the newly suggested changes.

Agreed; are there any guidelines as to what would be considered too long for this, though?

PS: I feel like I should point out that values between 0.987 and 0.999 exist. It's not an either/or situation; a compromise is possible to find the best of both worlds. I'm just using 0.999 as an example, and I quite like it because with that setting you don't seem to see random swings larger than 1%. That gives us a clear number to point at when trying to improve a node: any improvement larger than 1% probably means you did something right. That would be slightly harder to explain if you had to say 1.5%, for example.

1 Like

I want to clarify something before this discussion derails due to confusion over terminology.

There are currently 3 different reputation-related scores:

Audit score

  • Drops when known failures happen
  • Examples: node responds to an audit with “I don’t have that piece”, or responds with corrupt or wrong data
  • When below threshold: permanent disqualification

Suspension score

  • Drops when unknown failures happen
  • Examples: node responds with a connection error or other unexpected error; basically any error that isn’t missing or corrupt data
  • When below threshold: temporary suspension

Online score

  • Drops when the node is offline
  • Examples: an offline node doesn’t respond to the audit at all
  • When below threshold: temporary suspension

So far we've mostly been talking about the first one, but @CutieePie, your concerns would almost all relate to the second and third.

While the suspension score mechanism should probably be updated as well, the fact that it would only cause temporary suspension means you can easily recover from it once the problems are solved.

From what I can tell, most of the temporary issues that could lead to audit scores dropping or disqualification are now being detected and I rarely see people complaining about disqualification without data loss. It doesn’t help that some people say they’ve been disqualified when they have actually been suspended.

So while I share your concern to some extent, I don't want @thepaul's suggestions to be misrepresented as worse just because they don't fix a pre-existing concern that really wouldn't change much if these suggestions are implemented.

As for whether this is important now: well, I brought it up, so clearly I think it is. Inconsistent punishment of nodes based on random scores is demoralizing and discouraging, especially if those same inconsistencies let other nodes in much worse shape slip through the cracks. Furthermore, a more reliable base of nodes leads to less overhead, lower costs, and perhaps to us reliable node operators being able to get a bigger slice of the pie. Making the whole system more reliable and efficient is good for everyone involved.

3 Likes

I don’t think this is quite accurate. It isn’t as straightforward as “alpha = number of successes, beta = number of failures”. A failure changes both alpha and beta, and so does a success. You are correct that giving a higher initial alpha is seeding a positive history, but within the model as we’ve been using it, (alpha+beta) converges up toward 1/(1-lambda) as time goes on. The closer alpha is to 1/(1-lambda), the stronger the reputation history. But when alpha or beta is already equal to or greater than 1/(1-lambda), I don’t know if it can still converge toward that point anymore. I should really just find all the numbers and graph it somewhere, but I’m out of time tonight. Maybe I can do that tomorrow.

I’m not sure that follows. 19.88 looks close to 20 on a linear scale, but that doesn’t mean there’s no relevance to the difference. The difference between 19.9 and 19.99 is somewhat like the difference between “2 nines” and “3 nines” of availability, because alpha should constantly get closer to (but never actually reach) 20 as the history grows. (In practice, it probably can reach 20, because floating point values can only handle so much precision :smiley:)

Good idea! It wouldn’t have to be raised all at once.

The writability check happens every 5 minutes by default, and the readability check (making sure the directory has been correctly initialized as a storj data directory) happens every 1 minute. The readability check is the one that would help in that situation, so that’s good.

A really good point. It's been too long since I looked at the model assumptions for how long bad nodes can remain online. I'll try to find that. (The person who would normally know all of that, and who wrote the beta reputation model adaptation for Storj, left recently, so we're trying to keep up without them.)

Certainly true. And 0.999 is fine with me, as long as we make sure bad actors are DQ'd quickly enough (whatever that means).

It isn't exactly, but it's quite close. The only difference is that alpha and beta also “forget” history. Let's take the current settings, lambda at 0.95, and a perfect long-term node, so alpha at 20 and beta at 0.

What essentially happens when updating both alpha and beta is that the formula “forgets” a fraction 1-lambda (5%) of the historic count, dropping alpha to 19, and then adds the new value of 1. The only reason it stays at 20 is that at that level alpha * (1 - lambda) = 1, so you basically forget 1 and add 1 at that point.

The same operation happens for beta, but with beta being 0 it ends up being 0 * lambda + 0. Which isn’t all that interesting.

So you could see alpha as the number of successes it remembers and beta as the number of failures it remembers. And both are always updated because every new signal makes it forget part of the older signals.

It will. Going by the same numbers used above, say alpha somehow got to be 100 and beta 0. Because the model still forgets 5%, the next successful audit would drop alpha to 95 and then add 1 for the new success, leaving it at 96 in total. It would again converge toward 20 after enough audits.
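
A quick numerical check of that (just a sketch with the same numbers):

lam, alpha = 0.95, 100.0
for i in range(1, 201):
    alpha = lam * alpha + 1.0            # another successful audit
    if i in (1, 10, 50, 100, 200):
        print(i, round(alpha, 2))        # roughly 96.0, 67.9, 26.2, 20.5, 20.0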

I used just alpha as an example for these because it is easier to see, but this would also apply to the total of alpha + beta if there is a mix of signals. Because it loses 5% of the total preexisting values and always adds 1 to one or the other.

While you could technically say a node at 19.99 is a lot more reliable than a node at 19.88, it doesn't really help you at all. This is because, with the current numbers, alpha will drop by 5% on a failure, which is essentially a drop of about 1 either way. Since that difference is so minute, the much more reliable node would probably still be disqualified after exactly as many failed audits as the less reliable one. Being close to 20 makes it a lot harder to go up even higher, but barely has an impact on how fast it will go down. (On the other end of the scale, when you're near 0 it's very hard to go further down, but easy to go back up.)

Awesome, well that should cover it for now as long as I don’t get 10x as much data. But since this node already holds 16TB, that would be a great problem to have actually. :grin:

Oof yeah, I know how that feels. Seems like you’re managing quite well though.

I remember reading about someone getting disqualified in 4 hours because the system froze and the node would initiate a transfer, but then time out.

Returning corrupt data, or returning “file not found” with a separate check that the drive is still plugged in, usually means the node actually lost data. However, a timeout does not really mean that. The node could be overloaded or frozen.

That would be nice, though I do not know if it would work in that particular instance. AFAIK, part of the kernel froze and anything trying to access the disk would just sit in D state forever. I don't know if a timeout for that would work. If it would, that would be great, since IIRC that particular node was disqualified in a few hours for failing audits (4 audit timeouts → 1 failure, and in 4 hours the node got enough failures to be disqualified).

I am talking about this incident:

Yeah I do remember that. I actually ended up writing a blueprint after workshopping a possible solution for that with @Alexey in one of the topics where that happened. There’s also a suggestion for that here: Refining Audit Containment to Prevent Node Disqualification for Being Temporarily Unresponsive

I don't think that can really be fixed by tuning the scoring alone, though, and the suggested changes shouldn't really impact this (@thepaul's numbers just drop the number of audits from 10 to 9; my suggested numbers would increase it, but not enough to actually fix it).

The suggestion I posted above could help fix that.
Alternatively, you could set a lambda value per node based on how much data it has, so that lambda = 1 - (1/(number of audits for the node in the past 48 hours)), with a minimum of 0.999 or one of the other suggested numbers. This would essentially translate it into a time-based memory, but I don't know if this data is easily available to the process that updates the scores.
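
Something along these lines (a rough sketch of the idea; audits_last_48h is a hypothetical input, not an existing field):

def node_lambda(audits_last_48h, floor=0.999):
    # Derive a per-node lambda from the recent audit rate so the reputation
    # memory spans roughly the same wall-clock time for every node.
    if audits_last_48h <= 0:
        return floor
    return max(1.0 - 1.0 / audits_last_48h, floor)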