Tuning audit scoring

Well, a month was just a suggestion, but I can elaborate a little on my thinking behind it. Younger nodes without much data don’t get that many audits, and if it also takes a few days to diagnose and fix the problem, there may not be enough audits / repairs to recover the score above the threshold, especially given that it drops a lot faster than it recovers. On the other hand, while the node is suspended the data is already protected, so there isn’t really a need to rush permanent DQ. So I figured a longer period would spare SNOs from losing their node in those situations and spare the support staff from having to deal with the tickets about disqualified nodes that will inevitably come. There’s a balance to be found here.

I would say this is again a good reason not to use too short a period. It also gives you time to fix things software-wise if necessary, without permanent impact and the need to manually deal with loads of support tickets.
This is also where data could prove valuable. How often do nodes spend more than 30 days in suspension? And those that do, do they ever tend to come back from it now? Because long suspension seems costly if you’re still paying them.
And of course there is that elephant in the room, which doesn’t need to be repeated. But this could (and probably eventually will) include nodes that really SHOULD be disqualified for other reasons, despite returning only unknown errors.

All of those options sound reasonable to me. But from user reports here on the forum I can tell you that loss of ingress and data loss to repair are already acting as good incentives for SNOs to want to get out of that state. Additionally, the fear of permanently losing a node that built up reputation and data over time works as a very strong motivator as well.
If I could pick, I would prefer avoiding egress by excluding these nodes from node selection rather than letting egress happen but not paying for it. That seems a bit more fair and also eases the load on nodes that may have gotten into trouble because of too high a load to begin with. I also think it’s quite fair not to pay for storage on a node that has not been reliable; it wouldn’t make much sense to pay for storing data on a node that has shown itself to be unreliable and is actively causing repair costs on the network. This would also perhaps help offset the additional repair cost this triggers on your end.

1 Like

What I was thinking of is a table of audit/repair/suspension/disqualification/becoming vetted events with the following fields:

  • timestamp (accuracy of 5 mins should be enough for this purpose),
  • some anonymized node identifier,
  • satellite,
  • event type: suspension, un-suspension, disqualification, audit/repair that counts as successful, audit/repair that counts as failed, audit/repair when node was offline, or node graduating to a vetted status.

from some reasonable time window; maybe half a year would be enough? Plus some indication of the node’s age/maturity if the node is older than the beginning of the selected time window.

Though, probably even just successful/failed/offline audits from a single satellite for a subset of nodes would be enough for a basic analysis.
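
Just to make the shape concrete, here is a minimal sketch of what one such event record could look like (hypothetical names, purely illustrative; nothing like this exists in the code today):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class EventType(Enum):
    SUSPENSION = "suspension"
    UNSUSPENSION = "unsuspension"
    DISQUALIFICATION = "disqualification"
    AUDIT_SUCCESS = "audit_success"   # audit/repair that counts as successful
    AUDIT_FAILURE = "audit_failure"   # audit/repair that counts as failed
    AUDIT_OFFLINE = "audit_offline"   # audit/repair while the node was offline
    VETTED = "vetted"                 # node graduating to vetted status

@dataclass
class ReputationEvent:
    timestamp: datetime   # rounded to 5-minute buckets is accurate enough
    node_id: str          # anonymized node identifier (e.g. a salted hash)
    satellite: str
    event_type: EventType
```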

2 Likes

Thanks, this is a good list and a good place to start. Unfortunately, this information is not saved right now; it can be derived from metrics and log output, but it is not directly available in this form.

But I think it would be great to have this (and expose it to the node operators), will think about whether it can be implemented in an easy way…

2 Likes

As far as I know the suspension currently ends with DQ after one week (at least this is what I have seen in the code, didn’t check the pod config yet).

But yeah, it’s not easy: the period should be long enough to give enough time to fix the problems, and short enough to prevent cheaters from keeping a node suspended without good reason.

Personally, I think the one week is good enough, but we should improve the notification system…

3 Likes

Back to here. There is a clear proposal to change the beta parameters and a lot of ideas about how the overall reputation system can be improved, but that’s a bigger step/task.

So I think the proposal here is to update the numbers first, and in the next step improve the other parts of the reputation system (if it makes sense…).

1 Like

Going by conversations on the forum it seemed that this was not yet in place. But maybe I’m remembering this wrong. If it is already in place, I see no reason to deviate from what already exists.

+1 from me on that. :slight_smile:

2 Likes

Yeah, it sounds reasonable to update the numbers first and then move on to larger changes. I think I addressed all relevant comments from my end in my initial response here.

In summary, in my opinion the suggested numbers would be an improvement, but there are better settings to go with. @thepaul mentioned some issues with transitioning to those settings, and I’m not certain those have been resolved.

So in short, I would prefer lambda = 0.999 and threshold = 0.95 or 0.96 to smooth out the scores. But the proposed changes by @thepaul would already be an improvement, so if that’s the best we can get, it’s better than what we have now even if it doesn’t get rid of the volatility.

But I would like to reiterate that this volatility was one of the major concerns from the SNO community, and the solution seems to shift more towards protecting the network from bad nodes. I understand that has priority, but I don’t want it to overshadow the SNO concerns raised here. Even with the proposed changes, the remaining volatility could still mislead node operators who are trying to fix issues, by allowing scores to recover significantly even when nothing has been resolved.

1 Like

Sorry for the dead air from my end. I’m still wanting to find out what rate of data loss our model can tolerate before we make a decision on the ideal lambda and disqualification threshold. Give us just a little more time to make that determination. I need to loop in some additional people.

5 Likes

It’s very understandable that it takes time; better safe than sorry in this case, as it could bring down the entire network…
Maybe some sort of simulation, or actual tests on the test net, might be the way to go…
Math quickly gets a life of its own, especially when it interacts with the real world.

There is no rush to get this fixed, it’s been running like this for ages… so it’s fine, but a solution would be better of course :smiley: when it’s theoretically and verifiably good.

2 Likes

I’m back!

After re-reviewing this whole thread, I’ve re-run the simulations with two significant changes and one added evaluation point:

  1. It’s fine to start with an alpha >= 1/(1-lambda).
  2. It’s fine to reset all node reputations as a one-time change.
  3. An added point of evaluation is “how long does it take to get from a perfect score to a DQ when (apparent) data loss is suddenly 100%?” We want this to be larger than 10 audits, but probably less than hundreds of audits.

Given those, I don’t see any set of parameters that does significantly better than @BrightSilence’s:

  • lambda=0.999
  • DQ threshold=0.96
  • initial alpha=1000

With those parameters, it takes around 40 consecutive failed audits to go from a perfect score to a DQ. On a busy node, that still is not a very significant amount of time, but it is at least several times larger than what it was. If we have the writeability check timeout added as well, this seems like it could be an acceptable situation.
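
For anyone who wants to sanity-check the “around 40” figure, here is a minimal sketch, assuming the usual α/β reputation update (α ← λ·α + (1+v)/2, β ← λ·β + (1−v)/2 with v = +1 on success and −1 on failure, score = α/(α+β)):

```python
# Sketch: count consecutive failed audits from a "perfect" score to DQ,
# assuming the alpha/beta update above with audit weight 1.
LAMBDA = 0.999
DQ_THRESHOLD = 0.96
alpha, beta = 1000.0, 0.0   # proposed initial values; 1000 = 1/(1-lambda)

failures = 0
while alpha / (alpha + beta) > DQ_THRESHOLD:
    # failed audit: v = -1, so alpha gets no credit and beta gets +1
    alpha = LAMBDA * alpha
    beta = LAMBDA * beta + 1.0
    failures += 1

print(failures)  # prints 41 with these parameters
```

Since the proposed initial α equals 1/(1−λ), the score from a perfect start is exactly λⁿ after n consecutive failures, so the threshold is crossed at n ≈ ln(0.96)/ln(0.999) ≈ 41.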

The grace period I was suggesting (no disqualifications before 50 audits have been completed) no longer makes a significant difference with this high initial alpha, so I’m taking that out.

So the new change plan is:

  1. Change lambda to 0.999
  2. Use alpha=1000, beta=0 for new nodes
  3. Reset all node reputations to alpha=1000, beta=0
  4. Change the DQ threshold to 0.96

I think we can even do all of these at the same time. What does everyone think?

3 Likes

With the new numbers, how quickly (in hours) can a node be disqualified with no real loss of data, but with apparent loss of data (frozen IO subsystem, overload leading to timeouts etc)?

1 Like

This change isn’t really a fix for the issue of nodes getting DQ’d for system issues.
Rather, it ensures that in the case of minor data loss the audit score fluctuation is greatly reduced, to avoid near-random DQ in such cases.

I think the points are best described by Bright in the post below.

I still think we should have some sort of local fuse-type trigger which shuts down a node whose audit score starts to drop rapidly, but that is really a local thing rather than a network thing.

Of course, with a too effective fuse, nodes will never get DQ’d, which is partly why I never bothered making a node fuse script, as it would be a bad feature to introduce, even though it’s very possible to script something like that.

I think it would be great if there was a fuse in the node itself… so that one got like 3 tries to fix the issue before the node died…

Say the node fuse feature would take a node offline after a rapid, unexpected 10% drop in audit score.

But it will change the situation for people affected by problems described by @Pentium100. I believe it would be worth checking as well.

Also, in case of real problems, it would be nice to know how soon after fixing the problem the SNO will get feedback that it was a correct fix.

The change now suggested is, as far as I can tell, almost exactly what @BrightSilence suggested and gave his reasons for in the post I just linked.

It improves the node audit behavior in case of failed audits. If you have experienced data loss on a node you will know that the audit score jumps around… not a little but a lot, and if it ever touches 60% the node will be DQ’d instantly.

This change makes the audit score less volatile, so it won’t just go back to 100% and then drop to 90%, only to go back to 100% maybe an hour later.
And it also, as thepaul says, increases the number of failed audits required for DQ, which in the current system can be something like 9 audits.
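
To put rough numbers on that (assuming the score decays roughly as λⁿ from a perfect start under consecutive failures, and that the current λ is 0.95):

$$
n \approx \frac{\ln(\text{threshold})}{\ln\lambda}, \qquad \frac{\ln 0.6}{\ln 0.95} \approx 10 \ \text{(current)}, \qquad \frac{\ln 0.96}{\ln 0.999} \approx 41 \ \text{(proposed)}
$$

so roughly four times as many consecutive failures before DQ.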

So really, no matter which viewpoint one has, this change is a big improvement in all cases…

The only nodes that will suffer from this new change are the ones that have been lucky enough to survive with higher than 4-5% data loss.

I have also had a lot of weird things happen with my storage over the few years I’ve been running storage nodes, and even with loss of contact with the storage, even if the node doesn’t shut down, it doesn’t seem to affect the audit score at all…

I’ve had nodes run for 4-6 hours without any contact with the storage media, because it was stalled out due to overload, and barely even saw a drop in suspension score from it.
So I don’t know how real the problem of unfair DQ actually is, outside of the ones that happen due to audit volatility.

@thepaul
I think it looks great, couldn’t have hoped for better.
Also pleased that this hasn’t been rushed…
It will be interesting to see what Bright says… but as it’s basically his suggestion, I doubt he will disagree with this change.
But I might be missing something, I haven’t exactly done the deep dive he did into this.

Yeah, damaged or corrupted files aren’t really forgiven… it took like 18 months before a node I lost a few files on stopped randomly dropping to a 95% audit score.
I ran the node for 2 minutes and then did an rsync without removing the --delete parameter, so the newly uploaded files were removed…

But it has now returned to 100% AFAIK… I rarely check on it… so it might drop for short periods… however, it’s been a long time since I noticed it.

Yeah, this is true, but then we get into some of the subjects discussed earlier in this thread; the current algorithm isn’t one that can be easily changed, I forget the exact reasons.

So though the parameters can be changed, the method remains the same, and sadly that method does make it so that larger nodes will be DQ’d faster due to more audits in less time.

Yet like I said, I’ve had plenty of system issues and have also dared to experiment at times, letting my 17TB node sit for hours to see if it would actually recover… but it never actually failed an audit because of it.

So I’m not convinced it’s a real problem.
It could simply be that those that actually get DQ’d had bad storage and just think they were unfairly DQ’d.

If there is one thing I’ve found while running larger storage setups, it’s how unreliable and completely random HDDs can be.
So in cases of unreliable storage behavior it’s possible that people might see DQ without the disk actually being broken, due to, let’s say, a bad cable corrupting writes…

Which is why I always recommend running larger, multi-year-old nodes with redundancy, but don’t for new nodes…

Again, this new change to the audit score algorithm parameters should make the score more stable, and thus it will be easier to see when it starts to drop, because it won’t jump back to 100%.

So yeah… a node fuse-type feature might be good, but I’m not sure it’s really needed…
I haven’t seen any signs of needing it, and I’m running quite a number of nodes.
But it’s something I have worried about, which is why I have been trying to determine if it is something I should be worried about, and these days I don’t really worry about that part…

Thanks @thepaul for getting back to us with this. I’m happy to see you liked my suggested parameters.

It won’t surprise you to hear I think this is great. I think it ticks all the boxes with the exception of still DQ’ing nodes a little too fast when temporary issues make them fail all audits. But to be honest, I think it’s impossible to fix that without making DQ too slow for legitimately bad nodes, by just changing the parameters.

I haven’t redone the testing on this now, since I already tested these exact parameters when I posted my previous suggestion, and it showed they would match the intention of allowing 2% loss to survive, but not 4% or higher. They also fix the issues I listed when I posted my initial suggestion. Yeah, this sounds like a home run to me.
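
For reference, the 2% vs. 4% split falls out of the steady state of the score update: assuming a constant per-audit failure probability p, α and β settle (in expectation) at

$$
\alpha_\infty = \frac{1-p}{1-\lambda}, \qquad \beta_\infty = \frac{p}{1-\lambda}, \qquad \text{score}_\infty = \frac{\alpha_\infty}{\alpha_\infty + \beta_\infty} = 1 - p
$$

so with a 0.96 threshold, 2% loss hovers around 0.98 and survives, 4% sits right at the threshold, and 10% is well below it; the higher λ mainly shrinks the swings around that mean.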

I agree that there probably need to be other things in place to prevent temporary issues like this from DQ’ing the node. Implementing the timeout check would help with that. I think I’ve also seen this happen when permission issues prevented the node from reading the files. I’m not sure why the readability check didn’t catch those in some examples posted around the forum. But that can be solved by other means.

It will happen about 4x slower. On large nodes this isn’t a fix. But it helps a little. It can still happen in an hour on the largest nodes for the largest satellites on that node. But these changes weren’t meant to fix that. That they help is just a small bonus.

I think if the timeout implementation for the readability check is in place, those scenarios will be solved as well.

With the proposed changes a 10% loss would be a guaranteed DQ. The current formula is indeed too forgiving. I’m sorry if that would cost you this specific node. But I don’t think that’s unfair.

3 Likes

At least it’s not faster. IMO there should not be a situation where a node is irreversibly disqualified in less than a couple of days. Reversible suspension etc can happen quickly, but not the irreversible DQ.

5 Likes

I’d add: not for ‘older’ nodes. The node’s age and size should help to indicate whether DQ is really the right option.

1 Like

Maybe this would also help to find a better DQ approach:

Because as I read it today, there is a limit on how many audits can be performed.

Maybe in the future, when more audits can be performed, a better DQ process is possible. For example:
If a node gets disqualified but the issue was something the node operator could not see, or disqualification was too fast, the node operator could apply for requalification. This could mean that the node gets no ingress for like a month and gets hammered with audits around the clock, or something like that, to re-establish its reliability. I cannot make exact proposals here, but the idea should be clear.

1 Like