New audit scoring is live

If repairs are not random, then maybe they should affect the audit score less? Something like two failed repairs = 1 failed audit. Otherwise there would be unfair DQs.

I must say, even though the “audit score” could probably be renamed for better clarity (and/or span a larger scale, depending on what it should reflect), I really like this new system better :ok_hand:

Before, one of my nodes that did lose data in the past was randomly at 100, then 90, then back to 100 a few days later, then 95, rarely at 85… It was difficult to get a good idea of its health.
Now, it’s more or less steady around 99.85% which is very nice and more usable.

I say, nice job to the Team and all forum members who actively participated in improving the audit score system! :+1: :smile:

5 Likes

Nice to see it working as intended! Thanks for posting your experience.

4 Likes

I don’t know why that is happening. I think the most likely explanation is that the line was simply emitted twice to the log, because of some minor bug with logger setup or confusion about whether the line had been logged already. Nothing in the Storj network is sent over clearnet; everything is encrypted and authenticated over TLS, so duplicate packets or leaky networks or bad drivers aren’t to blame.

@BrightSilence is right; that’s not the way things work at the moment. It could be made to work that way, but audit failures are not so frequent that it seems like an important thing to do.

  1. For one thing, it is sometimes the case that a clear failed audit becomes a successful audit in the future, when mount points are fixed or data is moved to the right location, etc.
  2. Then, even if we changed the code, it would still be possible that the satellite does not receive the error from the failed audit on your node before the audit times out. The error might be logged on the node, but the audit could still result in a timeout error on the satellite side. This would put the node in “containment”, where we keep auditing the same piece on purpose until we get a timely response one way or the other, or until the node is disqualified because of too many audit timeouts. That is to say, from the node’s side it might still look like we are auditing the same piece multiple times, even after a clear failure (see the sketch after this list).
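
To make the containment idea above a bit more concrete, here is a minimal sketch in Go. It is an illustration only, not the actual satellite code; the type names and the retry limit are made up.

```go
// Illustration only, not the actual satellite code: how "containment" keeps
// re-auditing the same piece until a definite answer arrives. The type names
// and the retry limit are made up.
package main

import "fmt"

type auditResult int

const (
	auditSuccess auditResult = iota // node returned the correct data in time
	auditFailure                    // definite failure: data missing or corrupt
	auditTimeout                    // no timely response; outcome unknown
)

const maxContainedRetries = 3 // hypothetical limit, not the real value

// auditPiece walks through a node's responses to repeated audits of the same
// piece and reports how the satellite would settle the outcome.
func auditPiece(responses []auditResult) string {
	timeouts := 0
	for _, r := range responses {
		switch r {
		case auditSuccess:
			return "audit success"
		case auditFailure:
			return "audit failure"
		case auditTimeout:
			timeouts++
			if timeouts >= maxContainedRetries {
				return "too many timeouts: counted against the node"
			}
			// stay in containment and try the same piece again later
		}
	}
	return "still in containment, will retry the same piece"
}

func main() {
	fmt.Println(auditPiece([]auditResult{auditTimeout, auditSuccess}))
	fmt.Println(auditPiece([]auditResult{auditTimeout, auditTimeout, auditTimeout}))
}
```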

That’s a well-reasoned theory! I like the way you think. But no, that doesn’t match with the way we deal with the databases. The explanation is simply that we don’t mark pieces as missing on audit failure. (Yet. It could certainly change.)

That is in order of segment ID. You’re right that it is essentially random as far as the user or the SNO is concerned; I meant it is not random from our perspective, where we know what the segment IDs are already. But I think it’s not an important difference in this case.

There can indeed be clustering of errors within a segment; if one piece had a problem, it is more likely that other pieces of the same segment will have errors. But since no node should have more than one piece of that segment, that shouldn’t have any clustering effect on nodes themselves.

I think you’re entirely right, but I think it’s an acceptable situation. If a node loses 10% of its data very early on, but the operator fixes whatever the problem was and the node accumulates lots more data that is not lost before we have time to discover the data loss by audit, it is probably fair to treat that node as though it had only lost, say, 1% of its data, which by then is the case. If we didn’t have time to notice the early data loss by audit, then it is also less likely that the missing pieces had any effect on their respective segments’ durability during that time.
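
As a toy illustration of that dilution effect (the numbers here are made up, not real node stats):

```go
// Toy numbers only, to illustrate the dilution effect described above.
package main

import "fmt"

func main() {
	lostGB := 100.0
	storedAtLossGB := 1000.0 // 10% of the data was lost early on
	storedLaterGB := 10000.0 // by the time audits catch it, the node holds much more

	fmt.Printf("loss at the time:     %.1f%%\n", lostGB/storedAtLossGB*100)
	fmt.Printf("loss when discovered: %.1f%%\n", lostGB/storedLaterGB*100)
}
```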

Not sure what you mean here. Do you know of any nodes with 40% data loss that weren’t disqualified? The old audit reputation was not as closely correlated with data loss percentage as it is now, with the newer beta model parameters, so it would have been harder to identify the nodes with 40% loss, but still it’s quite likely they would have been disqualified.

3 Likes

As long as they are not generated in order, that should still make the order effectively random. I can’t really explain the clustering in the graphs then, though, unless these errors aren’t caused by missing data but by a slowdown on the node itself. In this case, however, I believe missing data was what caused the errors.

I think @SGC meant that it didn’t require 40% data loss to be disqualified before. Personally I would consider that part of the problem with the old system. The threshold had to be that low because of the erratic score behavior, not because 40% data loss was actually allowed to survive. It became way too much of a game of chance rather than a clear cut-off of what’s acceptable. The new scores still have a small margin of chance, but it dropped from a 20% range to a 2% range (roughly). So while the old system suggested that up to 40% loss was acceptable, anything beyond 5% loss could in some cases lead to disqualification after a long time. The new system may suggest 4% loss is OK, but anything beyond 2% loss is dangerous. That seems much more reasonable to me.

In the end I think the reasons were clearly outlined in the linked topic and in my summary here. So @SGC have a look at that summary and if something is still not clear as to why this was chosen, maybe you can be more specific?

sorry i might have missed it… but i just wanted to clarify… is DQ enforced now?… does that mean that, moving forward, when a node is disqualified, it is permanent?

  • i recall there is currently some leeway for a node to resolve its issue within a 7-day turnaround?
  • if a node’s underlying harddisk is starting to fail, but the node data has yet to be migrated over to a new device… will it be DQed immediately?

lastly… is there a means for a node to quickly go offline once failed audits suddenly start increasing? as i’m under the impression that an offline node can be rescued and recovered with a longer leeway?
e.g… if the threshold is at 96% audit, can i configure it to go offline at 98%?

these machines are run at home, so the operations/support attention is minimal and not dedicated… so just wanted to see, from a node perspective, what options are available?

2 Likes

i’m not talking about the DQ / data loss threshold %
but the actual audit score.

before it would DQ at 60% just like suspensions happen at 60% suspension score, or online suspensions happen at 60% online score.

now it will DQ at 96% audit score, which might seem a bit confusing.
personally i don’t mind that it actually shows the real world numbers instead of some rather arbitrary number…

but i think it would be less confusing if all the scores were bad at the same level, at least for those with zero knowledge looking at it…

so what i propose is that the 96% audit score becomes 60%. ofc that is also rather confusing, which is why i suggested the whole concept of how to easily read it might need to be revamped.

from what i understand / remember, the reason everything was going by the 60% concept was that the online score is measured over 1 month / 30 days, and 288 hours of downtime is 12 days, which is 40% of 30 days.

so for ease of use, that “60% is bad” threshold was just imposed on the audit and suspension scores as well.

in short, it needs to be easy to read and learn the dashboard readout.
having multiple different thresholds makes it more difficult… but i must admit i kinda like it being the real world numbers.

maybe there are other ways to make it easy to read and simple to learn, the goal imho being the dashboard being readable with minimal knowledge.

It is always permanent.

for suspension, not disqualification. There is no suspension before disqualification, and there never has been. Suspension can happen when:

  • the suspension score falls below 60%;
  • the online score falls below 60%.

Disqualification now happens if the audit score falls below 96%.
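
As a small sketch (not the real satellite code), the thresholds mentioned above look like this:

```go
// Sketch of the score thresholds quoted above, not the real satellite code:
// suspension and online scores suspend the node below 60%,
// the audit score disqualifies it below 96%.
package main

import "fmt"

func status(auditScore, suspensionScore, onlineScore float64) string {
	switch {
	case auditScore < 0.96:
		return "disqualified"
	case suspensionScore < 0.60 || onlineScore < 0.60:
		return "suspended"
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(status(0.9985, 1.0, 0.95)) // ok
	fmt.Println(status(0.95, 1.0, 1.0))    // disqualified
	fmt.Println(status(1.0, 0.55, 1.0))    // suspended
}
```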

It depends. If it has lost more than 4% of its data, it will be disqualified.

The current version of storagenode does not have this feature. But basically, yes, it could give you more time to fix an issue and deal with the consequences of a low online score later (a possible external workaround is sketched after the list below):

  • if your node is offline for more than 30 days, it will be disqualified anyway;
  • while your node is offline, it is losing data (offline data is considered unhealthy and can be repaired to other nodes; when you bring your node back online, the garbage collector will remove this data from your node. The longer you are offline, the more data will be moved from your node to more reliable ones).
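
Until such a feature exists, you could run an external watchdog yourself. This is a rough sketch only: it assumes the local dashboard API at http://localhost:14002/api/sno/satellites exposes per-satellite audit scores, and the JSON field names vary between storagenode versions, so check your own node’s output and adjust. Stopping the node with `docker stop storagenode` is just one possible action.

```go
// Rough external watchdog sketch. The response shape and field names below
// are assumptions that vary between storagenode versions; check your own
// node's JSON and adjust before using.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os/exec"
)

type satellitesResponse struct {
	Audits []struct {
		SatelliteName string  `json:"satelliteName"` // assumed field name
		AuditScore    float64 `json:"auditScore"`    // assumed field name
	} `json:"audits"`
}

func main() {
	const stopBelow = 0.98 // shut down well before the 0.96 disqualification threshold

	resp, err := http.Get("http://localhost:14002/api/sno/satellites")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var data satellitesResponse
	if err := json.NewDecoder(resp.Body).Decode(&data); err != nil {
		log.Fatal(err)
	}

	for _, a := range data.Audits {
		if a.AuditScore > 0 && a.AuditScore < stopBelow {
			log.Printf("audit score %.4f on %s is below %.2f, stopping node", a.AuditScore, a.SatelliteName, stopBelow)
			if out, err := exec.Command("docker", "stop", "storagenode").CombinedOutput(); err != nil {
				log.Fatalf("failed to stop node: %v (%s)", err, out)
			}
			return
		}
	}
}
```
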
1 Like

But the score literally is a rolling estimate of the data retention percentage (the inverse of data loss). These aren’t two separate things. It’s just that the estimate used to have very little memory of previous audits, so it used to be all over the place. Now the estimate has more memory, so it is more accurate, and the threshold can also be (and has to be) a lot closer to the actual acceptable data loss.
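
For anyone curious, here’s roughly what that estimate looks like: a minimal sketch of the beta reputation model. The update rule follows the published design as I understand it; the lambda and weight values below are only illustrative, not the production settings.

```go
// Rough sketch of the beta reputation model behind the audit score.
// The lambda and weight values here are illustrative, not production settings.
package main

import "fmt"

type reputation struct {
	alpha, beta float64
}

// update applies one audit outcome. A lambda close to 1 means a long memory,
// so a single failure only moves the score a little.
func (r *reputation) update(success bool, lambda, weight float64) {
	v := -1.0
	if success {
		v = 1.0
	}
	r.alpha = lambda*r.alpha + weight*(1+v)/2
	r.beta = lambda*r.beta + weight*(1-v)/2
}

func (r reputation) score() float64 { return r.alpha / (r.alpha + r.beta) }

func main() {
	r := reputation{alpha: 1000, beta: 0} // start near 1.0 with lots of "memory"
	lambda, weight := 0.999, 1.0

	r.update(false, lambda, weight) // one failed audit
	fmt.Printf("after one failure:   %.4f\n", r.score())

	for i := 0; i < 100; i++ { // many successes slowly pull it back up
		r.update(true, lambda, weight)
	}
	fmt.Printf("after 100 successes: %.4f\n", r.score())
}
```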

I don’t really get this, because 60% is even more arbitrary than 96%, since it has less meaning. Maybe take a look at @CutieePie’s suggestion a few posts up, which happens to also be how reputation scores work in my earnings estimator. It’s basically replacing the actual percentages of data retention with 0% = all good, 100% = disqualified. That gets rid of all arbitrary numbers and replaces them with something everyone can understand intuitively.
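
To make that idea concrete, here’s a quick sketch of the remapping, using the thresholds quoted in this thread (0.96 audit, 0.60 suspension/online). This is not how the current dashboard renders scores.

```go
// Sketch of the suggested display remapping: show "how far toward
// disqualification/suspension" a score is, so 0% is healthy and 100% means
// the threshold has been hit. Thresholds are the ones quoted in this thread.
package main

import "fmt"

// towardThreshold maps a raw score in [threshold, 1.0] to 0..100%.
func towardThreshold(score, threshold float64) float64 {
	p := (1 - score) / (1 - threshold) * 100
	if p < 0 {
		p = 0
	}
	if p > 100 {
		p = 100
	}
	return p
}

func main() {
	fmt.Printf("audit 0.9985 -> %.2f%% toward DQ\n", towardThreshold(0.9985, 0.96))
	fmt.Printf("audit 0.9600 -> %.2f%% toward DQ\n", towardThreshold(0.96, 0.96))
	fmt.Printf("online 0.80  -> %.2f%% toward suspension\n", towardThreshold(0.80, 0.60))
}
```

With that mapping, the 99.85% audit score mentioned earlier in this thread would show as roughly 3.75% of the way toward disqualification.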

I agree with your goal of making it easier to understand, but not this proposed solution.

Is this actually correct? I was under the impression that the system switched over for both scores, since they both use the beta model.

thanks for the clarification!
could we think about including this feature in future releases?

noted that when a node is offline it starts losing data, but a few days is probably what someone needs in order to replace/migrate the node… it would do good for longstanding nodes maybe?… besides… i think the point is most of these machines are running at home while we are at work… and not everyone practices remote monitoring/management of the nodes, so an easy config would help a lot

on the note of 60% and 96%… why don’t we add finer details to the screen?
like additional details in a hidden layer that can be expanded, detailing the following:

  • how many failed audits in the last 1000 rolling - e.g. 10
  • what’s the threshold measured against - for now, 40

and of course any other details that one might think would help

All three scores are independent, but we can ask @thepaul

See

yeah i totally agree the 60% limit is very arbitrary, the 0 to 100% is a very good idea…
totally forgot about that…
i think we should push that as a feature request, but since it’s CP’s idea i don’t want to take the credit.
@CutieePie maybe you want to call for a vote on this?

We’ve been talking about this for quite a long time. Over 2 years ago I posted this mockup as a suggestion. But I guess I never posted a suggestion in the appropriate area.

(image: dashboard score display mockup)

However, this suggestion seemed to get at the same thing. But the wording of that suggestion was really unclear and the conversation died down quickly. Asking for more clarification and showing my suggestion didn’t get a further response: My Suspension score 100%

3 Likes

yeah that is basically the same concept, just with a bit more graphical detail.
i may have glanced at it when you first mentioned it… still a good idea.

0 - 100% would do the same thing, and the only modification would be to a couple of variables somewhere in the dashboard or storagenode api backend, i guess.

another thing we have to keep in mind is that StorjLabs most likely has plenty of stuff to do, so if it isn’t straightforward and there is not much benefit, they might disregard good ideas.

whoever makes the suggestion and whatever the best solution ends up being, it has my vote.
for now i would think the 0 - 100% has the best odds of being implemented and would do the job just fine.

non-ICE does make a valid point that it’s odd that the suspension score is 100%
sort of sounds wrong or gives the wrong idea…
ofc setting that to 0% instead of 100% also makes it look kinda off…

so maybe suspension score could be renamed or something…
audit score 100% and online score 100% makes good sense tho…

i suppose like non-ICE says, it doesn’t need to be called suspension score… just like audit score isn’t called DQ score.

maybe Health Score would work…
suspension score is a measure of faults or errors created by the node…
which is sort of a node health indicator…

might get confused with the audit score… since it does sort of remind one of SMART health for HDDs or such.

so maybe another word… but something like health which most people would understand.
integrity score maybe… but i doubt that’s nearly as well understood a word, and again it could be misinterpreted to have something to do with HDDs / storage

coherency… might work… but then that is most likely a more obscure word than integrity for those with limited english skills.

but a node throwing errors is essentially incoherent and it isn’t because of the data being bad…

yeah… i kinda like that… coherence score

but i guess, system score, or consistency score might also work.

this is weird…
never actually had an unexpected audit drop before…
could this be some sort of rounding error or some such thing?

kinda doubt i actually failed an audit…
got two nodes showing this and both on ap1 only

can’t exactly rule out it being a hiccup with my storage, have had some issues during a recent replacement of some disks, but i believe zfs did manage to resolve the inconsistencies, which is sort of expected as i run with redundancy.

anyways, wanted to mention it, in case somebody else is seeing something similar.

it can be some restart downtime for an update; also, after an update it makes the space calculation and can drop any repair due to high I/O.

That’s right, we did not change the “unknown audit” score model. It doesn’t have inputs coming in at anywhere near the same rate, and there isn’t any particular mathematical reason why it should operate in lockstep with the standard audit score, so it deserves its own analysis and research done before we change that part.

2 Likes

Well, I do agree with @BrightSilence: this matter has been discussed in the past, a few people suggested several great improvements for how scores could work and be displayed, but it never got much attention. I’m not sure why :confused:

The link I already posted in this thread ~10 days ago is not the only one where the subject was discussed, but it is votable and does gather many interesting ideas, IMHO.

If someone else has a better way of expressing those ideas so they get more traction, please be my guest, I’ll be happy to vote and support them! :slight_smile:

1 Like

I think it would be quicker if someone from the Community made a pull request regarding this.

1 Like