Tuning audit scoring

nerdatwork · February 10, 2022, 4:41pm

I, too, can attest that when my computer froze a few times, my node was DQed on a satellite rather than suspended. My mouse and keyboard showed no activity (no mouse pointer movement and no change in the keyboard LEDs). There was still network activity, though, as the LEDs on the NIC were blinking. I would have expected suspension to give me a chance to fix the issue, but I never got one because the DQ was so fast.

thepaul · February 10, 2022, 4:56pm

It is possible for a computer to stop responding to user input events while still responding to network events. You could still call it a “freeze”, but not a “kernel freeze”. I can only surmise that is what happened here. Adding a timeout to the readability check should help a lot with this sort of situation.
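To make the timeout idea concrete, here is a minimal sketch in Python. It is not the storagenode's actual readability check; the function name `check_readable` and the 5-second default are assumptions made up for illustration. The read runs in a daemon thread so that a read stuck in the kernel cannot block the caller forever.

```python
import threading


def check_readable(path, timeout_s=5.0):
    """Return True if one byte of `path` could be read within `timeout_s`.

    Hypothetical helper for illustration only; not taken from the storagenode.
    """
    result = {"ok": False}

    def _read():
        try:
            with open(path, "rb") as f:
                f.read(1)
            result["ok"] = True
        except OSError:
            # Missing file, permission error, I/O error, and so on.
            result["ok"] = False

    # Daemon thread: if the read hangs, it will not keep the process alive.
    worker = threading.Thread(target=_read, daemon=True)
    worker.start()
    worker.join(timeout_s)

    # A thread that is still alive after the timeout means the read is stuck.
    return (not worker.is_alive()) and result["ok"]
```

A caller could treat a `False` result as "storage unavailable" and stop answering requests instead of failing audits one by one. Note that the stuck thread itself is not cleaned up; the sketch only keeps the caller from hanging with it.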

nerdatwork · February 10, 2022, 5:24pm

I understand this is off topic, but could you please tell me how to distinguish a “freeze” from a “kernel freeze”?

Pentium100 · February 10, 2022, 5:37pm

Sorry, I may have used a wrong term for this.

There have been instances on other servers where the disk IO system froze (some zfs bug). Any process that tried to access data on the disk would just stay in D state forever (with dmesg getting some “process has been blocked for more than …” messages). I do not know if it is possible to have a timeout on that (it’s a bit too deep for me; some other thread, maybe?).
The drives themselves were fine, I could access them with dd or whatever and after a reboot everything went back to normal.

This is a bit difficult to catch with a script. If the script is not cached in RAM, then the system would try reading the disk and freeze, never starting it.

Something similar happened in the thread I linked to. A node was initiating transfers but never completing them (until a reboot) and was disqualified in 4 hours.

thepaul · February 10, 2022, 5:52pm

Oh, sure. “Kernel freeze” isn’t really a well-defined term of art, but I expect it could only refer to what is called a “kernel panic” in the Unix world or a “stop error” (BSoD) in Windows land. Typically in these cases you’d get stack trace or memory dump information on the console and the system would stop doing anything else until rebooted.

Skyblockpro1 · February 10, 2022, 10:55pm

I think this could be mitigated simply by applying the new solution and changing the percentage from 15% to 4% over a period of time, e.g. a year. IDK, I would have to run the numbers, but the point is that as the bad satellites get kicked off the network gradually, it would solve the issue.
But if lots of satellites ended up in one wave of write-offs, we would still be risking the integrity of the network.
Another possibility, though it might seem sneaky, is to identify the satellites and DQ them behind the scenes without the operator knowing. E.g. mark the satellite as dead and move all the valid data to other nodes before signing off on the network, and only then notify the node operator. The delay in notification would be there so they don’t take it offline.

I have explored different possibilities but haven’t found a bulletproof idea.

Alexey · February 11, 2022, 2:08am

It’s responding, when the rest of hardware is not. At least it was in my case. And I think in others.

Alexey · February 11, 2022, 3:06am

I had such a thing regularly when I tried to migrate to Linux (see Moving from Windows to Ubuntu and back). I configured it to reboot after a kernel panic, but this doesn’t help: the reboot is not enough, you need to reset it with the hardware “Reset” button. This is for Linux; Windows happily reboots after a BSOD if configured to, and then works normally.
In my case one or two cores of the CPU hang (at least this is how I interpret the messages in dmesg) and the system becomes unresponsive to user events. However, this doesn’t prevent the node from answering audit requests, but it doesn’t return anything (because that thread is hung).
However, this never happens under Windows on the same hardware. Even a BSOD occurs only when a fresh new Windows version is rolled out (you know, Windows rolling updates). The same new version, if installed half a year or a year later, works normally.

elek · February 11, 2022, 8:21am

Sorry, if it was not clear.

What you see here is that, with this lucky order of audits, we can survive 33% data loss (lambda=0.95, initial score=0.66, events = failure, success, success, failure, success, success,… → the score doesn’t drop below 0.6…)

And this is true with any lambda.

And this is not possible if the score is higher: it works because with a lower reputation score the healing is faster. (At 0.666 we can survive ~1/3 data loss if events are ordered well; at 0.95 we can survive ~1/20 data loss if events are ordered well.)

In practice, you have a chance of more than one failure out of three events, even if you have lower data loss (this is the part where probability steps in).

But if we don’t like this, the only solution is increasing the DQ threshold, as this behavior comes from the calculation method not from lambda.
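The “lucky order” case can be replayed with a few lines of Python. This is only a sketch under an assumption: it uses the simplified exponentially weighted update `score = lambda * score + (1 - lambda) * result`, with `result` being 1 for a passed audit and 0 for a failed one. The real satellite calculation may differ in detail, and the function names here are made up for illustration.

```python
def update_score(score, success, lam=0.95):
    """One audit event under the assumed simplified update rule."""
    return lam * score + (1.0 - lam) * (1.0 if success else 0.0)


def simulate(initial, pattern, rounds, lam=0.95):
    """Repeat `pattern` (a list of True/False audit results) `rounds` times."""
    score = initial
    history = []
    for _ in range(rounds):
        for success in pattern:
            score = update_score(score, success, lam)
            history.append(score)
    return history


# failure, success, success, repeated: one third of all audits fail.
history = simulate(0.66, [False, True, True], rounds=100)
lowest = min(history)
print(f"lowest score: {lowest:.4f}")
```

With these inputs the lowest point is reached right after the first failure (0.66 × 0.95 = 0.627), and each later cycle ends slightly higher than the one before, so the score stays above 0.6 even though a third of the audits fail.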

elek · February 11, 2022, 8:37am

I am not sure either, but here is what I am thinking about:

Why do we DQ the nodes almost immediately (without giving grace period like we do with unknown errors)? What is the risk?

not found: in this case other pieces will win, and pieces with not-found response will be ignored as part of the long-tail cancellation.
in this case the EC recovery will fail if enough pieces are wrong. In a very unlucky case (?) EC may recover a different stripe, but the MAC of the encryption will detect the problem. Either way, the stripe has to be downloaded again.

You may be right, the difference is not so big.

But I see very little difference between a “not found” piece error and an offline node. Today the first is sanctioned quickly (DQ), while the latter has a grace period (suspend).

This does not seem 100% fair to me, and I think this was the origin of some complaints (why didn’t node operators with long reputations have any chance to fix the problem: with DQ the change is very slow, with suspend + grace period there is a chance to fix problems).

BrightSilence · February 11, 2022, 3:53pm

So, I’ve been thinking. Given that:

The node volunteers the “file does not exist” info to begin with and the satellite can’t reliably determine the difference between known and unknown failures if “creative” SNOs don’t want it to
Suspension already takes care of protecting data on the node by marking pieces unhealthy and queueing segments below the repair threshold for repair
Raising the lambda stabilizes the score around the actual percentage of missing data / failed audits

Why differentiate between known and unknown audits at all? Why not recombine them into one score, set a high threshold, like 97%, and have any type of audit failure hit that single audit score? When a node drops below that threshold, it gets suspended and gets a grace period (say a month) to fix issues and recover the score. After that, a monitoring period starts (of say a week). If the node drops below the threshold in that week, it is disqualified permanently.

This will result in about the following based on some early simulations I ran:

Nodes with actual data loss of 4% or higher won’t be able to get out of suspension to begin with and will be disqualified after a month. In the meantime they are suspended and repair has already kicked in, so delaying the permanent disqualification causes no additional harm.
Nodes between 2% and 4% data loss may go in and out of suspension during the grace period, but will likely still be disqualified during the monitoring period.
Nodes between about 1.6% and 2% file loss may or may not survive the monitoring period. It depends on luck.
Nodes with at most 1.5% file loss who got suspended with temporary issues get a chance to recover from that and will survive the monitoring period if fixed in time.

Possible downsides

Nodes that encounter temporary issues again during the monitoring period will get disqualified.
Node operators who just want to see the world burn could block access to data during the grace period, then allow access again shortly before the monitoring period and after that is done remove access again. However, if they do, they will spend most of the time in suspension, not getting any data and losing data to repair. It wouldn’t really do damage other than having perhaps a small impact on repair costs. And there is no upside to doing this as it requires you to store all data anyway.

Note: This isn’t entirely fleshed out yet and probably needs some refinement.
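As a rough illustration of the proposed flow, here is a small state machine in Python. It is one possible reading of the proposal, not satellite code: the state names, the one-score-per-day granularity, and the rule that the grace period always runs its full length before the check are all assumptions made for the sketch. The 97% threshold, the one-month grace period, and the one-week monitoring period are the example values from the post.

```python
from dataclasses import dataclass

THRESHOLD = 0.97        # single audit score threshold (example value)
GRACE_DAYS = 30         # "say a month"
MONITORING_DAYS = 7     # "say a week"


@dataclass
class NodeStatus:
    state: str = "healthy"   # healthy | grace | monitoring | disqualified
    days_in_state: int = 0


def advance_one_day(status, score):
    """Apply one daily score observation and return the updated status."""
    below = score < THRESHOLD
    state, days = status.state, status.days_in_state + 1
    if state == "healthy":
        if below:
            return NodeStatus("grace", 0)          # suspended, grace starts
    elif state == "grace":
        if days >= GRACE_DAYS:
            # Still below the threshold after the grace period: permanent DQ.
            return NodeStatus("disqualified" if below else "monitoring", 0)
    elif state == "monitoring":
        if below:
            return NodeStatus("disqualified", 0)   # dropped again: permanent DQ
        if days >= MONITORING_DAYS:
            return NodeStatus("healthy", 0)
    elif state == "disqualified":
        return status                              # terminal state
    return NodeStatus(state, days)


def run(scores):
    """Feed a sequence of daily scores through the state machine."""
    status = NodeStatus()
    for score in scores:
        status = advance_one_day(status, score)
    return status.state
```

For example, `run([0.95] + [0.99] * 37)` models a node that dips once, recovers, and ends up healthy again after the grace and monitoring periods, while `run([0.95] * 40)` models a node that never recovers and is disqualified when the grace period ends.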

Toyoo · February 11, 2022, 5:08pm

I observe this discussion from a distance and wonder: would it be possible for Storj to release an anonymized dataset of audit and repair queries? It would then be possible for non-Storjlings to test ideas on real data and see how different modes of failure are distributed over the network as a whole.

Alexey · February 12, 2022, 2:34am

There is one upside if someone wants to free up some space with such a radical method.

BrightSilence · February 12, 2022, 7:43am

I guess, if you really feel like living on the edge. But you can already do that a lot safer by just going offline for more than 4 hours. Both ways are really slow anyway, so it’s not really worth it.

Of course there’s also the option to implement partial exit.

Alexey · February 12, 2022, 7:59am

Your PRs are welcome!

By the way, there is a roadmap: https://github.com/orgs/storj/projects/23

Pac · February 12, 2022, 12:28pm

I think that is a case that should be detected by the storage node. I’ve advocated for this for a long time to accommodate SMR disks, but it would be a good thing generally:
It should be possible to detect when the number of stacked requests waiting for i/o completion becomes too high, at which point the node should stop accepting new requests.

That’s basically what the max-concurrent-requests option does for ingress.
I say we could probably apply such a system to both ingress and egress, at a high value, so the node doesn’t keep accepting tasks until it runs out of RAM or gets killed uncleanly…
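A minimal sketch of that kind of limit, in Python. This is not how the storagenode implements max-concurrent-requests; the class `RequestLimiter` and its method names are hypothetical. The idea is simply to refuse new work once too many requests are already waiting on I/O, instead of queueing them.

```python
import threading


class RequestLimiter:
    """Caps the number of requests in flight (hypothetical illustration)."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_start(self):
        """Return True if the request may proceed, False if it should be rejected."""
        # Non-blocking: a full node rejects immediately instead of stacking requests.
        return self._slots.acquire(blocking=False)

    def finish(self):
        """Call once when an accepted request completes (successfully or not)."""
        self._slots.release()


limiter = RequestLimiter(max_in_flight=2)
accepted = [limiter.try_start() for _ in range(3)]  # third request is rejected
```

If disk I/O stalls, accepted requests never call `finish()`, the slots stay taken, and every further request is rejected right away rather than piling up in memory.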

(I’ll be honest, I didn’t read the whole thread in thorough detail, as it is heavily technical. My apologies if my suggestion conflicts with anything or is off-topic.)

littleskunk · February 12, 2022, 1:12pm

The audit job has a maximum concurrency. You wouldn’t be able to stack up requests. You would fail the audit requests basically one by one (satellite side timeout of 5 minutes) and still get DQed after a few hours.

Pac · February 12, 2022, 2:30pm

Not if the node stops accepting audit requests? But again, I’ve got the feeling I’m not seeing the big picture here ^^

elek · February 14, 2022, 10:40am

If you can define what kind of data would be useful (fields/data…) I can check if it’s available easily or not…

elek · February 14, 2022, 10:51am

I like the idea. I think it would be better to use a grace period in case of any problem (not a month, but e.g. a week).

But there are open questions here (IMHO):

The goal of the unknown score + suspension period is to avoid disqualification in case of any software error (which is not the fault of the SNO). With this scheme, one can be unfairly disqualified if a software error happens after a suspension period. (But we can argue that these cases should be rare and handled by support manually.)
There should be some motivation to avoid the suspension state. Either avoid using the suspended node for download (-egress payment) or fully stop the accounting during suspension period (-egress/storage/repair payment).