Got Disqualified from saltlake

I’ve been keeping up with this topic as my node had an almost identical issue a few months ago, which luckily I did catch. My hardware RAID controller randomly froze, causing storage access to become extremely slow. The whole computer slowed to a crawl; the storagenode was technically still accepting requests but, as far as I could tell, was not responding to anything. I ended up rebooting the computer, which “fixed” the issue.

From an outside perspective, if the node isn’t responding I don’t think the node could crash itself, but on Windows this might be a suitable task for the storagenode-updater service. It would be helpful to have more information about a failing node, to understand what factors could be used to determine whether it should be crashed or shut down. Perhaps a scheduled task which checks for a response from the node API? Or a log scan looking for ‘Downloaded’ in recent entries?
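Something like the rough sketch below is what I had in mind. It is only an illustration: it assumes the node’s local dashboard API is reachable on the default port 14002, and the `/api/sno` endpoint, the check interval, and the `storagenode` Windows service name are assumptions you would adjust for your own setup.

```python
# Hypothetical watchdog sketch: poll the local node dashboard API and
# restart the service if it stops answering. Port, endpoint and service
# name are assumptions; adjust them for your setup.
import subprocess
import time
import urllib.request

API_URL = "http://localhost:14002/api/sno"  # local dashboard API (assumed)
SERVICE = "storagenode"                      # Windows service name (assumed)
FAILURES_BEFORE_RESTART = 3                  # tolerate a few slow responses

def node_responds(timeout=30):
    try:
        with urllib.request.urlopen(API_URL, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

failures = 0
while True:
    if node_responds():
        failures = 0
    else:
        failures += 1
        if failures >= FAILURES_BEFORE_RESTART:
            # The node stopped answering; restart the service so it does
            # not keep timing out on audits.
            subprocess.run(["sc", "stop", SERVICE], check=False)
            time.sleep(30)
            subprocess.run(["sc", "start", SERVICE], check=False)
            failures = 0
    time.sleep(60)
```

The same loop could scan the log for recent ‘Downloaded’ entries instead of calling the API; the point is just that the check runs outside the node process, so a frozen node can still be restarted.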

1 Like

If so, setting up monitoring is also useless. Even if this kind of issue is transient, you want these nodes to be disqualified, because monitoring is not an alternative unless it is capable of waking the node operator up at midnight.

It’s really not helpful to make such statements; it’s clearly not true. Let’s try and keep it constructive. It’s a tough problem to tackle, it needs a fix, and it’s being looked into, but the danger of abuse is very real. V2 had a lot of such issues, which made it quite a bad experience for honest node operators like us. Nobody wants that either.

Well, that’s how I interpret it:

Could you please formulate it in a more constructive way, rather than the destructive way I did?
I want to learn how to deliver bad news better.
The key information: the node failed audits and thus was disqualified.
I did not want to create the impression that “I want to disqualify”.

4 Likes

By the way, this bug (repeated audits of the same piece while the node was in containment mode) is now fixed: satellite/audit: fix containment bug where nodes not removed · storj/storj@5a1a29a · GitHub
It should be included in the next release.

4 Likes

That depends on what you wanted your statement to mean. I assume you don’t mean that exactly four 1 kB pieces, each not received within a 5-minute window, will lead to node disqualification; rather, there are also other factors involved in the decision to disqualify, which you don’t want to discuss right now. So all you need is to make the statement less definite, by adding some wording that at least suggests the existence of these other factors.

«Proof» and «reason» are strong words; they imply certainty. «Symptom» or «sign» imply an indication of something happening, but without the implication that it is the only factor taken into account.

So, maybe something like: «Storj considers failing to provide a 1 kB chunk of data within 5 minutes a sign of the node currently being unreliable, and hence a candidate for future disqualification.»

But if indeed you wanted to state that four failed audits should lead to disqualification, then you expressed that pretty much to the point.

2 Likes

Mathematical algorithms show that repeatedly failing to provide a 1 kB chunk within 5 minutes indicates that the node is becoming unreliable, and it will be disqualified as a result of continued failures.

Storj engineers think that in some cases (an old node which suddenly starts to fail audits because of timeouts) failing audits due to timeouts is not an attempt to abuse the audit system, and they want to avoid quick disqualification of such nodes without affecting the reliability of the network.
We have not found a way to achieve that yet.

5 Likes

Thanks for this message, I really appreciate the switch in tone!

I think I can give some clarification on this part. @Alexey linked the containment blueprint in an earlier message, but that’s a lot to read. However, this is what it comes down to. If your node is unable to respond to an audit within 5 minutes, it isn’t immediately seen as a failed audit. Instead, whenever the next audit happens for your node, it will request the same stripe again. This ensures you can’t use a timeout to get out of having to verify you still have that piece. You again get 5 minutes to respond. I believe you get a total of 3 retries, but I’m not entirely sure of that number. If you time out on all tries, then it will count as a single audit failure. From that point on, audits will request a new stripe again. This wouldn’t cause a disqualification, just a single audit failure. You’d need quite a few more to actually be disqualified. The problem is that nodes with lots of data get quite frequent audits. So even if it takes 4 timed-out audits to count as a single failure, you can still rack up those failures pretty fast.
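To make that flow a bit more concrete, here is a rough sketch of the containment behaviour as I understand it. This is an illustration, not the satellite’s actual code, and the retry count of 3 is the same guess as above.

```python
# Rough sketch of the containment flow described above; not the real
# satellite implementation. MAX_REVERIFY_TRIES = 3 is an assumption.
MAX_REVERIFY_TRIES = 3

def audit(node, pick_stripe, request_stripe):
    if node.contained_stripe is None:
        stripe = pick_stripe()            # normal audit: pick a new stripe
    else:
        stripe = node.contained_stripe    # contained: ask for the same stripe again

    result = request_stripe(node, stripe, timeout_minutes=5)

    if result.timed_out:
        node.contained_stripe = stripe    # keep (or put) the node in containment
        node.reverify_tries += 1
        if node.reverify_tries >= MAX_REVERIFY_TRIES:
            # Every retry timed out: count it as a single audit failure
            # and release the node from containment.
            node.contained_stripe = None
            node.reverify_tries = 0
            return "audit_failed"
        return "contained"                # no score change yet, will be retried

    node.contained_stripe = None          # node answered, leave containment
    node.reverify_tries = 0
    return "audit_passed" if result.data_valid else "audit_failed"
```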

I previously posted a suggestion to tune audit scoring here: Tuning audit scoring
I think this suggestion could be further tuned by raising the lambda value based on the amount of data stored. The more data you have, the more the memory of previous successful audits weighs into the formula. This would essentially remedy the inverse relation between how long a node has been loyal and how quickly it can be disqualified. These nodes also have much more on the line, so messing with the system on such nodes is just a very bad idea to begin with. If you’re going to try to cheat the system, you wouldn’t do it with a node with so much income. Just a suggestion, but I think this would help a lot with this problem without exposing the satellites to more risk.
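To give an idea of what lambda does there: after a long run of successful audits, the score of a node drops to roughly lambda^n after n consecutive failures, so lambda directly controls how many failures it takes to cross the disqualification threshold. Here is a small sketch; the parameter pairs below (0.95 with a 0.6 threshold, 0.999 with a 0.96 threshold) are example values only, chosen because they reproduce roughly the 10 consecutive failures of the current system and the roughly 40 discussed further down.

```python
# Sketch of the beta-reputation audit score update and how lambda affects
# the number of consecutive failures before disqualification. The concrete
# lambda/threshold values are assumed examples, not production settings.
def failures_until_dq(lambda_, threshold, w=1.0):
    # Start from a node with a long history of successful audits:
    # alpha converges to w / (1 - lambda), beta converges to 0.
    alpha = w / (1.0 - lambda_)
    beta = 0.0
    failures = 0
    while alpha / (alpha + beta) > threshold:
        alpha = lambda_ * alpha            # failed audit
        beta = lambda_ * beta + w
        failures += 1
    return failures

print(failures_until_dq(0.95, 0.6))    # ~10 consecutive failures
print(failures_until_dq(0.999, 0.96))  # ~40 consecutive failures
```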

4 Likes

Yeah, I actually remembered your post when typing one of the previous messages here. However, I believe it solves a different problem, not the one discussed here. Please consider that in your simulations you are assuming independence of subsequent audit failures. This nicely models a hard drive having bad sectors. However, in case of transient system-level problems, like heavy swapping because another task hosted on the same box took all the memory and then some, you might fail all audits until the box is restarted.

You state there that your experimental choice of parameters would allow 40 audits in a row to be failed. From my point of view, this is too much and too little at the same time. It’s too little when you consider that Storj now tells operators they can go on vacation, because if their nodes are offline for 2 weeks, that’s not a big problem. Stating then that they’d have 40 hours to fix a misbehaving node makes that statement much weaker. It’s too much because the node is still considered a candidate for fresh uploads/downloads, which will obviously fail the same way audits do, making the customer experience worse.

That’s why I like the idea from the other thread: it explicitly puts the node that is failing audits into a special state, not trusted enough to handle traffic, but still with hope for recovery.

1 Like

I think it could be both.
However, if we do not go the mathematical way, then it should be expensive enough that nobody tries to abuse the timeout to avoid auditing. And I mean literally: you would have to lose money, to make sure this is a bad way to abuse the system. (I think mathematics is the best solution, despite the fact that I am in the same boat: I recently migrated my Windows server to Ubuntu and I get a lot of “general protection fault” errors; by the way, they lead to exactly this situation, and my own nodes are affected. They are not DQed yet, but I’m not sure that can’t happen anytime soon.)

That may have been the initial goal, but the suggestion’s main focus is to give the score a longer memory. Even without tuning, the current system disqualifies a node after 10 consecutive failures, while my suggestion would require at least 40 consecutive failures. That already gives you 4x as much time to fix things. If that memory is further increased for nodes with a lot of data, you could easily tweak it to give those nodes 10x as much time. I would agree that it’s not a full solution and it would be better for the node to protect itself. But you can’t argue that this wouldn’t already help a lot.

Making this part dynamic lets you aim for a specific time frame to give nodes to repair issues. That should take care of the fact that it would otherwise take 2 weeks for smaller nodes and a few hours for larger nodes. The time frames would be closer together for all nodes, and you could aim at something like maybe 2 days for everyone.
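As a rough illustration of what making it dynamic could look like: if consecutive failures push a previously healthy node’s score down roughly as lambda^n (as in the sketch earlier), and we know approximately how many audits per day a node receives, lambda can be solved from the target window. The audit rates and the 2-day target below are made-up numbers.

```python
# Illustrative only: pick lambda per node so that, at its audit rate,
# consecutive failures take roughly `target_days` to push the score below
# the DQ threshold. The audit rates and the 2-day target are examples.
def lambda_for_window(audits_per_day, target_days, dq_threshold=0.96):
    # After n consecutive failures, a previously perfect node's score is
    # approximately lambda ** n, so solve lambda ** n = dq_threshold.
    n = audits_per_day * target_days
    return dq_threshold ** (1.0 / n)

for audits_per_day in (10, 100, 1000):   # small vs. large nodes (made-up rates)
    print(audits_per_day, round(lambda_for_window(audits_per_day, target_days=2), 6))
```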

I am not against putting nodes in a warning state earlier, before disqualifying them… in fact, I suggested exactly that at the bottom of that topic. However, it would require making the score more stable first, otherwise nodes would just constantly pingpong in and out of that warning state.

I wouldn’t advocate for that. Having money at stake is one of the reasons I never got started with Sia. Maybe if it’s losing income, like temporarily holding back the held amount again or something. That might work. But I don’t want to hide an issue away with just monetary consequences instead of disqualification. In my opinion it should be possible to prevent honest nodes from having those bad consequences altogether. The way I see it, right now honest nodes are at a disadvantage, because someone abusing the system will definitely just take their node offline to postpone being disqualified, yet honest nodes currently don’t do that. So let’s try to make the node protect itself first.

3 Likes

That’s the goal: to make nodes protect themselves. Suspension is not a solution, as far as I understand, especially after it was shown how easily it could be abused…
I’m against putting money at stake too, but at the moment only that can stop bad actors from abusing the suspension.
It would be cool if we could do it with math alone; that would be the best option.

1 Like

Maybe we can borrow the idea of “proof of stake” from blockchains like Cardano?

For example, satellites could monitor the wallet the node operator provides to receive their rewards. If the node operator doesn’t withdraw any money and just stakes it, then the larger the amount multiplied by the time it is held, the more the satellite can trust that node.

If the node operator needs that money, he can just withdraw it, but his node’s trust score may become lower; if in the meantime his node fails audits for any reason, it will get disqualified sooner.

But if he deposits a lot of money into that wallet, his trust score may become higher, and his node will not get disqualified as soon. Note that he may still get disqualified if he tries to abuse the system for a longer time. The staked money only affects how soon the node gets disqualified, not whether it does. Here we assume that a bad actor can’t afford to borrow that amount of money to abuse the system for a long time; it would cost more than the rewards he is trying to get.

We cannot stake or do anything else with the tokens; they are not securities or investments, they are utility tokens, and their only purpose is to simplify payments.
So this idea would not be implemented; it would go against the purpose of the token.
We also do not hold tokens as collateral. All calculations are made in USD anyway.

The idea of buying reputation was posted before - Buying reputation - and it was not accepted even by the Community.

1 Like

Please note that @BrightSilence’s simulation is math—a Monte Carlo method for estimating parameters of a probabilistic distribution. He has already suggested a model for analysis. It’s not that we’re trying to avoid math.
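For anyone unfamiliar with the term: a Monte Carlo simulation here simply means generating many random audit histories under the score-update rule and measuring, for example, how long a node with a given per-audit failure probability survives before disqualification. A minimal sketch, with all parameters as assumed example values:

```python
# Minimal Monte Carlo sketch: estimate how many audits a node survives on
# average for a given per-audit failure probability. lambda_, the threshold
# and the failure probability are assumed example values.
import random

def audits_survived(p_fail, lambda_=0.95, threshold=0.6, w=1.0, max_audits=10_000):
    alpha, beta = w / (1.0 - lambda_), 0.0   # start from a perfect history
    for n in range(1, max_audits + 1):
        if random.random() < p_fail:          # failed audit
            alpha, beta = lambda_ * alpha, lambda_ * beta + w
        else:                                 # successful audit
            alpha, beta = lambda_ * alpha + w, lambda_ * beta
        if alpha / (alpha + beta) < threshold:
            return n                          # disqualified at audit n
    return max_audits                         # survived the whole run

runs = [audits_survived(p_fail=0.3) for _ in range(500)]
print(sum(runs) / len(runs))                  # average audits until DQ
```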

I admit that I might have missed it. Do you have in mind a specific message showing the attack vector?

I do not want to give ideas on how to game the system. Your node can lie to the satellite and force it to hit the unknown-audit score instead of the audit score, and your node can survive even if it doesn’t have all pieces in place.
Since your node is not actually disqualified, but lies about having pieces and is not actually punished for that, the audit checker would have information suggesting that those pieces are most likely still available on your node (which they are not). And if there were not enough genuinely healthy pieces (your node would still answer with timeouts), the customer’s file could actually be lost.

If you do not want to discuss your arguments, all that is left to me is complaining.

Also, this means there are at least two known attacks on the integrity of the network. Not very encouraging.

I would like to introduce you to the concept of charitable interpretation, or charitable listening. It’s the idea that interpreting what someone is saying in the most reasonable way and assuming the best intentions leads to a much more valuable debate. This goes doubly when there is a language barrier, which is the case most of the time, as both the community and storjlings are spread out across the globe.

The way I understand it is that @Alexey said that using suspension instead of disqualification would open things up to abuse. That doesn’t mean the current implementation of suspension can be abused. I’m not even sure what second attack you were talking about.

Furthermore, I think @Alexey did his best to give an idea of what the problem was without giving specifics of the attack. He did back up his argument.

So I’ll ask you again to please keep the debate constructive. I think this thread has already helped get attention to this issue and develop ideas to fix it. I don’t think antagonizing those who are trying to help would be helpful. Keep in mind that within Storj Labs, @Alexey actually advocates for us SNOs.

5 Likes

The current implementation of suspension is not affected.
However, if we implement suspension for audit timeouts, it will be. This is related to how easily someone can simply say nothing after the satellite has asked them to provide a piece for audit.

I know my answer won’t satisfy you anyway, because you want to know exactly how it can be abused. And now I want to know: why do you want to know that? Do you want to try to use it?