Got Disqualified from saltlake

Toyoo · August 20, 2021, 6:34pm

Yeah, I actually remembered your post when typing one of the previous messages here. However, I believe it solves a different problem, not the one discussed here. Please consider that your simulations assume that subsequent audit failures are independent. This nicely models a hard drive with bad sectors. However, in the case of transient system-level problems, such as heavy swapping because another task hosted on the same box took all the memory and then some, you might fail every audit until the box is restarted.

You state there that your experimental choice of parameters would allow 40 audits in a row to fail. From my point of view, this is too much and too little at the same time. It’s too little when you consider that Storj now tells operators they can go on vacation, because having their nodes offline for 2 weeks is not a big problem. Telling them that they’d have 40 hours to fix a misbehaving node makes that statement much weaker. It’s too much because the node is still considered a candidate for fresh uploads/downloads, which will obviously fail the same way the audits do, making the customer experience worse.
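To illustrate the difference between independent and correlated failures, here is a minimal sketch. It assumes a score with exponential forgetting (`alpha` and `beta` both decay by a factor `lam` on every audit, score = `alpha / (alpha + beta)`) and a disqualification threshold of 0.6. The model, the parameter values and all names are assumptions made for illustration, not the satellite's actual code.

```python
def audit_scores(outcomes, lam=0.95, weight=1.0):
    """Yield the score after each audit (True = passed, False = failed).

    Assumed model: alpha and beta decay by `lam` on every audit and the
    score is alpha / (alpha + beta).
    """
    alpha, beta = weight / (1.0 - lam), 0.0  # start from a long clean history
    for passed in outcomes:
        alpha = lam * alpha + (weight if passed else 0.0)
        beta = lam * beta + (0.0 if passed else weight)
        yield alpha / (alpha + beta)


def lowest_score(outcomes, **kwargs):
    """Lowest score reached over the whole sequence of audits."""
    return min(audit_scores(outcomes, **kwargs))


# Same number of failures (10 out of 100 audits), arranged differently.
spread = [(i % 10) != 9 for i in range(100)]      # one failure every 10 audits
burst = [not (40 <= i < 50) for i in range(100)]  # 10 failures in a row

THRESHOLD = 0.6  # assumed disqualification threshold
print("spread out:", lowest_score(spread) < THRESHOLD)
print("one burst: ", lowest_score(burst) < THRESHOLD)
```

Under these assumptions the spread-out failures never bring the score near the threshold, while the same ten failures in a single burst push it just below.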

That’s why I like the idea from the other thread: it explicitly puts the node that is failing audits into a special state, not trusted enough to handle traffic, but still with hope for recovery.

Alexey · August 20, 2021, 6:48pm

I think it could be both.
However, if we do not go the mathematical way, then it should be expensive enough that nobody tries to abuse the timeout to avoid auditing. And I mean literally: you have to lose money, so that abusing the system this way is clearly a bad idea. I do think mathematics is the best solution, even though I am in the same boat: I recently migrated my Windows server to Ubuntu and now see a lot of “general protection fault” errors, which lead to exactly this situation. My own nodes are affected. They are not DQed yet, but I’m not sure that it can’t happen soon.

BrightSilence · August 20, 2021, 8:18pm

That may have been the initial goal, but the suggestion’s main focus is to give the score a longer memory. Even without tuning, the current system disqualifies a node after 10 consecutive failures, while my suggestion would require at least 40 consecutive failures. That already gives you 4x as much time to fix things. If that memory is further increased for nodes with a lot of data, you could easily tweak it to give those nodes 10x as much time. I would agree that it’s not a full solution and that it would be better for the node to protect itself. But you can’t argue that it wouldn’t help a lot already.
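A small sketch of how the length of the score's memory changes the number of consecutive failures a node can absorb. It assumes the same kind of exponentially forgetting score (`alpha`, `beta`, forgetting factor `lam`). The pair `lam=0.95`, threshold 0.6 is chosen only because it reproduces the 10-failure figure quoted above; the second pair is a hypothetical longer-memory setting, not necessarily the one proposed in the other topic.

```python
def failures_until_below(threshold, lam, weight=1.0):
    """Count consecutive failed audits needed to push the score below
    `threshold`, starting from a long history of successful audits.

    Assumed model: alpha and beta decay by `lam` on every audit and the
    score is alpha / (alpha + beta).
    """
    alpha, beta = weight / (1.0 - lam), 0.0
    failures = 0
    while alpha / (alpha + beta) >= threshold:
        alpha = lam * alpha           # old successes fade
        beta = lam * beta + weight    # the new failure is added
        failures += 1
    return failures


print(failures_until_below(threshold=0.6, lam=0.95))    # short memory
print(failures_until_below(threshold=0.96, lam=0.999))  # hypothetical long memory
```

The point of the sketch is only the direction of the effect: a forgetting factor closer to 1 makes each single audit matter less, so more failures in a row are needed even with a stricter threshold.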

Making this part dynamic makes it so that you can aim for a specific time frame to give nodes to repair issues. That should take care of the fact that it would otherwise take 2 weeks for smaller nodes and a few hours for larger nodes. The timeframe would be closer together for all nodes and you could aim at something like maybe 2 days for everyone.
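One way to read "dynamic" here, as a sketch under assumptions: if the score after `n` consecutive failures is `lam ** n` (true for the exponentially forgetting model assumed here when the node starts from a long clean history), then the forgetting factor can be solved from a target repair window and the node's audit rate. The audit rates below are made-up numbers for a small and a large node.

```python
def forgetting_factor_for(target_hours, audits_per_hour, threshold=0.6):
    """Forgetting factor for which `target_hours` of nothing but failed
    audits brings the score from 1.0 down to exactly `threshold`.

    Assumes score after n consecutive failures == lam ** n.
    """
    n = target_hours * audits_per_hour  # failed audits within the window
    if n <= 0:
        raise ValueError("the window must contain at least a fraction of an audit")
    return threshold ** (1.0 / n)


# Hypothetical audit rates: a small node and a large node, both given 2 days.
lam_small = forgetting_factor_for(target_hours=48, audits_per_hour=0.1)
lam_large = forgetting_factor_for(target_hours=48, audits_per_hour=10)
print(round(lam_small, 4), round(lam_large, 4))
```

Nodes that are audited more often get a factor closer to 1, so every node ends up with roughly the same wall-clock time before disqualification.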

I am not against putting nodes in a warning state earlier, before disqualifying them… in fact, I suggested exactly that at the bottom of that topic. However, it would require making the score more stable first, otherwise nodes would just constantly ping-pong in and out of that warning state.

Tuning audit scoring

Bonus suggestion
Now that there is a more consistent representation of node data quality, we can actually do something new. We could mark the pieces of a node with an audit score below the warning threshold (for example 95%) as unhealthy, and at the same time lower the incoming traffic for that node (such nodes could be part of the node selection process for unvetted nodes). This will result in data for that node slowly being repaired to other nodes, potentially reducing the troublesome data and fixing it, while at the same time lowering the risk of new data stored on that node getting lost. This will rebalance good and bad data on the node and will allow nodes that have resolved the underlying issue to slowly recover, while nodes that still have issues will keep dropping in score anyway and fail soon enough. This also provides an additional incentive to keep your node at a higher audit score, to keep all ingress and prevent losing data to repair.

I wouldn’t advocate for that. Having money at stake is one of the reasons I never got started with Sia. Maybe if it’s losing income, like temporarily holding back the held amount again or something. That might work. But I don’t want to hide away an issue with just monetary consequences instead of disqualifications. In my opinion it should be possible to prevent honest nodes from having those bad consequences altogether. The way I see it, right now honest nodes are at a disadvantage, because someone abusing the system is definitely just going to take their node offline to postpone being disqualified, yet honest nodes currently don’t do that. So let’s try to make the node protect itself first.

Alexey · August 21, 2021, 7:12am

That’s the goal - to make nodes protect themselves. Suspension is not a solution as far as I understand, especially after it was shown how easily it could be abused…
I’m against putting money at stake too, but at the moment only that can stop bad actors from abusing the suspension.
It would be cool if we could do it with math alone; that would be the best option.

allenyllee · August 21, 2021, 9:21am

Maybe we can borrow the idea of a “proof of stake” blockchain, like Cardano?

For example, satellites can monitor the wallet provided by the node operator, which receives their rewards. If the node operator doesn’t withdraw any money and just stakes it, then the larger the amount of money * holding time staked, the more the satellite can trust that node.

If the node operator needs that money, he can just withdraw it, but his node’s trust score may become lower. If in the meantime his node gets audit failures for any reason, it will be disqualified sooner.

But if he deposits a lot of money into that wallet, his trust score may become higher, and the node will not be disqualified as soon. Note that he may still get disqualified if he tries to abuse the system for a longer time. The staked money only affects how soon the node gets disqualified, not whether it should be. Here we assume that a bad actor can’t afford the cost of borrowing that amount of money to abuse the system for a long time, because it would cost more than the rewards he is trying to get.

Alexey · August 21, 2021, 9:33am

We cannot stake or do anything else with tokens. They are not securities or investments, they are utility tokens, and their only purpose is to simplify payments.
So this idea will not be implemented; it would go against the purpose of the token.
We also do not hold tokens as collateral. All calculations are made in USD anyway.

The idea of buying reputation was posted here once before, in the topic “Buying reputation”. Even the Community did not agree with it.

Toyoo · August 21, 2021, 10:56am

Please note that @BrightSilence’s simulation is math: a Monte Carlo method for estimating the parameters of a probability distribution. He has already suggested a model for analysis. It’s not that we’re trying to avoid math.

I admit that I might have missed it. Do you have in mind a specific message showing the attack vector?

Alexey · August 21, 2021, 1:16pm

I do not want to give ideas on how to game the system. Your node can lie to the satellite and force it to hit the unknown audit score instead of the audit score, so your node can survive even if it doesn’t have all pieces in place.
Since your node is not actually disqualified, but lies about having the pieces and is not punished for that, the audit checker would have information that the pieces are, with high probability, still available on your node (which they are not). And if there are not enough really healthy pieces (your node will still answer with timeouts), the customer’s file may actually be lost.

Toyoo · August 21, 2021, 4:35pm

If you do not want to discuss your arguments, all that is left to me is complaining.

Also, this means there are at least two known attacks on the integrity of the network. Not very encouraging.

BrightSilence · August 21, 2021, 6:14pm

I would like to introduce you to the concept of charitable interpretation or charitable listening. It’s the idea that interpreting what someone is saying in the most reasonable way and assuming the best intentions leads to much more valuable debate. This goes doubly if there is a language barrier in place, which is the case most of the time, as both the community and the storjlings are spread out across the globe.

The way I understand it is that @Alexey said that using suspension instead of disqualification would open things up to abuse. That doesn’t mean the current implementation of suspension can be abused. I’m not even sure what second attack you were talking about.

Furthermore, I think @Alexey did his best to give an idea of what the problem was without giving specifics of the attack. He did back up his argument.

So I’ll ask you again to please keep the debate constructive. I think this thread has already helped get attention to this issue and develop ideas to fix it. I don’t think antagonizing those who are trying to help would be helpful. Keep in mind that within Storj Labs, @Alexey actually advocates for us SNOs.

Alexey · August 21, 2021, 6:23pm

The current implementation of suspension is not affected.
However, if we implement suspension for audit timeouts, it will be. This is related to how easily someone can say nothing after the satellite has asked them to provide a piece for an audit.

I know my answer won’t satisfy you anyway, because you want to know exactly how it can be abused. And now I want to know: why do you want to know that? Do you want to try to use it?

BrightSilence · August 21, 2021, 6:44pm

That doesn’t sound like the most charitable interpretation either. I don’t think this is the case and perhaps more to the point, it doesn’t matter whether a single individual is trying to use an exploit. Whether they are or not, there shouldn’t be anything to exploit to begin with. Which luckily in this case is already the case.

Alexey · August 21, 2021, 6:56pm

Yes, for now. But if we implement it, it will become vulnerable,
because distinguishing between a hanging node and a malicious node is not possible on the satellite side.
So there should be some additional protection. Maybe the best approach is to cache which pieces were not provided and ask for them on every audit until the node responds with either a correct piece, a corrupted piece, or an error that will be interpreted as “lost”, so that a “timeout” is not an answer.
However, it would demand a lot of space to save this for each potentially malicious node (which could be an attack vector in itself, by the way).
So we need something else.
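A rough sketch of the cache idea described above, to make the trade-off concrete. Everything here is hypothetical (`PendingAudits`, `Answer`, the `max_per_node` cap); the cap is one possible answer to the space concern, not something proposed in the thread.

```python
from enum import Enum


class Answer(Enum):
    CORRECT = "correct"
    CORRUPTED = "corrupted"
    LOST = "lost"
    TIMEOUT = "timeout"


class PendingAudits:
    """Remember pieces a node timed out on and re-ask until the node gives
    an unambiguous answer (anything except TIMEOUT)."""

    def __init__(self, max_per_node=1):
        self.max_per_node = max_per_node  # bounds memory used per node
        self._pending = {}                # node_id -> list of piece ids

    def next_piece(self, node_id, random_piece):
        """Piece to audit next: a pending one if any, else the random pick."""
        pending = self._pending.get(node_id)
        return pending[0] if pending else random_piece

    def record(self, node_id, piece_id, answer):
        """Update the cache with the outcome of one audit."""
        pending = self._pending.setdefault(node_id, [])
        if answer is Answer.TIMEOUT:
            if piece_id not in pending and len(pending) < self.max_per_node:
                pending.append(piece_id)
        elif piece_id in pending:
            pending.remove(piece_id)       # unambiguous answer received
        if not pending:
            self._pending.pop(node_id, None)  # keep no entry for clean nodes


cache = PendingAudits()
cache.record("node-1", "piece-a", Answer.TIMEOUT)
print(cache.next_piece("node-1", "piece-b"))  # the timed-out piece is asked again
```

With the cap, a node that keeps timing out occupies a fixed amount of satellite memory, at the price of tracking only its first unanswered piece(s).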

Iigloo · August 21, 2021, 7:01pm

Storj is not a flexible system for the operators. I understand that a certain set of requirements needs to be fulfilled. For me, I only run a node because I think it is fun. If I get disqualified, then I get disqualified. I agree with @BrightSilence, security shouldn’t depend on obscurity, it should depend on design.

Alexey · August 21, 2021, 7:11pm

I’m thinking about reverting the fix and stopping all other audits for that node until it gives an unambiguous answer for that piece. During that time the unknown score is reduced; once it is below 0.6, stop egress, stop ingress and do not pay for storage. If the unknown score reaches 0, start reducing the audit score. As soon as that becomes lower than 0.6, disqualify the node.
I think it could take about 24 hours.
If @BrightSilence’s suggestion regarding the audit score is implemented, up to two days.

BrightSilence · August 21, 2021, 8:10pm

That does sound like a good idea. But I do wonder if HDD corruption could lead to it struggling to read specific sectors and timing out because of that. In theory that could get a node stuck in suspension for quite a while because of a single corrupt piece. Maybe instead of waiting for 0 suspension score it could wait until 50 and then just fail the audit. At that point the SNO has been alerted about the suspension. But still there would be very little time left to fix the issue. I guess that’s not great either.

Another option would be to give the SNO a choice to fail the audit and as long as they don’t, just keep trying the same piece. The only way to get out of suspension then is to either fail the audit or fix the underlying issue and respond with correct data. It would take a while to recover from a score of 0 though.

And just as a side note. The score will never really hit absolute 0. Or at least not for a very long time, because of the memory of old scores. So either way you would probably want to set any limit to switch to audit failures at some point higher than 0. Maybe 10 or something.

Alexey · August 22, 2021, 6:51am

Ok, how about this:

1. Keep auditing the chunk that returned the timeout until the unknown score drops below 0.6.
2. Stop ingress and stop paying for audits, egress and storage; the pieces are considered unhealthy (I would call this state frozen). Send a warning.
3. Reduce the audit score and move on.
4. The node is still frozen and will receive audit requests; egress is possible but is not paid either.
5. If the next audit finishes with a positive result, increase the unknown score.
6. If the unknown score grows above 0.6, unfreeze the node.
7. If the next audit finishes with a timeout, repeat from p.1 until the unknown score falls by another 0.4 or gets close to zero (lower than or equal to 0.2). Send a warning.
8. Reduce the audit score and move on (the node is still frozen).
9. After the unknown score is exhausted, the usual containment process starts, with 3 attempts to audit the same piece and a reduction of the audit score in the negative case.
10. The node will remain frozen until the unknown score grows above 0.6.

Maybe just stopping ingress and egress while the node is frozen would be simpler to implement.
For example, not paying means that we should not account for the storage, egress and audits. These counters are tightly tied to customers’ accounts. Perhaps that is too much to change.

Toyoo · August 22, 2021, 9:47am

Because without this knowledge I don’t know how to make the proposal from the other thread better.

From my side your argument is essentially «I don’t like it, but I won’t tell you why». That’s not a constructive argument that can be worked with.

BrightSilence · August 22, 2021, 11:19am

Yes, I could definitely see that working. It would slow down the disqualification without giving abusers a way to get out of an audit and also give an earlier warning due to disqualification. It checks all the boxes I think.

What I don’t like about it is the added complexity and having to track different statuses. Like: Is this the first, second or third or higher piece that is timing out, with different rules for each scenario. And when do we decide to reset that counter?
I think I may have a suggestion that builds on what you just described, serves the same purpose but comes with a lot less complexity, by reusing systems already in place and generalizing the rules.

Basically it would be a tweak to how containment works.

When an audit times out, enter containment mode.
Each timed out audit in containment mode is considered an unknown failure and reduces the unknown score.
Keep auditing the same stripe up to 10 times, reducing the unknown score 9 times, but at the 10th time, give up on that piece, count it as an audit failure and hit the audit score. Then exit containment.

This process is very similar to the current containment implementation but with more tries, with the only difference that it doesn’t just ignore timeouts a few times, but counts them as unknown failures instead. The rest of the systems could be kept exactly as is. Small tweak, but basically gets the same result as what you suggested.
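The three rules above can be sketched as a tiny state machine. This is only an illustration of the suggestion; `Containment`, `on_audit_timeout` and the event names are hypothetical and not taken from the satellite code.

```python
from dataclasses import dataclass

MAX_TRIES = 10  # tries per piece, as suggested above


@dataclass
class Containment:
    piece_id: str      # the piece that keeps being re-audited
    timeouts: int = 0  # timed-out audits seen for this piece so far


def on_audit_timeout(containment, piece_id):
    """Handle one timed-out audit.

    Returns (new_containment, event). `event` is "unknown_failure" for the
    first MAX_TRIES - 1 timeouts on a piece and "audit_failure" for the last
    one, after which containment is left (new_containment is None).
    """
    if containment is None:
        containment = Containment(piece_id)  # enter containment mode
    containment.timeouts += 1
    if containment.timeouts < MAX_TRIES:
        return containment, "unknown_failure"  # hits the unknown score
    return None, "audit_failure"               # hits the audit score, exit


state, events = None, []
for _ in range(MAX_TRIES):
    piece = state.piece_id if state else "piece-a"  # same piece while contained
    state, event = on_audit_timeout(state, piece)
    events.append(event)
print(events)
```

Any answer other than a timeout would simply be scored as usual and end containment; that branch is left out to keep the sketch short.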

At the current settings, this would allow you to run into this issue with a single piece and not really run into any suspension or disqualification. (I believe you currently need 10 unknown failures to be suspended) But as soon as a second one starts timing out, you will hit suspension and get alerted. Probably fair right? Since if it is rare corruption, the chances of a second piece failing sequentially should be basically 0.

The result would be that you get a suspension (and warning email connected to that) in 1/3rd the time it now takes to disqualify, while at the same time allowing for over 3x as much time to resolve the issue. Without giving anyone the chance to avoid having to cough up the data for an audit.
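The timing claim can be checked with simple counting, under stated assumptions: both scores cross their threshold after 10 consecutive failures (the figures quoted in this thread), every audit times out, and the current containment gives up on a piece at the 3rd timeout without touching the unknown score. `timeouts_until` is a hypothetical helper; it uses plain counters rather than real scores.

```python
FAILURES_TO_CROSS = 10  # consecutive failures to cross a threshold (assumed)


def timeouts_until(tries_per_piece, count_unknown):
    """Count timed-out audits until suspension and until disqualification.

    The last try on each piece is an audit failure; earlier tries are
    unknown failures if `count_unknown` is set, otherwise they are ignored.
    Returns (suspended_at, disqualified_at); suspended_at may be None.
    """
    unknown = audit = timeouts = 0
    suspended_at = disqualified_at = None
    while disqualified_at is None:
        for attempt in range(1, tries_per_piece + 1):
            timeouts += 1
            if attempt == tries_per_piece:
                audit += 1
                if audit == FAILURES_TO_CROSS:
                    disqualified_at = timeouts
            elif count_unknown:
                unknown += 1
                if unknown == FAILURES_TO_CROSS:
                    suspended_at = timeouts
    return suspended_at, disqualified_at


current = timeouts_until(tries_per_piece=3, count_unknown=False)
proposed = timeouts_until(tries_per_piece=10, count_unknown=True)
print("current: ", current)
print("proposed:", proposed)
```

With these counters, suspension under the proposal arrives at the 11th timeout, about a third of the 30 timeouts the assumed current rules need for disqualification, and disqualification itself moves out to 100 timeouts.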

If my other suggestion is implemented, the dynamics change a little. You could have about 3 consecutive failures before you get suspended. I think that would still be fine, especially for larger nodes. By making it dynamic based on node size, this could be tuned further.

So then what’s left is to decide whether the consequences of suspension should change from what they are now. I believe that currently egress still happens on suspended nodes. In these scenarios where nodes are stuck, all that does is guarantee a failed transfer among the bunch, which hurts customer experience if it happens too often. So that’s probably not a great idea. I would say this probably goes for other scenarios of suspension as well. So maybe just get rid of egress for suspended nodes to begin with. Why keep giving them good egress income if they are not living up to the reliability requirements. They’ll get egress again when they’ve resolved the issue and have recovered their scores. So I would suggest audits only at that point. Marking pieces as unhealthy so they get picked up for repair. This gives an added incentive to fix things quickly, because you are losing data to repair and future income as a result.

What do you think?

Alexey · August 22, 2021, 12:49pm

The current system already counts a suspended node as unhealthy in the repair worker (ergo its pieces are considered lost), so this checkbox is already ticked. The audit score could be affected if the repair worker hits this exact piece (as far as I understand, the repair workers don’t share their logic with the audit workers; their logic is much stricter than that of the audits).

But you are right, it allows egress. This is why I suggested to either not allow egress or not pay for it (to discourage abuse). Perhaps the first is better, so that customers are not affected by timeouts.