That depends on what you wanted your statement to mean. I assume you don’t mean that exactly four 1 kB pieces, each not received within 5 minutes, will lead to node disqualification; rather, other factors are also involved in the disqualification decision, but you don’t want to discuss them right now. So all you need is to make the statement less definite by adding wording that at least suggests these other factors exist.
«proof», «reason» are strong words, they imply certainty. «symptom» or «sign» imply an indication of something happening, but without the nuance that they’re the only factors taken into account.
So, maybe something like: «Storj considers failing to provide a 1 kB chunk of data within 5 minutes a sign of the node being currently unreliable and hence a candidate for disqualification in the future.»
But if indeed you wanted to state that four failed audits should lead to disqualification, then you expressed that pretty much to the point.
Mathematical algorithms show that failure to provide a 1K chunk within 5 minutes in a repetitive manner indicates that the node is becoming unreliable and will be disqualified as a result of continued failures.
Storj engineers think that in some cases (“old node, which suddenly started to fail audits because of timeouts”) failing audits because of timeouts is not an attempt to abuse the audit system, and they want to avoid quick disqualification of such nodes without affecting the reliability of the network.
We have not yet found a way to achieve that.
Thanks for this message, I really appreciate the switch in tone!
I think I can give some clarification on this part. @Alexey linked the containment blueprint in an earlier message, but that’s a lot to read. However, this is what it comes down to. If your node is unable to respond to an audit within 5 minutes, it isn’t immediately seen as a failed audit. Instead whenever the next audit happens for your node, it will request the same stripe again. This ensures you can’t use a timeout to get out of having to verify you still have that piece. You again get 5 minutes to respond. I believe you get a total of 3 retries, but I’m not entirely sure of that number. If you time out on all tries, then it will count as a single audit failure. From that point on audits will request a new stripe again. This wouldn’t cause a disqualification, just a single audit failure. You’d need quite a few more to actually be disqualified. Problem is that nodes with lots of data get quite frequent audits. So even if it takes 4 timed out audits to count as a single failure, you can still rack up those failures pretty fast.
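To make the retry logic above concrete, here is a minimal sketch of how I understand it. It is not the satellite's actual code: the names `NodeAuditState`, `next_stripe` and `record_result` are made up for illustration, and the limit of 3 retries is the number I said I wasn't sure about.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: 3 retries after the first timeout (the post is not sure of this number).
MAX_RETRIES = 3


@dataclass
class NodeAuditState:
    pending_stripe: Optional[str] = None  # stripe the node still owes an answer for
    timeouts: int = 0                     # timeouts so far on the pending stripe
    audit_failures: int = 0
    audit_successes: int = 0


def next_stripe(state: NodeAuditState, fresh_stripe: str) -> str:
    """A node with a pending stripe is asked for that same stripe again."""
    if state.pending_stripe is not None:
        return state.pending_stripe
    return fresh_stripe


def record_result(state: NodeAuditState, stripe: str, outcome: str) -> None:
    """Record one audit outcome: 'ok', 'bad' or 'timeout' (hypothetical labels)."""
    if outcome == "timeout":
        state.pending_stripe = stripe
        state.timeouts += 1
        if state.timeouts > MAX_RETRIES:
            # First try plus all retries timed out: one single audit failure.
            state.audit_failures += 1
            state.pending_stripe = None
            state.timeouts = 0
        return
    if outcome == "ok":
        state.audit_successes += 1
    else:
        state.audit_failures += 1
    # Any definite answer releases the node from the pending stripe.
    state.pending_stripe = None
    state.timeouts = 0
```

With these assumptions, four timeouts in a row on the same stripe add up to exactly one audit failure, after which the next audit picks a new stripe.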
I previously posted a suggestion to tune audit scoring here: Tuning audit scoring
I think this suggestion could be further tuned by raising the lambda value based on the amount of data stored. The more data you have, the more the memory of previous successful audits weighs into the formula. This would essentially remedy the inverse relation between long time loyal node and short time to disqualify. These nodes also have much more on the line, so messing with the system on such nodes is just a very bad idea to begin with. If you’re going to try and cheat the system you wouldn’t do that with a node with so much income. Just a suggestion, but I think this would help a lot with this problem without exposing the satellites to more risk.
Yeah, I actually remembered your post when typing one of the previous messages here. However, I believe it solves a different problem, not the one discussed here. Please consider that in your simulations you are assuming independence between subsequent audit failures. This nicely models a hard drive having bad sectors. However, in the case of transient system-level problems, like heavy swapping because another task hosted on the same box took all memory and then some, you might fail all audits until the box is restarted.
You state there that your experimental choice of parameters would allow 40 audits in a row to be failed. From my point of view, this is too much and too little at the same time. It’s too little when you consider that Storj now tells operators they can go on vacation, because if their nodes go offline for 2 weeks, that’s not a big problem. Stating then that they’d have 40 hours to fix a misbehaving node makes that statement much weaker. It’s too much because the node is still considered a candidate for fresh uploads/downloads, which will obviously fail the same way audits do, making the customer experience worse.
That’s why I like the idea from the other thread: it explicitly puts the node that is failing audits into a special state, not trusted enough to handle traffic, but still with hope for recovery.
I think it could be both.
However, if we do not go the mathematical way (I think mathematics is the best solution, despite the fact that I am in the same boat - I recently migrated my Windows server to Ubuntu, and I have a lot of “general protection fault” errors, by the way, they lead to exactly this situation, and my own nodes are affected, they are not DQed yet, but I’m not sure if that can’t happen anytime soon), then it should be expensive enough not to try to abuse the timeout to avoid auditing. And I mean literally - you have to lose money to make sure this is a bad way to abuse the system.
That may have been the initial goal, but the suggestion’s main focus is to give the score a longer memory. Even without tuning, the current system disqualifies a node after 10 consecutive failures while my suggestion would require at least 40 consecutive failures. That already gives you 4x as much time to fix things. If that memory is further increased for nodes with a lot of data, you could easily tweak it to give those nodes 10x as much time. I would agree that it’s not a full solution and it would be better for the node to protect itself. But you can’t argue that that wouldn’t help a lot already.
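Here is a rough sketch of what "longer memory" means. I am assuming a forgetting-factor score of the form alpha / (alpha + beta), where every audit first multiplies both values by lambda. The value lambda = 0.95 is my assumption, picked because together with the 0.6 threshold it reproduces the 10 consecutive failures mentioned above. The function name is made up.

```python
def consecutive_failures_until_dq(lam: float, threshold: float, weight: float = 1.0) -> int:
    """Count consecutive failed audits until the score drops below the threshold.

    Assumed model (not taken from the satellite source): each audit decays
    alpha and beta by lam; a failure then adds `weight` to beta, and
    score = alpha / (alpha + beta).
    """
    alpha = weight / (1.0 - lam)  # steady state after a long run of successes
    beta = 0.0
    failures = 0
    while alpha / (alpha + beta) >= threshold:
        alpha = lam * alpha          # a failure adds nothing to alpha
        beta = lam * beta + weight   # and the full weight to beta
        failures += 1
    return failures


if __name__ == "__main__":
    print(consecutive_failures_until_dq(0.95, 0.6))   # the assumed current settings
    print(consecutive_failures_until_dq(0.999, 0.6))  # same threshold, longer memory
```

Starting from a perfect score, the score after n straight failures in this model is simply lambda to the power n, so pushing lambda closer to 1 directly buys more failures before the threshold is crossed.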
Making this part dynamic makes it so that you can aim for a specific time frame to give nodes to repair issues. That should take care of the fact that it would otherwise take 2 weeks for smaller nodes and a few hours for larger nodes. The timeframe would be closer together for all nodes and you could aim at something like maybe 2 days for everyone.
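As a back-of-the-envelope sketch of the dynamic part: if I assume the score after n consecutive failures from a perfect score is lambda to the power n (a forgetting-factor model, which is my assumption here), then you can solve for the lambda that gives every node roughly the same grace period. The function name and the audit rates below are invented for illustration.

```python
import math


def lambda_for_grace_period(audits_per_hour: float, grace_hours: float, threshold: float) -> float:
    """Pick lambda so that failing every audit for `grace_hours` just reaches the threshold.

    Assumes score == lambda ** n after n consecutive failures from a perfect score.
    """
    n = max(1.0, audits_per_hour * grace_hours)  # audits expected inside the window
    return threshold ** (1.0 / n)


if __name__ == "__main__":
    # Hypothetical audit rates; nodes with more data get audited more often.
    for name, rate in [("small node", 0.5), ("large node", 10.0)]:
        lam = lambda_for_grace_period(rate, grace_hours=48, threshold=0.6)
        print(f"{name}: lambda = {lam:.6f}")
```

The node with more audits per hour ends up with a lambda closer to 1, which is the "more data, more memory" idea, while both nodes get about the same 2 days to fix things.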
I am not against putting nodes in a warning state earlier, before disqualifying them… in fact, I suggested exactly that at the bottom of that topic. However, it would require making the score more stable first, otherwise nodes would just constantly pingpong in and out of that warning state.
I wouldn’t advocate for that. Having money at stake is one of the reasons I never got started with Sia. Maybe if it’s losing income, like temporarily holding back the held amount again or something. That might work. But I don’t want to hide away an issue with just monetary consequences instead of disqualifications. In my opinion it should be possible to prevent honest nodes from having those bad consequences altogether. The way I see it, right now honest nodes are at a disadvantage, because someone abusing the system is definitely just going to take their node offline to postpone being disqualified, yet honest nodes currently don’t do that. So let’s try to make the node protect itself first.
That’s the goal: to make nodes protect themselves. Suspension is not a solution as far as I understand, especially after it was shown how easily it could be abused…
I’m against putting money at stake too, but at the moment only that can stop bad actors from abusing the suspension.
It would be cool if we could do it with math alone; that would be the best option.
Maybe we can borrow the idea from “proof of stake” blockchains such as Cardano?
For example, satellites could monitor the wallet in which the node operator receives rewards. If the operator does not withdraw any money and just stakes it, then the larger the amount multiplied by the time it is held, the more the satellite can trust that node.
If the node operator needs that money, he can simply withdraw it, but his node’s trust score may drop; if in the meantime the node gets audit failures for any reason, it will be disqualified sooner.
But if he deposits a lot of money into that wallet, his trust score may rise, and the node will not be disqualified as quickly. Note that it may still be disqualified if he tries to abuse the system for a longer time. The staked money only affects how soon the node is disqualified, not whether it should be. Here we assume that a bad actor can’t afford to borrow that amount of money to abuse the system for a long time, since it would cost more than the rewards he is trying to get.
We cannot stake tokens or do anything like that with them. They are not securities or investments; they are utility tokens whose only purpose is to simplify payments.
So this idea will not be implemented; it would go against the purpose of the token.
We also do not hold tokens as collateral. All calculations are made in USD anyway.
The idea of buying reputation was posted here once before (Buying reputation), and it was not accepted even by the Community.
Please note that @BrightSilence’s simulation is math—a Monte Carlo method for estimating parameters of a probabilistic distribution. He has already suggested a model for analysis. It’s not that we’re trying to avoid math.
I admit that I might have missed it. Do you have in mind a specific message showing the attack vector?
I do not want to give ideas on how to game the system. Your node can lie to the satellite and force it to hit the unknown audit score instead of the audit score, so your node can survive even if it doesn’t have all pieces in place.
Since your node is not actually disqualified, but lies about having pieces and is not punished for it, the audit checker would record that the pieces are, with high probability, still available on your node (which they are not). And if there were not enough really healthy pieces left (your node would still answer with timeouts), the customer’s file could actually be lost.
I would like to introduce you to the concept of charitable interpretation or charitable listening. It’s the idea that interpreting what someone is saying in the most reasonable way and assuming the best intentions leads to much more valuable debate. This goes doubly if there is a language barrier in place. Which most of the time is the case as both the community as well as storjlings are spread out across the globe.
The way I understand it is that @Alexey said that using suspension instead of disqualification would open things up to abuse. That doesn’t mean the current implementation of suspension can be abused. I’m not even sure what second attack you were talking about.
Furthermore, I think @Alexey did his best to give an idea of what the problem was without giving specifics of the attack. He did back up his argument.
So I’ll ask you again to please keep the debate constructive. I think this thread has already helped get attention to this issue and develop ideas to fix it. I don’t think antagonizing those who are trying to help would be helpful. Keep in mind that within Storj Labs, @Alexey actually advocates for us SNOs.
The current implementation of suspension is not affected.
However, if we implement suspension because of audit timeouts, it will be. This is related to how easily someone can simply say nothing after the satellite has asked them to provide a piece for audit.
I know my answer won’t satisfy you anyway, because you want to know exactly how it can be abused. And now I want to know: why do you want to know that? Do you want to try to use it?
That doesn’t sound like the most charitable interpretation either. I don’t think this is the case and perhaps more to the point, it doesn’t matter whether a single individual is trying to use an exploit. Whether they are or not, there shouldn’t be anything to exploit to begin with. Which luckily in this case is already the case.
Yes, for now. But if we implement it, it will become vulnerable.
That is because the satellite cannot distinguish between a hanging node and a malicious node.
So there should be some additional protection. Maybe the best option is to cache which pieces were not provided and ask for them on every audit until the node responds with a correct piece, a corrupted piece, or an error that will be interpreted as “lost”, so that a “timeout” is not an answer.
However, it would take a lot of space to store this for each potentially malicious node (which could be an attack vector in itself, by the way).
So we need something else.
Storj is not a flexible system for operators. I understand that certain parameters need to be fulfilled. For me, I only run a node because I think this is fun. If I get disqualified, then I get disqualified. I agree with @BrightSilence, security shouldn’t depend on obscurity, it should depend on design.
I’m thinking about reverting the fix and stopping all other audits for that node until it gives an unambiguous answer for that piece. While that is pending, reduce the unknown score; once it drops below 0.6, stop egress, stop ingress, and do not pay for storage. If the unknown score reaches 0, start reducing the audit score. As soon as that drops below 0.6, disqualify the node.
I think it could take about 24 hours.
If we implement @BrightSilence’s suggestion regarding the audit score, up to two days.
That does sound like a good idea. But I do wonder if HDD corruption could lead to it struggling to read specific sectors and timing out because of that. In theory that could get a node stuck in suspension for quite a while because of a single corrupt piece. Maybe instead of waiting for 0 suspension score it could wait until 50 and then just fail the audit. At that point the SNO has been alerted about the suspension. But still there would be very little time left to fix the issue. I guess that’s not great either.
Another option would be to give the SNO a choice to fail the audit and as long as they don’t, just keep trying the same piece. The only way to get out of suspension then is to either fail the audit or fix the underlying issue and respond with correct data. It would take a while to recover from a score of 0 though.
And just as a side note. The score will never really hit absolute 0. Or at least not for a very long time, because of the memory of old scores. So either way you would probably want to set any limit to switch to audit failures at some point higher than 0. Maybe 10 or something.
1. Audit the chunk that returned the timeout until the unknown score drops below 0.6.
2. Stop ingress and stop paying for audits, egress and storage; the pieces are considered unhealthy (I would call that state frozen). Send a warning.
3. Reduce the audit score and move on.
4. The node is still frozen and will receive audit requests; egress is possible but not paid either.
5. If the next audit finishes with a positive result, increase the unknown score.
6. If the unknown score grows above 0.6, unfreeze the node.
7. If the next audit finishes with a timeout, repeat from step 1 until the unknown score falls by another 0.4 or becomes close to zero (0.2 or lower). Send a warning.
8. Reduce the audit score and move on (the node is still frozen).
9. After the unknown score is exhausted, the usual containment process starts, with 3 attempts to audit the same piece and a reduction of the audit score in the negative case.
10. The node will remain frozen until the unknown score grows above 0.6.
Maybe just stopping ingress and egress while the node is frozen would be simpler to implement.
For example, not paying means that we would have to stop accounting for storage, egress, and audits. These counters are tightly tied to customers’ accounts. Perhaps that is too much to change.