OR, instead of irreversible DQ, the node gets suspended after failing X audits. The operator can apply for un-suspension, but then the satellite checks for those same files that the node failed the first time.
However, the node operator then knows which files are relevant, and a malicious actor might use this information to play games. I don’t know, but maybe.
So I think an approach where a node operator does not know what will get audited is required.
Something else should be audited as well, but if the claim is that “all the files are there, just the USB cable fell out” or similar, then the node should be able to produce the files it failed before.
If those same files were not rechecked, then a node that has actually lost them could still pass the other audits.
I don’t have a problem with nodes being disqualified for actually losing data. My problem is nodes that are disqualified for timeouts or some system problems without giving reasonable time for the operator to fix those problems. If Storj wants the node operators to be regular people instead of datacenters, then there should be no expectation of a datacenter-like reaction time. I do not have staff on-call when I go to sleep or go on vacation, even though my setup is probably one of the more datacenter-like otherwise.
What I am saying is that it might not be enough to recheck only those pieces where a malicious actor can simply pull them from a log and knows what is going to be audited to get reinstated.
To get a node that lost its reputation back into the network there probably needs to be more.
Not because of the good participants, where indeed a loose USB cable got fixed and the node is in perfect shape. But because of the bad actors who might tamper with the data.
Sure, but rechecking the same files would prevent someone who actually lost them from being reinstated. Additional checks of previously-unchecked files may be needed as well.
well, shutting down a node before it gets DQed is completely within the scope of what node operators can do, using logs and scripts…
but doing something as simple as: if 3 audits fail within 1 hour (or whatever), then just shut down the node…
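a minimal sketch of what such a watchdog could look like, assuming the node runs in a container named "storagenode" and that audit failures show up in the log as lines containing "GET_AUDIT" and "failed" — check your own log format before relying on any of this, the names here are illustrative, not official tooling:

```python
import time
from collections import deque

WINDOW_SECONDS = 3600   # look at the last hour
MAX_FAILURES = 3        # shut down after 3 failed audits in the window

class AuditWatchdog:
    """Stop the node once too many audit failures land in a time window."""

    def __init__(self, stop_node, window=WINDOW_SECONDS, limit=MAX_FAILURES):
        self.stop_node = stop_node    # callback that actually stops the node
        self.window = window
        self.limit = limit
        self.failures = deque()       # timestamps of recent audit failures

    def handle_log_line(self, line, now=None):
        """Record an audit failure and stop the node if too many pile up."""
        if "GET_AUDIT" not in line or "failed" not in line:
            return
        now = time.time() if now is None else now
        self.failures.append(now)
        # Drop failures that fell out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.limit:
            # Stopping risks only downtime suspension, which is
            # recoverable; disqualification is not.
            self.stop_node()

# In a real deployment stop_node would shell out, e.g.
#   subprocess.run(["docker", "stop", "storagenode"])
# and the lines would come from `docker logs -f storagenode`.
```

the log-matching strings and container name are assumptions; the only point is the sliding-window counter.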
i thought about doing this early on, because i was worried about immediate DQ if something went wrong…
but lots of things have gone wrong, and i haven’t seen any signs of this being a real issue… i still think it’s more people being worried, and people being justifiably DQed but not understanding that their storage solution or hardware was flawed.
but if we as SNOs introduce such mechanics into the network, it is a bad thing for the network, as nodes would be near impossible to DQ, which might have unforeseen effects.
however it’s just a matter of time i suppose until somebody publishes a script for it.
I’m not worried about that. If there is a real issue, the node would either have to stay offline or continue to fail audits. It wouldn’t be able to avoid DQ anyway. And if the node is offline, the data will also be marked as unhealthy and repair will already kick in. So the network will be completely fine either way.
Before the readability and writability checks were in place, sudden DQ did happen. So I understand people being worried. By far the biggest cause has been fixed, but a stuck IO system can still cause issues. That is already in the process of being fixed too. I think after that we may have seen the last of this. And if not, we’ll keep an eye out for it and make sure Storj Labs knows if there are other issues to address.
It sounds like we have a consensus, at least as far as the present suggestion is concerned. I’ll get it implemented!
The question here is: should a node that has provably lost files be allowed back into the network?
I think the idea of reinstating a node back into the network should cover cases where files have not been lost.
I don’t think it should. That’s why the recheck of the same files. If reinstatement only required passing a lot of audits of random files (not the same ones that previously failed) then a node that has actually lost files could be reinstated.
If the requirement is to pass the same audits that failed before, then only a node that has not actually lost data (USB cable fell out etc) could be recovered.
Unless, of course, the failed audits were because of a node or satellite software bug (satellite deletes a file and then tries to audit it etc).
But that is not my idea. Random file check has to be additionally to checking of the previous audited files. What I am saying is that only checking the previous audited files is not enough because a bad node operator knows what files will be checked and could act according to that knowledge. This could open a path to game the system.
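A minimal sketch of that combined audit set, with entirely hypothetical names (this is not the satellite’s actual API): every previously failed piece must be rechecked, plus a random sample the operator cannot predict.

```python
import random

def reinstatement_audit_set(failed_pieces, all_pieces, extra=100, seed=None):
    """Pick the pieces to audit before un-suspending a node.

    Every previously failed piece is rechecked (a node that really lost
    them cannot pass), plus a random sample of other pieces, so a bad
    actor cannot prepare just the known set.  Names and the size of the
    random sample are illustrative assumptions.
    """
    rng = random.Random(seed)
    failed = set(failed_pieces)
    others = [p for p in all_pieces if p not in failed]
    sample = rng.sample(others, min(extra, len(others)))
    return list(failed_pieces) + sample
```

The failed pieces close the “actually lost data” path; the unpredictable sample closes the “prepare only the known files” path.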
We both agree then and it was a misunderstanding.
In my humble opinion, it depends how many files were lost: if “natural” bitrot killed a handful of files out of zillions of files, I don’t think that’s such an issue as long as the rest of the disk is fine.
I think a slight tolerance is in order here to avoid DQing a node just because it lost 2 files out of the whole node. On TBs of data, this can happen (bitrot, sudden power loss, isolated bug, race condition…) and that’s why redundancy is implemented at the network (as in Storj) level.
Finally that is a Storj decision.
But the case I was referring to is a node that has already been disqualified. And the question is, what is required to let this node back into the network. So the situation is that such a node has already hit the DQ-threshold.
I really think a re-qualification like the one mentioned can only be done in cases where the node can prove that it did not lose any files during extensive auditing. As you would never be able to audit every single piece, the only good audit result would be that not a single piece is missing.
the requalification idea sounds interesting… tho i’m not sure how practically viable it actually is.
as i understand the problem, nodes cannot be requalified since auditing all the data is a difficult and data heavy task…
in theory tho, i suppose one could have like checksums of larger clusters of files, but these would change every time a file in the cluster would be updated.
zfs does it in some way… so it’s possible at least…
but a scrub, as its called in zfs and or checksum verification in general, takes a lot of time especially on small files.
and even then, it might not be much better than what the audit system can do, except a scrub would give a semi-accurate result of which data had been lost, while the audit system would be a gross estimate.
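a toy version of that kind of scrub, assuming a checksum manifest per piece (the real node keeps no such manifest — this is just to show the mechanic zfs does at block level):

```python
import hashlib
import os

def scrub(piece_dir, manifest):
    """Compare stored pieces against a recorded checksum manifest,
    a zfs-style scrub in miniature.

    `manifest` maps filename -> expected SHA-256 hex digest.  Returns
    the list of missing or corrupted pieces.  The manifest itself is a
    made-up assumption for this sketch.
    """
    bad = []
    for name, expected in manifest.items():
        path = os.path.join(piece_dir, name)
        if not os.path.exists(path):
            bad.append(name)            # piece lost entirely
            continue
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Hash in 1 MiB chunks so large pieces don't need to fit in RAM.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != expected:
            bad.append(name)            # bitrot / truncation
    return bad
```

unlike random audits, this tells you exactly which pieces are gone — but it reads every byte, which is why it’s slow on lots of small files.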
if it could save a node here and there that would be great… i mean requalification is a viable idea… but it might be very difficult to work into the existing code base of the storagenode and satellite software.
ofc it might not be… i really don’t have a clue… just saying it might not hold up in a short-term cost vs benefit analysis… however long term such a feature would benefit both StorjLabs and SNOs, so it’s certainly something worth trying to push for…
i doubt we should expect it this year…
really like the idea tho.
One thing that might be welcome is: if a satellite DQs you, you could have it purge your data and start over for just that one sat. I have some nodes that, for various reasons, had issues with one or more sats. They just remain DQed while the others keep on working.
We have suggested that a long time ago but its not on priority. I am DQed on 1 satellite too but as Alexey suggested I added that satellite to my ignore list. Now I don’t have to look at that banner tattooed on my dashboard’s forehead about being DQed. Also no more red box with DQed sat info.
i suppose one could also just make a new node running only the sats that another one is DQed on… so there is already a workaround…
would be a nice little feature tho.
the ability to restart working for a satellite might not be easy to work into the programming tho… depending on how it’s made…
however i would guess it should be, since the plan is to have all kinds of satellites that could come and go
@thepaul has already commented on this. The intention is to allow at most 2% loss to be survivable, but disqualify nodes with 4% loss or more. Ultimately, if you want to prove your node has recovered from a temporary issue, you would have to provide sufficient proof of that. One way to do that might be to reset the score and audit all previously failed pieces. If that results in a score less than 98%, tough luck, you’re out for good. If your score remains above that, the node could be put back into vetting mode or something.
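A sketch of that decision rule, with the thresholds from this thread (the ~2% survivable / ~4% DQ split and a 98% recheck cutoff); function and parameter names are hypothetical:

```python
def reinstatement_decision(previously_failed, recheck_results,
                           pass_threshold=0.98):
    """Decide whether a DQed node may re-enter vetting.

    `recheck_results` maps each previously failed piece to True (the
    audit passed on recheck) or False.  The 98% cutoff mirrors the
    thread's suggestion that up to ~2% loss should be survivable; the
    exact numbers are a Storj design choice, not settled here.
    """
    if not previously_failed:
        return True                     # nothing to recheck
    passed = sum(recheck_results[p] for p in previously_failed)
    score = passed / len(previously_failed)
    # At or above the cutoff: back into vetting.  Below: out for good.
    return score >= pass_threshold
```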
That said, I think recovering from temporary issues in this way is a patch rather than a solution. If the node was better able to detect temporary issues, it could shut down and protect itself before DQ even kicks in. Which would be a lot better than trying to fix it afterwards.
I agree, though some issues may be difficult to handle or be something nobody thought of before. The node should shut down after two file-not-found errors (different files) in a row (USB cable fell out). It also should shut down after timing out trying to read two files (I/O system frozen). Node operator could override this as part of recovery, but the node should shut down early to avoid DQ.
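The self-protection rules above could be sketched like this — the trigger counts and error classes are the ones proposed in this thread, not the node’s current behaviour, and all names are hypothetical:

```python
class SelfProtection:
    """Shut the node down early instead of failing audits until DQ.

    Two consecutive "file not found" errors on *different* files
    suggest the whole store vanished (USB cable fell out); two
    consecutive read timeouts suggest a frozen I/O system.  A node
    operator could override this as part of recovery.
    """

    def __init__(self, shutdown):
        self.shutdown = shutdown        # callback that stops the node
        self.last_missing = None        # last piece that was not found
        self.timeouts = 0               # consecutive read timeouts

    def record_read(self, piece_id, outcome):
        if outcome == "not_found":
            if self.last_missing is not None and self.last_missing != piece_id:
                self.shutdown()         # two different files missing in a row
            self.last_missing = piece_id
            self.timeouts = 0
        elif outcome == "timeout":
            self.timeouts += 1
            if self.timeouts >= 2:
                self.shutdown()         # I/O system looks frozen
            self.last_missing = None
        else:                           # a successful read resets both streaks
            self.last_missing = None
            self.timeouts = 0
```

Shutting down triggers only the recoverable downtime/suspension path instead of racking up audit failures toward irreversible DQ.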