If you’re missing data, you’ll fail audits until your node is disqualified. If it’s a small amount of data, you might gamble that it won’t get audited enough to disqualify you. It may also be data that is only on one particular satellite, so you can get DQ’d there, but still be able to support the others. However, if it is a significant data loss, you will likely be DQ’d on all Sats.
In terms of data recovery, perhaps someone here knows more about rsync and if it is able to recover data that is unreadable. I haven’t dealt with this myself.
It’s often better to clone a damaged HDD or partition with a utility like ddrescue (which is also commonly included in standalone bootable Linux distros)… then fsck/chkdsk that cloned filesystem on the good disk.
Rsync isn’t going to recover anything. And if a disk is damaged, it can take days or weeks to try to fsck/chkdsk it in place (if it completes at all). So a utility like ddrescue will copy every last one and zero it can from the failing HDD (including data that may be damaged), THEN you can quickly fsck/chkdsk it.
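A minimal sketch of that clone-then-check workflow, assuming GNU ddrescue and e2fsck are installed and the clone target is an ext filesystem; the device paths, partition number and mapfile location are placeholders, so double-check them before running anything, since writing to the wrong device is destructive:

```python
# Sketch of the clone-then-check workflow: copy readable data first, retry
# bad areas, and only then run the filesystem check on the clone.
import subprocess

FAILING_DISK = "/dev/sdX"       # placeholder: the damaged source drive
GOOD_DISK = "/dev/sdY"          # placeholder: the healthy destination drive
MAPFILE = "/root/ddrescue.map"  # lets ddrescue resume if interrupted

# First pass: grab everything easily readable, skip the slow scraping phase (-n).
subprocess.run(["ddrescue", "-f", "-n", FAILING_DISK, GOOD_DISK, MAPFILE], check=False)

# Second pass: retry the bad areas a few times (-r3), reusing the same mapfile.
subprocess.run(["ddrescue", "-f", "-r3", FAILING_DISK, GOOD_DISK, MAPFILE], check=False)

# Only now run the filesystem check, on the clone (partition 1 here as an example).
subprocess.run(["e2fsck", "-f", GOOD_DISK + "1"], check=False)
```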
But even 4% data loss in my opinion should not lead to DQ. Sometimes bad sectors can occur, or other HW issues might cause some data loss.
But then the node operator can copy all the good data to a new drive and keep running the node. Of course there could be some “penalties” or something like that for data loss. But node DQ does not look like the right way to treat node operators, because it is not the operator’s fault if HW fails.
My experience is: forget rsync, just take it offline and clone it in less than 1-2h with some program or with a docking station (like this https://www.youtube.com/watch?v=mFwdZplM9bg). Even 1-2 days offline is nothing and the node will recover fast, but you have full control and can check the disk and make sure everything cloned.
Satellites decide for themselves if the audited data is enough to keep the dataset (there are 4) active or not. If too much is lost, it’s too much; however, the other satellites may continue with the node.
no matter how good the new disk is.
We SNOs all face it: someday the drive dies. Then we decide whether to start over with a new node or let it rest, hopefully having been profitable. But under normal circumstances you can have more than one node running before the first drive dies.
It depends on the distribution over the satellites: if the missing data is distributed over, let’s say, 1% / 1% / 1% / 17%, then only one satellite will refuse to work with the node (that’s what disqualification technically means).
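A toy illustration of that per-satellite view: each satellite only sees (and can only DQ for) the pieces it stored on the node. The 4% cutoff below is just an assumed number for illustration, not the satellites’ actual decision logic:

```python
# Per-satellite loss check (toy example): one satellite with heavy loss can
# disqualify the node while the others keep working with it.
lost_fraction = {"sat-A": 0.01, "sat-B": 0.01, "sat-C": 0.01, "sat-D": 0.17}
ASSUMED_DQ_THRESHOLD = 0.04  # assumption for illustration only

for sat, lost in lost_fraction.items():
    status = "likely DQ" if lost > ASSUMED_DQ_THRESHOLD else "likely survives"
    print(f"{sat}: {lost:.0%} of its pieces missing -> {status}")
```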
I doubt fault has anything to do with it - though if a SNO has a HW failure: it’s exclusively their problem. Repairing lost data has a cost (paid to other SNOs) and the company has to maintain the satellites to detect and deal with it. I can understand Storj not wanting to continue to pay a SNO who has destroyed a certain amount of customer data.
Ultimately it’s up to the SNO: they can run things in a way that’s more durable (but less profitable)… or take the chance that they may get DQ’d one day and have to restart with a new identity (but make more $$$ until that DQ occurs)
It’s not your opinion that takes precedence here; as already stated above, it’s the customer and network perspective that takes precedence.
As you can imagine, auditing is quite a time-consuming process. As soon as Storj finds out that less than 96% of the data is there, it still doesn’t know exactly how much is left (that can only be estimated from the audit scores), and moreover it doesn’t know which data is there and which is not. Then it’s easier to just consider all the data lost.
For sure the best advice given here. Try it and hope most data will turn out to be recoverable. And if not, I hope for you that the loss is unbalanced, so only one satellite will DQ you. But you’ll see; either way there are no options to do more about it.
It’s even more complicated. The satellite cannot audit every single piece (do you remember how much time it takes just for the filewalker, executed by the local process, not requested remotely piece by piece?), so it uses a probabilistic model: pieces are audited randomly, and if a certain number of audits fail, it will assume that it cannot trust this node anymore. This number of failed audits roughly translates to an amount of lost data, see details there:
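To make the idea concrete, here is a toy simulation of why random audits are enough, even though the satellite never walks every piece. This is not the satellites’ actual reputation formula; the audit count is a made-up number just to show that sampling converges on the real lost fraction:

```python
# Toy audit sampling: each random audit hits a lost piece with probability
# roughly equal to the fraction of data the node has lost.
import random

def simulate_audits(lost_fraction, audits=1000):
    """Audit random pieces; return the observed failure rate."""
    failures = sum(random.random() < lost_fraction for _ in range(audits))
    return failures / audits

random.seed(42)
for lost in (0.01, 0.04, 0.17):
    estimate = simulate_audits(lost)
    print(f"actual loss {lost:.0%} -> estimated from audits: {estimate:.1%}")

# Even a modest number of random audits gives a usable estimate of the lost
# fraction, which is why failed audits translate into an assumed amount of loss.
```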
Actually you might be right. I have been thinking about this for a few days now. Maybe the problem is not in the HDD. But…
This node (docker container) was very often shutting down by itself, saying that the disk is read-only.
And the 2nd thing: this drive was losing its mount under Ubuntu. e2fsck did not show me a lot of problems, and neither did smartmontools. Since it kept dropping from its mount and stopping the storagenode, I just decided to replace it. If it’s only filesystem corruption, it should not unmount by itself.
Now the node runs on another HDD, and it has not unmounted by itself. But the storagenode container has already stopped once or twice. I was not near the computer, so I just restarted the container. But it looks like I have to examine this strange behavior more deeply.
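One quick thing you could check (a Linux-only sketch reading /proc/mounts, nothing Storj-specific) is whether the storage mount is still present and whether it has flipped to read-only; the mount point below is a hypothetical example path, adjust it to your node:

```python
# Check whether a mount point is still mounted and whether it is read-only.
MOUNT_POINT = "/mnt/storj"  # hypothetical example; use your node's actual mount point

def check_mount(mount_point):
    with open("/proc/mounts") as f:
        for line in f:
            device, mnt, fstype, options, *_ = line.split()
            if mnt == mount_point:
                read_only = "ro" in options.split(",")
                return f"mounted ({fstype}), {'READ-ONLY' if read_only else 'read-write'}"
    return "NOT MOUNTED"

print(f"{MOUNT_POINT}: {check_mount(MOUNT_POINT)}")
```

Running this from cron and logging the output would at least tell you whether the container stops before or after the mount goes away.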
Just do some basic tests with HD Tune Pro (its first 15 days are a free trial). It will let you do a quick scan and a performance test like the ones here, and you will see if there are some bad sectors, big latency, or just too-poor performance on the graph:
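If the drive is in a Linux box instead of Windows, a rough equivalent is to pull a few SMART attributes with smartmontools (already mentioned above). This is only a sketch: the attribute names vary between drives and the device path is a placeholder:

```python
# Grab a few SMART attributes that commonly hint at bad sectors.
import subprocess

DEVICE = "/dev/sdX"  # placeholder: the drive you want to check
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

out = subprocess.run(["smartctl", "-A", DEVICE], capture_output=True, text=True).stdout
for line in out.splitlines():
    if any(attr in line for attr in WATCHED):
        fields = line.split()
        name, raw_value = fields[1], fields[-1]
        print(f"{name}: raw value {raw_value}  (non-zero is a warning sign)")
```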