Facing a dying HDD

Hello,

At the moment I’m trying to copy data from the dying HDD to the new one, but I see that rsync is unable to read some files and folders.

If the rest of the data copies with no problems and I start this node again with some files and folders missing, will it work? Is that acceptable?

What is the best way to deal with this issue?

Thank you.

If you’re missing data, you’ll fail audits until your node is disqualified. If it’s a small amount of data, you might gamble that it won’t get audited enough to disqualify you. It may also be data that is only on one particular satellite, so you can get DQ’d there, but still be able to support the others. However, if it is a significant data loss, you will likely be DQ’d on all Sats.

In terms of data recovery, perhaps someone here knows more about rsync and if it is able to recover data that is unreadable. I haven’t dealt with this myself.

I believe more than 4% data loss gets your node DQ’d.

1 Like

It’s often better to clone a damaged HDD or partition with a utility like ddrescue (commonly included in standalone bootable Linux distros), then fsck/chkdsk the cloned filesystem on the good disk.

Rsync isn’t going to recover anything. And if a disk is damaged, it can take days or weeks to fsck/chkdsk it in place (if it completes at all). A utility like ddrescue will copy every last one and zero it can from the failing HDD (including data that may be damaged), and THEN you can quickly fsck/chkdsk the clone.
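For reference, a typical ddrescue run looks something like the sketch below. The device names are placeholders (verify yours with lsblk first), and the map file is what lets you stop and resume without re-reading good areas:

```shell
# Hedged sketch of a ddrescue clone; /dev/sdX (failing source) and
# /dev/sdY (new target) are placeholders -- double-check with lsblk,
# because ddrescue will happily overwrite the target.

# Pass 1: fast copy of everything readable, skip scraping bad areas (-n);
# -f forces writing to a block device; rescue.map records progress
ddrescue -f -n /dev/sdX /dev/sdY rescue.map

# Pass 2: retry only the bad areas, up to 3 times, with direct disc access
ddrescue -f -d -r3 /dev/sdX /dev/sdY rescue.map

# Only then repair the filesystem on the clone, never on the dying disk
fsck -f /dev/sdY1
```

The same map file must be reused between passes, so ddrescue only revisits the regions it couldn’t read the first time.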

4 Likes

Would be nice to know exactly.

But even 4% data loss, in my opinion, should not lead to DQ. Sometimes bad sectors occur, or other hardware issues cause some data loss.
But then the node operator can copy all the good data to a new drive and keep running the node. Of course there could be some “penalties” or something like that for data loss, but DQing the node does not look like the right way to treat node operators, because it is not the operator’s fault if hardware fails.

1 Like

I’m not using rsync to recover the data. I’m using rsync to transfer the data to the new drive :slight_smile:
But I will take a look at the “ddrescue” you recommended.

By the way, fsck passes this disk with only some minor issues.

Thank you.

It’s not about the treatment of operators, it’s about the integrity of the whole network.
That has top priority over any single node. Whose fault it is isn’t the factor here.

2 Likes

Let’s say 80% of the data transferred to the new HDD and is working great there. What is the reason to destroy it?

My experience is: forget rsync. Just take the node offline and clone the disk in less than 1-2 hours with some program or a docking station (like this: https://www.youtube.com/watch?v=mFwdZplM9bg). Even 1-2 days offline is nothing, and the node will recover fast, but you have full control and can check the disk and make sure everything cloned.

Satellites decide for themselves whether the audited data is enough to keep the node active for their dataset (there are 4 of them) or not. If too much is lost, it’s too much; however, the other satellites may continue with the node,
no matter how good the new disk is.

We SNOs all face it someday: the drive dies. Then we decide whether to start over with a new node or retire it, hopefully having been profitable. But under normal circumstances you can have more than one node running before the first drive dies.

It depends on the distribution over the satellites. If the missing data is distributed as, let’s say, 1% / 1% / 1% / 17%, then only one satellite will refuse to work with the node (that’s what disqualification means, technically).

2 Likes

I doubt fault has anything to do with it - though if a SNO has a HW failure: it’s exclusively their problem. Repairing lost data has a cost (paid to other SNOs) and the company has to maintain the satellites to detect and deal with it. I can understand Storj not wanting to continue to pay a SNO who has destroyed a certain amount of customer data.

Ultimately it’s up to the SNO: they can run things in a way that’s more durable (but less profitable)… or take the chance that they may get DQ’d one day and have to restart with a new identity (but make more $$$ until that DQ occurs)

1 Like

I would suggest giving this approach a chance:

3 Likes

It’s not your opinion that takes precedence here; as already stated above, it’s the customer and network perspective that takes precedence.

As you can imagine, auditing is quite a time-consuming process. As soon as Storj finds out that less than 96% of the data is there, it still doesn’t know exactly how much remains (though it can be estimated from the audit scores), and moreover it doesn’t know which data is there and which is not. Then it’s easier to just consider all the data lost.

For sure the best advice given here. Try it and hope most of the data turns out to be recoverable. And if not, I hope for you the loss is unbalanced, so only one satellite will DQ you. But you’ll see; in any case, there isn’t much more you can do about it.

1 Like

It’s even more complicated. The satellite cannot audit every single piece (do you remember how much time just the filewalker takes, executed by the local process, not requested remotely piece by piece?), so it uses a probabilistic model: pieces are audited randomly, and if a certain number of audits fail, it will assume it cannot trust this node anymore. This number of failed audits roughly translates to an amount of lost data; see details there:
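As a back-of-the-envelope illustration (a toy model, not Storj’s actual audit scoring): if a fraction p of pieces is lost and audits sample pieces uniformly at random, the chance of at least one failed audit after n audits is 1 − (1 − p)^n:

```shell
# Toy model, not Storj's real reputation scoring: probability of at
# least one failed audit after n uniform random audits, if a fraction
# p of the node's pieces is lost.
awk 'BEGIN { p = 0.04; n = 100; printf "%.3f\n", 1 - (1 - p)^n }'
# -> 0.983
```

So even a 4% loss is almost certain to be noticed within about a hundred audits; what the satellite does with that signal is up to its reputation model.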

2 Likes

I’ve moved the data to the new HDD. Out of 2TB of data, I have 30 files that are corrupted.
How do I deal with it?

sending incremental file list
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2d/lbyzx4g7exxfp3etudxpkrpu47cqdqmppb4xhelu2z3yzgh42q.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/3t/g4jojgv5djbc5vhbpdawevc776nrwbzcy4adcprn44h2sv3oya.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/5l/w5yj2rgbxs6sv2g4qcnpnhyi2jofos35dv3mtf3oj5pfhahrua.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/5u/vu7nsic2s7uzh266mwwhxlj43fjtlzyh5wucx3bc4yzrorxt3q.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/67/o7ig6kjky64nnvd6mi4etftjac34jzjtx3hl7vq3dmp3mnkqpa.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/6c/3bvlg63okyywannzoclo6aze4vd4uifmweqwurgngkndkcypoa.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/6o/fpyvx4q7cnfqfhcma5swfvboqiav3xzjcuchegobv63rspxdca.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/6u/3mgz7jkh4rpeo633erxv3sfely7d7dxcdxeaerky2zfzwdkv3q.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/c2/krqfgz4ljltv4bnyoyun2ct3jls5hklvjl356b27skm5tcoldq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/cn/m54togby5nox4tvbt3d5fbtz3xsfxdalfetuofpwvo5orreveq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/df/mtclrjgdattbuh5bf6ividzx6f7k762jxhocivmalmd5u4wxeq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/dx/35xn4akv4atam6wnenj2fscunpsauypcyaxdqyzqlmwwhldgja.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/e7/ooepdgrrxhql2s7nj32cgulhujqlda7lht4sbdsaeactlyiwjq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/gc/f4e6snyqjy2cipzwiqhyiomysdsttmm7fxawkas7anvlqlfsha.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/hd/uifiblgevelgmw3uayq6zoabk5cihemsl3bjbd76tnnezu4uia.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/hk/zxbf7qmnlyb7guxb2bvtg5soyr2rxqhm24f4j5bsenhk5gsr3a.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/kt/vxyo7loxvewsvzaohhxj7jh2yeokaurjy4opd7ceswwrq7e2jq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ll/652bdsjfh6qxdg2flitu2ptbczyjsrpyusrzr5aw2ors4dkeaa.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/my/q46xqmx7usnrmtvv4uya7dvct734lnfk4clqiqv2ryyqbjl2jq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/op/nuqvnfblmdv7tnw3xhhqxncmj5n7xhuc2mk6yiefolqkflt64a.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/pp/vkc52onifyv2im5lcjpemrrhfzdn7pl4pxo67yjboxe2w2zvga.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/q6/axeuw5zejr7nroakdi7olitn4ymmz53fmd4otalcurifri5rfq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/tv/pc4nilcrtp7nu4a3g6virwxu5co4okswb46bjk5ekdiri2y3ta.sj1”) failed: Structure needs cleaning (117)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/xc/tgzk52sle7s7p2g6l6cikntzlrpww34hskwyafz3eyt7knkjra.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/xo/xxlp2wadrvs72psyt4y2yawyukadwt3e5lcd3itcfqebq3f2wa.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/xw/bnjpfpom53fhrgavpj5ri4qb632rva5bcvgax3i5ko6d6cufwq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/yc/mrddijhzgwpbtma52t7j66isxzmw75fii55c37a72nhpfi72kq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ze/btfd5quyxpaspimuqfkknh22sht7g3xbn3pijsojbmqe7apkcq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/a7/t3d7ngvmqausson2eaex3joajudf2i7c2sxoourkcycrbniwhq.sj1”) failed: Bad message (74)
rsync: readlink_stat(“/home/user/storagenode/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/p3/na4obtq6j7ksmdb6i7deanjxzwwvndmthlocep2l7vlukydq2q.sj1”) failed: Bad message (74)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1205) [sender=3.1.3]

Just a note: you could consider checking the disk with chkdsk (Windows) or e2fsck (Linux, ext{2…4}) and running rsync afterwards.
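If going the check-then-copy route, the sequence might look like the sketch below. The device name and mount points are placeholders, and e2fsck must only run on an unmounted filesystem:

```shell
# Sketch with placeholder device/paths -- adjust to your layout.
umount /mnt/old                 # e2fsck must not run on a mounted fs
e2fsck -f /dev/sdX1             # -f forces a full check even if marked clean
mount /dev/sdX1 /mnt/old

# Then copy: -a preserves permissions/ownership/times, -H keeps hard
# links, -P resumes interrupted transfers and shows progress.
rsync -aHP /mnt/old/storagenode/ /mnt/new/storagenode/
```

The trailing slashes matter to rsync: `/mnt/old/storagenode/` copies the directory’s contents into the destination rather than nesting another `storagenode` folder inside it.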

How sure are you the disk has failed? This rather looks like a temporary file system corruption (too few files).

Actually, you might be right. I’ve been thinking about this for a few days now. Maybe the problem is not the HDD. But…
This node (docker container) was very often shutting down by itself, saying that the disk is read-only.
And a second thing: this drive was losing its mount under Ubuntu. e2fsck did not show me a lot of problems, and neither did smartmontools. But since it kept dropping its mount and stopping the storagenode, I just decided to replace it. If it’s only filesystem corruption, it should not unmount by itself.

Now the node runs off another HDD, and that one has not unmounted by itself. But the storagenode container has already stopped once or twice. I was not near the computer, so I just restarted the container. But it looks like I have to examine this strange behavior more deeply.

Then the most important question is: what drives are we talking about? Are they SMR or CMR?

For sure CMR :slight_smile: And this drive was running fine for about a year.

Just do some basic tests with HD Tune Pro (its first 15 days are a free trial). It will let you do a quick scan and a performance test like the ones shown here, and you will see if there are bad sectors, high latency, or just too-poor performance on the graph: