This week I have been graced with two failing hard drives, affecting two nodes. I’m working to pull off as much data as possible, but even optimistically it looks like I’ll still have some file loss. Like a few gigabytes per terabyte.
(I’m using the traditional piecestore, so we’re talking millions of files).
Old threads have indicated that there’s nothing to be done, but just in case: is there any way I can trigger an audit or repair process when my nodes come back online? Something to report back to the satellite which files actually exist and which don’t?
It would be nice for two purposes:
Selfishly, maybe a way to avoid node disqualification
Altruistically, a way to inform the network that, of the 30 shards of data it had for a file, it really only has 29 now, and to trigger the repair process.
Otherwise, my game plan is to YOLO it, bring the nodes back online, and just see if they get disqualified over several days.
(As an aside, if anyone knows a way to attempt to recover files off a ZFS drive that won’t even mount any more, let me know!)
Yeah, you’ll probably take a small audit hit, but I bet you’ll still be fine. Since most data eventually gets deleted… your score can still improve as the lost data ages out and gets replaced.
Yes, literally just one drive. I was operating it as a single independent drive for the node, it just happened to be ZFS. And now it has only a fleeting ability to be brought online.
The other failing (also independent) drive seems to be doing a good enough job of letting me migrate the data off, but that will take another couple of days to finish.
Off-topic, but I can understand SNOs who use parity configs for their nodes. Sure, Storj and their customers may not need it… but it can take years to grow a node, and losing one and starting again is painful. The real downside isn’t the space taken up by parity limiting your earnings, since so few disks fill anyway.
Since the drive is failing, use ddrescue to copy it in its entirety to another, good drive. Copying it as an image should be fast (other than the spots where the bad sectors are), faster than copying individual files anyway.
After that, you can try mounting it and copying the files somewhere else.
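Something along these lines should work as a starting point (device names are placeholders, adjust for your setup; the mapfile is what lets you resume and do retry passes later):

```
# first pass: grab everything that reads cleanly, skip past the bad areas quickly
ddrescue -f -n /dev/sdX /dev/sdY rescue.map

# second pass: go back and retry the bad areas a few times
ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map
```

You can also rescue to an image file instead of a second drive if that’s easier; same syntax, just with a filename as the output.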
I’ve had to replace around 10 disks out of 100 so far, and no node was lost. So I would say don’t double your disk costs; just watch the SMART reports and start migrating early.
Right now I’ve got both drives mounted read-only because they are just in rough shape.
```
  pool: MDD14TB
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see:
  scan: resilvered 2.11M in 00:00:03 with 0 errors on Sun Aug 17 15:54:12 2025
config:
```
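In case anyone lands here with the same problem: to keep ZFS from trying to write to a shaky pool, importing it read-only is roughly this (pool name taken from the status above):

```
zpool import -o readonly=on MDD14TB
zpool status -v MDD14TB    # -v lists the individual files with unrecoverable errors
```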
Oh, dmesg is spamming ALL SORTS of errors on these. I thought my old HBA was giving problems, so I recently swapped the HBA and SAS breakout cables. The problem got worse. Then I changed to SATA cables plugged into the motherboard (and a different power cable), and the problem got worse still.
After I migrate what I can while plugged in to the server I’ll probably try a USB enclosure and see if anything new happens.
Did you read out the SMART info from your hard drives? Most often it shows whether they are the problem. If they have a high number of reallocated sectors and/or other errors, the drives themselves are usually failing. That’s usually the first thing I do when I have problems with my drives, and most of the time I’m able to find the defective one pretty fast.
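If it helps, this is roughly what I run (smartmontools; the device path is just an example):

```
smartctl -a /dev/sdX                 # full SMART report for the drive
smartctl -A /dev/sdX | grep -Ei 'realloc|pending|uncorrect|seek|read'   # just the attributes I care about
```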
yes, I should have mentioned that. Thousands in Seek Error Rate and Raw Read Error Rate. Ironically not below the SMART threshold for “failed drive” but enough to ruin ZFS and keep putting the drive in suspended states.
Anyway, I just finished my first rclone pass on the first failing drive, and am running through a second attempt. So far it’s only copied an additional 3 MB out of 17 million files, so I’ve probably got everything I’m going to get out of that method. Then I can remove it, experiment with USB, and then RMA it (one drive is under refurb warranty, the other isn’t).
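For anyone curious, the passes were just local-to-local rclone copies, roughly like this (paths are examples from my setup):

```
rclone copy /mnt/failing-disk/storagenode /mnt/new-disk/storagenode \
  --transfers 16 --ignore-existing --progress
```

--ignore-existing is why the second pass only picks up whatever the first one missed.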
Two to four years old, so not in line with the bathtub curve. Toshiba replaced them all under warranty. Sometimes I even got a 20TB model as a replacement for an 18TB.