HDD failure, partial file loss, a way to trigger an audit/repair?

Dear fellow SNOs,

This week I have been graced with two failing hard drives affecting two nodes. I'm working to pull off as much data as possible, and even optimistically it looks like I'll still have some file loss. Like a few gigabytes per terabyte.

(I'm using the traditional piecestore, so we're talking millions of files.)

Old threads have indicated that there’s nothing to be done, but just in case: is there any way I can trigger an audit or repair process when my nodes come back online? Something to report back to the satellite what files actually exist and which don’t?

It would be nice for two purposes:

  1. Selfishly, maybe a way to avoid node disqualification

  2. Altruistically, a way to inform the network that, of the 30 shards it had for a file, only 29 remain, and to trigger the repair process.

Otherwise, my game plan is to YOLO it: bring the nodes back online and just see if they get disqualified over several days.

(As an aside, if anyone knows a way to attempt to recover files off a ZFS drive that won't even mount anymore, let me know!)

No, there is no way to trigger an audit of the whole repository – auditing is already running as fast as possible.

I would not worry about this at all – network fully expects some bits to rot.

What does that mean? Just one drive, or an array? Are all members present?


No.

If you cannot recover the files on your end, then that’s the only way.

I think with a couple GB per TB gone you’ll likely still pass audit checks.


Yeah, you'll probably take a small audit hit, but I bet you'll still be fine. Since most data eventually gets deleted, your score can still improve as the lost data ages out and gets replaced.

Yes, literally just one drive. I was operating it as a single independent drive for the node, it just happened to be ZFS. And now it has only a fleeting ability to be brought online.

The other failing (also independent) drive seems to be doing a good enough job of letting me migrate the data off, but that will take another couple of days to finish.

Off-topic, but I can understand SNOs who use parity configs for their nodes. Sure, Storj and their customers may not need it, but it can take years to grow a node: losing one and starting again is painful. And the space taken up by parity isn't really a risk to your earnings, since so few disks fill anyway.


What is the output of "zpool status" and "zpool import"?

Since the drive is failing, use ddrescue to copy it in its entirety to another, good drive. Copying as an image should be fast (other than the spots where the bad sectors are), faster than copying individual files anyway.

After that, you can try mounting it and copying the files somewhere else.
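A minimal sketch of the two-pass approach (device names are placeholders for the failing source and the good target; the target can also be an image file). The map file is what lets ddrescue resume and retry only the bad areas:

```shell
# First pass: grab everything readable quickly, skipping bad areas (-n = no scraping)
ddrescue -f -n /dev/sdX /dev/sdY rescue.map

# Second pass: go back and retry the bad areas a few times
ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map
```

The `-f` flag is required when writing directly to a block device; drop it (and use a file path as the target) if you're rescuing to an image instead.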


I had to replace like 10 disks out of 100 so far. No node was lost. So I would say don’t double your disk costs, just watch smart reports and start migration early.

Right now I've got both drives mounted read-only because they are just in rough shape.

  pool: MDD14TB
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see:
  scan: resilvered 2.11M in 00:00:03 with 0 errors on Sun Aug 17 15:54:12 2025
config:

        NAME                                    STATE   READ WRITE CKSUM
        MDD14TB                                 ONLINE     0     0     0
          8d5e9ffc-c152-4d67-aa69-2053a00cce20  ONLINE    36     0 43.5K
        cache
          wwn-0x5000cca08c028620-part9          ONLINE     0     0     0

errors: 7564 data errors, use '-v' for a list

That’s not too bad. I would run scrub. (You can image the disk beforehand if you want to be ultra safe).
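Roughly (using the pool name from the status paste above; imaging the disk first with ddrescue is the cautious option, since a scrub is read-intensive on an already-struggling drive):

```shell
zpool scrub MDD14TB        # start the scrub
zpool status -v MDD14TB    # watch progress; -v lists the files affected by data errors
```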

Are you sure this is a disk failure and not cabling/port/hba? Anything interesting in system messages?

Over what period of time?

But I agree, modern (this century) disks are not designed to be used standalone.


Oh dmesg is spamming ALL SORTS of errors on these. I thought my old HBA was giving problems, so I recently swapped HBA and SAS breakout cables. Problem got worse. Then I changed to using SATA cables plugged into motherboard (and a different power cable), problem got worse.

After I migrate what I can while plugged in to the server I’ll probably try a USB enclosure and see if anything new happens.

Power Supply issue?


Did you read out the SMART info from your hard drives? Most often it shows whether they are the problem. If they have a high number of reallocated sectors and/or other errors, the drives themselves are usually failing. This is the first thing I do when I have problems with my drives, and most often I was able to find the defective one pretty fast.
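For example, with smartmontools (device name is a placeholder):

```shell
# Full SMART report, then filter for the attributes that most often indicate a dying drive
smartctl -a /dev/sdX | grep -Ei 'reallocated|pending|uncorrect|seek_error|read_error'
```

Rising raw values for Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable are the classic warning signs, even while the drive still reports an overall "PASSED" health status.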

Roughly 4 years.

BTW: The 100 are a mix of Seagate, WD and Toshiba of different sizes. The failed 10 are all 18 TB Toshiba.

The Toshiba MG series has a 5-year warranty. How old are they?

Yes, I should have mentioned that. Thousands in Seek Error Rate and Raw Read Error Rate. Ironically not below the SMART threshold for a "failed drive", but enough to ruin ZFS and keep putting the drive into suspended states.

Anyway, I just finished my first rclone pass on the first failing drive and am running a second attempt. So far it's only copied an additional 3 MB out of 17 million files, so I've probably got everything I'm going to get with that method. Then I can remove it, experiment with USB, and RMA it (one drive is under refurb warranty; the other isn't).
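For anyone in the same spot, the repeat pass looks something like this (paths are placeholders for my mount points; by default rclone skips files that already match by size and modtime, so a second run only attempts what failed the first time):

```shell
rclone copy /mnt/failing/storagenode /mnt/good/storagenode \
    --progress --retries 1 --low-level-retries 1
```

Lowering the retry counts keeps rclone from hammering unreadable sectors over and over on a drive that's already in rough shape.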

Two to four years old, so not in line with the bathtub curve. Toshiba replaced them all under warranty. Sometimes I even got a 20 TB model as a replacement for an 18 TB.

Interesting. So far I've had only one Toshiba fail out of all the ones I've installed, including for clients.