HDD failure, partial file loss, a way to trigger an audit/repair?

I also have 14, 16 and 20 TB Toshibas without any problems so far. The 18 TB was the first MAMR model, so I guess the problems are related to this new technology.

Partial update: well, I migrated the data for a node and restarted, and after a day or two I got disqualified on US1. Bummer.

Weirdly, looking through the audit logs it seems the only audits I failed were "error": "hashstore: file does not exist". But I'm not using hashstore yet. I wonder if there was a directory that needed to exist somewhere that was causing a problem?

I just did a forget untrusted satellites for US1, but I’ll probably shut down that whole node since without US1 there isn’t really enough data.
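In case it helps anyone searching later: by that I mean the storagenode forget-satellite subcommand. A rough sketch of the invocation for a docker setup is below; the flags are from memory, so check ./storagenode forget-satellite --help for the exact ones before running anything.

```
# Illustrative only; verify flag names with --help first.
# --all-untrusted forgets every satellite no longer in the trust list.
# A specific satellite ID can be passed instead, with --force if the node
# was disqualified there rather than the satellite being untrusted.
docker exec -it storagenode ./storagenode forget-satellite --all-untrusted \
  --config-dir config --identity-dir identity
```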

Oh, and regarding the technical nature of the failing hard drives… yeah, it's weird. I had both plugged into the motherboard's SATA controller for debugging, and the kernel would regularly log DMA errors and link resets, with ZFS throwing I/O errors on top. When I only did activity on one drive at a time instead of both at once, the errors were reduced. When I removed one drive and changed the SATA cable, the errors were reduced further, but they never went away. So I may have had a marginal SATA cable in my stash too, combined with the other problems. But the drives were definitely failing anyway.
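For anyone chasing similar symptoms, this is roughly the kind of thing I was watching; the device and pool names below are just placeholders.

```
# Kernel side: look for ATA link resets and DMA/I/O errors
dmesg -T | grep -iE 'ata[0-9]+|reset|dma|i/o error'

# ZFS side: -v lists the files that actually took read/write/checksum errors
zpool status -v tank
```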

After I finished copying what I could over SATA, I unplugged the drive, plugged it into a USB adapter, and did another rclone attempt. It copied an additional 3 MB out of the 2 TB, but it was just to make sure.
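The salvage copy itself was nothing fancy, something along these lines, with made-up paths (not the exact flags I used):

```
# Copy whatever is still readable from the failing drive's mount to the new disk.
# --retries 1 stops rclone from re-running the whole copy over files that will never read.
rclone copy /mnt/failing-disk/storagenode /mnt/new-disk/storagenode \
  --transfers 4 --checkers 8 --retries 1 --progress
```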

The other failing drive I'm still attempting to migrate the pieces off of. It's just running very slowly (8 million files so far).

One drive was still under its refurb warranty from goharddrive, so I already shipped it off for RMA.


It’s just a last check, unless you have hashstore enabled (specifically ReadNewFirst). If hashstore is enabled, it will check hashstore first and then piecestore. If hashstore is not enabled, the check happens in the reverse order.

So it just means that this piece is lost: it was not found in piecestore and it was not found in hashstore. This is a “known” audit error, so it affects the audit score immediately.

Okay, that’s encouraging. I thought I had something set wrong with my hashstore workaround. And I guess it explains the incongruity of the piecestore code reporting the hashstore error.

Aw poop, my other node affected by the disk problems was also just disqualified by US1. Audit score dropped under 96%.

On the positive side, this leaves me with two baby-sized nodes with only a little bit of data from AP1 and EU1, so I can use those to test out hashstore migration.

Ironically, one of the hard drives that crapped out is not showing any SMART errors at all, I don’t think (now that I’ve learned what the Seagate seek and read error rates mean), and it may be… fine? I’m going to remove it and test it really hard before doing anything with it.
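If anyone wants to sanity-check a Seagate the same way, the raw values are the confusing part; as I understand it, the seek/read "error rate" raw numbers pack an operation count into the low bits, so huge values are normal. The device name below is a placeholder.

```
# Full SMART dump, then kick off a long self-test and read the results later
smartctl -a /dev/sdX
smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX

# Note: on many Seagates the raw Raw_Read_Error_Rate / Seek_Error_Rate values are
# 48-bit fields where only the high 16 bits are actual errors; the low 32 bits
# count operations, so a large raw number by itself is not a failure sign.
```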

No SMART errors could indicate faulty connectors/cables?

Many times all I’ve seen with SMART is the drive’s temperature increase a couple of degrees (over days) before it dies.


So, as a coda to my issue (yes, I looked up “coda” and I think that’s what this is)…

It was the stupid SATA power splitter cable.

I had removed the complaining hard drives, had one of them RMA’d, and was testing them separately with a USB adapter on my Windows machine, and they were both working splendidly.

When I reinstalled them in my server case, I rearranged a few drives and started getting the same sort of drive reset errors continuously on two different known-good drives. By this point the hard drives, SAS controller, and SAS cables had already been replaced. Moving their power to a different plug didn’t make a difference. I had to remove the entire SATA power splitter cable and replace it.

Now it’s been 12 hours of badblocks without even a whiff of a drive error.
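For completeness, that’s the destructive write test, roughly as below; it wipes the drive, and the device name is a placeholder.

```
# Four-pattern write+verify pass; -s shows progress, -v lists any bad blocks found.
# -b 4096 matches 4K sectors and keeps the block count under badblocks' 32-bit
# limit on big drives (use an even larger -b if it still complains).
badblocks -b 4096 -wsv /dev/sdX
```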

Sigh. So I just need to pour one out for my two disqualified nodes.

(I think I’m going to use those to test out hashstore migration, then gracefully retire them.)


Holy shit. But yes… crap cables are crappy.

Because they are so cheap, the market is flooded with low-quality trash. I don’t know where to buy quality cables anymore.

So I solved the problem differently: no more cables. Backplane or go home. The backplane connects to the host adapter with mini-SAS: since those cables are much more “exotic” and expensive compared to SATA, the amount of shit on the market is lower, and due to the relative complexity of the cable they tend to be of better quality too: those who choose to make them likely have equipment that is a bit more advanced than what they can get away with for stamping out SATA cables.

Power delivery — same. PDBs from old servers. They don’t suck — they power the backplane with Molex 8981 connectors, old as a dinosaur turd, and through the magic of not being made for retail — they work well.

This approach, ironically enough, helps avoid dealing with modern retail e-waste and keeps old, good hardware from taking up space in landfills. Win-win.


Universal truth
