I need a ZFS wizard to help me out

I think I messed up. I have a ZFS pool with two raidz1 vdevs of 4 HDDs each.

Today I changed the controller because it was giving me lots of errors. After the change, my pool showed up, but one drive kept removing itself and coming back online over and over, so I swapped this HDD out and started resilvering, since it looked like the drive had a defect (it kept happening after rebooting, rewiring etc., so it had to be a defect).

Resilvering was in progress. After a while I heard a sort of squeak, the sound the Toshiba MG09 makes when it spins down. It seemed two drives had turned off and back on for a fraction of a second. Now there was not enough parity left, since both HDDs had been disconnected for a short time, so I restarted the PC to re-initialize everything.

The PC got stuck, so I forced a shutdown (maybe my big mistake). After the restart, one of the working HDDs showed up as missing, even though it was plugged in. I checked with lsblk and saw that no partitions were shown (the forced restart broke the partition table).

So I used testdisk to recover the partitions; now it shows only one partition (the main ZFS partition), but the other 8 MB partition is missing.

Now I’m trying to clone this drive and shrink the ZFS partition to make room for the 8 MB partition, since after the recovery only 3 MB are left for some reason.

So now I’m standing here with one half-defective drive that needs to be resilvered, and a second drive with a messed-up partition table. Unfortunately ZFS shows the drive with the messed-up partitions as missing. I’m afraid to detach and re-attach it, because once it’s detached there is no way to attach it again if something is missing.

The funny thing is that both drives are in the same raidz1, which means my data is screwed, since there isn’t enough parity left. Is there some magic rescue I’m missing? Who can save me? I don’t want this to be my villain arc. I hope some ZFS magician can help me out.

TiA

I would post the question over in the TrueNAS forums. That’s the best place to get this sorted: https://forums.truenas.com/
If you are willing, they can guide you through getting the system healthy again.


I have seen a lot of those Toshiba MG09 drives die, but it was always a slow process starting with some bad sectors; none stopped working suddenly.

Your problem sounds to me like some other hardware issue, maybe cables, backplane or power supply.

I believe you don’t need the second 8 MB partition (What is the purpose of 8MB Solaris reserved 1 partition? · Issue #6110 · openzfs/zfs · GitHub). You can’t shrink a ZFS partition, or you will kill it. What you should try now is importing the pool in recovery (read-only) mode and saving your data to other disks. That’s the best method for you, because I think you will lose it all if you continue in that beast mode ))
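A read-only recovery import looks roughly like this. I’m assuming the pool is called "tank" here, substitute your own pool name; the -n together with -F only simulates the rewind so you can see whether it would work before committing to it:

    zpool import                               # just lists importable pools and their state
    zpool import -o readonly=on tank           # normal import, but read-only
    zpool import -o readonly=on -F -n tank     # last resort: test a rewind to an older txg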

P.S.: the GPT is stored in two copies on the disk, and you could have restored it from the backup at the end of the disk using gdisk. But now it’s too late (
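For anyone who catches this in time: restoring the main GPT from the backup copy with gdisk goes roughly like this (/dev/sdX is just a placeholder for the affected disk):

    gdisk /dev/sdX
    # then at the gdisk prompt:
    #   r   recovery and transformation options
    #   b   use backup GPT header to rebuild the main header
    #   c   load backup partition table, rebuilding the main table
    #   p   print the result and sanity-check it
    #   w   write to disk only if it looks right, otherwise q to quit without saving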

The problem is, the affected raidz1 (of the two) has only 2 running HDDs out of 4: one needs to be resilvered, the other has a messed-up partition table (the data is still there) and won’t get recognized properly. I don’t know if I can rescue much from this fragile state. I read that I can copy the partition table from a working drive and apply it to the messed-up one, to get a chance that this drive will be recognized properly again. Personal data is backed up for sure; this is about the 70TB of storj data. If I lose it, I’ll start from scratch, but that’s my last resort; in that case I would donate 70TB of data to the network as repair.
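If I go that route, I guess the sgdisk commands would look something like this, assuming both drives are the exact same model/size so the partitions land at the same offsets (SOURCE/TARGET are placeholders, and I’d triple-check with lsblk and the serial numbers first, since swapping them would wipe the good drive’s table):

    sgdisk --backup=/root/target-gpt.bak /dev/sdTARGET    # save whatever is on the target now, just in case
    sgdisk --replicate=/dev/sdTARGET /dev/sdSOURCE        # copy the good drive's table onto the target
    sgdisk --randomize-guids /dev/sdTARGET                # give the target fresh GUIDs so they don't clash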

I would try to export this ZFS pool and then import it again to see what errors it gives. Also, first you should check your power and data cables to rule out disk disconnection errors during operation.
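In commands, assuming the pool is named "tank" (importing by the stable by-id paths also means the controller swap can’t shuffle the device names around):

    zpool export tank
    zpool import -d /dev/disk/by-id tank
    zpool status -v tank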

The problem is, I only have some HDDs as replacements, not enough for a whole new system :(

Two drives missing: sounds like dodgy power splitter or faulty SATA controller.

Check all interface and power connections. Disconnect, then reconnect. Replace power supply if you have known good spare.

70TB of Storj data: they have paid you to keep spares.

Are all drives listed with 'lsblk'?
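For example something like this; the serial numbers also help you match physical drives to device names:

    lsblk -o NAME,SIZE,MODEL,SERIAL
    lsblk -f      # should show zfs_member on the ZFS partitions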

Did the resilver start automatically or did you start it?

That’s why I switched to a SAS controller, because the SATA controller card threw errors.

Power splitter, I don’t know, since I’m using 3x HDD hot-swap cases with 4 HDDs each. Each case has 2 SATA power connectors, and I don’t know if they split 2x2 internally or combine the power for all 4 drives.

I do have spares, 4x 18TB HDDs, for replacement, but not enough to build a whole new ZFS pool.

Yes, one drive was in some sort of on/off mode; I think the controller of the HDD may be damaged. That’s the one I wanted to resilver initially. The others are all shown, except the one with the faulty partition table, which shows up as just "/dev/sda" instead of "/dev/sda1 and /dev/sda9" like it should with ZFS. That’s why I was trying to recover the partition table with testdisk, which worked, but with a wrong size. The faulty partition table was caused, I think, by the very short power issue. The original data should still be on the drive.

Sorry, what do you mean by that?

Sorry. Meant: Did you start the resilver or did it start automatically? If the resilver was automatic and the original disk is still there, it should still mount read-only.

It’s worth removing each drive from the hot swap bay, and refit - when powered off (Not drive from cradle, cradle from hot swap bay). Just to reseat/clean connections.

The resilvering started by itself; it was just showing something like "random number" - "was /dev/…", so I wasn’t sure. It was also shown as "removed", even though it was resilvering, and it didn’t show the message about insufficient parity like it did when the two drives shut down for a fraction of a second. The thing is: the defective drive is replaced by a new one (-1 parity), and the other one has the messed-up partition table and is shown as "removed" and "was /dev…", so by my count there are only 2 HDDs left in the array. How is a sufficient resilver even possible?

How about the first defective drive? What exactly is its problem? If it is only some bad sectors, ZFS should still recognize it and load the pool in a degraded state from this drive plus the 2 good drives.

The drive had a problem: in zpool status it was shown as online, then removed, then online again, etc., so the resilvering reset every time. I put the HDD into another slot in my system to see if it was a connection issue. The drive showed the same behaviour in the other slot, so I figured it must be a defect in the drive itself.

In my opinion this drive is your best chance now. I would connect it to another system and check the SMART data. If there are errors but the drive is still working, the next step would be copying the data to a new drive using a tool like ddrescue.
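A rough ddrescue sketch (SOURCE is the failing drive, TARGET an empty drive of at least the same size; the map file lets you stop, resume and do retry passes later):

    smartctl -a /dev/sdSOURCE                                           # check SMART first
    ddrescue -f -n /dev/sdSOURCE /dev/sdTARGET /root/rescue.map         # first pass: grab what reads cleanly, skip hard areas
    ddrescue -f -d -r3 /dev/sdSOURCE /dev/sdTARGET /root/rescue.map     # second pass: retry the bad areas with direct access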

When I saw this behaviour I checked the SMART values, which were all good, no errors so far. I think depending on the defect, SMART doesn’t tell you everything. I’ll give it a try, put it back in and run the "online" command, hoping it’ll work.
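So roughly this, with placeholders for the pool name and whatever device name zpool status shows:

    zpool online <pool> <device>
    zpool clear <pool>          # reset the error counters once it's back
    zpool status -v <pool>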

The 8MB partition is not important. ZFS creates it to accommodate drive replacements where the new drive can be a few MB (up to 8) smaller than the old drive.

Do you have enough SATA cables to have all drives plugged in at the same time (the failing ones and the replacements)? If so, do it. ZFS can replace a drive that is still connected, and you get more redundancy during the process.
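As a sketch (names are placeholders; the old device name is whatever zpool status currently shows for the failing disk):

    zpool replace tank <old-device> <new-device>
    zpool status -v tank      # the old drive keeps contributing redundancy until the resilver completes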

Also, check the power supplies and cables. Clean the connectors with contact cleaner. Multiple drives disappearing/reappearing looks like a bad connection somewhere. Or maybe the controller is bad.

I have 8 drives connected to 2 SAS ports, and my mobo has 2 SATA ports left, so I’ll give it a shot.

The problem is, ZFS doesn’t recognize it as a ZFS drive anymore. Testdisk recovered the partition table, but maybe not the way it should be. So I thought cloning the partition table from a working drive would be a better fix, since this is the last straw I can hold on to.

Having 2 drives fail without any SMART errors is hard to believe. So again: this feels like some other hardware problem. How about the cables and backplane (hot-swap cases)?

Only one failed (additionally) while resilvering; later two drives shut down and spun up again, resulting in one of the drives losing its partition table completely, while the other one was not affected. The strange thing is, I just replaced the controller with an LSI 9300, and since then, on the first boot with the drives, they shut down and spun up during boot (normally they just keep running as long as they have power), like when you unplug the power for a short moment. I don’t know if it’s the controller telling the HDDs to do this, or whether the controller spins up every HDD at once, creating a peak in power consumption that the old controller maybe did not.

BTW:
I have some bad experience with such HDD cases. Once I had 3 of them in a Windows server where drives occasionally disappeared and only came back after a reboot. I don’t use such hardware anymore.