Damaged ZFS drive causing high IO wait and boot issues

I am using some old hard drives that are reporting SMART errors already. If one of them has a bad sector I run into 2 issues.

  1. If the storage node tries to read the bad sector, my system becomes unresponsive with 100% IO wait. I am unable to cancel it, unable to open an SSH connection, unable to restart my machine. Is there a way to tell my system that it should just unmount the bad drive instead of trying to read from it over and over again? Or can I make sure that at least a remote SSH connection stays possible and some kind of emergency reboot is available? I want to be able to solve the issue without having to be at home.
  2. If I am lucky I can restart the machine shortly before the IO wait gets out of hand. The next issue is that on startup my machine wants to mount all ZFS drives, including the bad one. If it can't mount the bad drive, the system never finishes booting. The only solution is a trial-and-error session with a few hard resets, removing different SATA cables to figure out which of the hard drives has the issue. Is there a trick to tell ZFS to just mount the drives it can mount and let the bad drives error out? I want to be able to always restart my machine and worry about the bad hard drives later.

I am guessing that the kernel hangs because it waits for an answer from the SATA device, and that answer just never comes.

Neither of your workarounds should be necessary if your drives have their error recovery settings set correctly. Then the SATA drive will time out on errors and send back a proper error message, instead of trying to read bad sectors ad infinitum.
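For reference, a minimal sketch of checking and setting SCT ERC (TLER) with smartctl; /dev/sdX is a placeholder for the actual device, and not every drive accepts or persists this setting across power cycles:

smartctl -l scterc /dev/sdX          # show the current SCT Error Recovery Control values
smartctl -l scterc,70,70 /dev/sdX    # set read/write recovery to 7.0 s (values are in tenths of a second)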

Also, the kernel log (dmesg) should report which specific drive is faulty.

Can you set TLER on these drives?

TLER is set:

# smartctl -l scterc /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.10.0-0.bpo.7-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

If so, ZFS should spend ~7 seconds trying to mount the bad drive, fail, and forget about it. Something's really wrong here if this does not happen.

Yeah, I think a kernel message is the last output I get over SSH. The problem is that at that point it is too late to call dmesg. In order to run that command I need a hard reset with the bad drive disconnected. It basically means I can only check dmesg after I have removed the bad drive.

I am using OpenMediaVault as operating system. Just in case that makes any difference.

One of my drives was, and maybe still is, perfect for reproducing this. I was able to boot up the system. With several ls commands I found out that I am unable to read 3 subfolders on that drive, a total of maybe 10 files or so. The system started normally. I was able to call ls on any other folder and it would respond just fine. If I try it on these 3 subfolders the command never finishes, I can't cancel it, and the system becomes unresponsive. A hard reset was the only solution. I copied all data over to a different drive, excluding these 3 subfolders, and was able to rescue the storage node that way. The issue with the unresponsive system was not resolved.
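As a side note, ZFS can usually name the damaged files itself once it has seen the read errors, which might be less painful than probing folders with ls; a minimal sketch, with tank as a placeholder pool name:

zpool status -v tank    # -v lists files with permanent errors, if any have been recorded

This only shows paths for errors ZFS has already encountered (e.g. after a scrub or a failed read), so it is not a substitute for a surface scan.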

Ok, I think what I'd do would be booting from USB, not mounting ZFS, and simply observing whether you can scan the surface of the failed drive with some badblocks-like command with no hangs. That would eliminate at least one source of complexity here.
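For example, a minimal read-only scan from a live environment might look like this; /dev/sdX is a placeholder for the suspect drive:

badblocks -sv /dev/sdX    # default mode is a non-destructive read test; -s shows progress, -v reports bad blocks as found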

That is a good idea and should be easy to set up in advance. I would still need to be at home to restart the box, though. As it is, I can already see from remote what is happening, but only a hard reset will allow me to fix it. It would be nice if I could avoid the hard reset.

Strange behaviour indeed… I'm using Ubuntu and I don't have a bad sector to reproduce the exact issue, but my eSATA enclosure (or my eSATA card) is a bit unreliable, and under heavy IO disks tend to get disconnected for up to 15 seconds. During that time IO wait goes up a bit, but I can always SSH into the system and it keeps working just fine. The HDDs don't have TLER set.

Other weird thing: when I reboot, those HDDs are not recognized correctly, but Ubuntu still boots up fine.

Of course, it’s not the same as HDDs with a bad sector…
It’s strange that zfs would cause you that kind of trouble.

ZFS has internal features that will pause all IO in the case of a loss of data integrity.

So if you are running on a single disk this may be very pronounced… not sure… however, the features can be turned off…

Ofc I have to tell you that they are there for a reason, and disabling them is highly likely to damage your data if you continue using the drive… you should clone it to a different drive at a slow pace in read-only mode, maybe using something like EaseUS recovery software…
Not aware of any good freeware versions of similar software.

Now for the ZFS features… can't say I have tinkered much with these.
I can't even find out what they are called…

The primary one is something like halt on failure or pause on error; basically it will stop all IO if you pull a drive from a raid that is non-redundant…
The feature can be disabled, like most… tho I've been trying to search for it without much luck.

Never actually had to use it; it should really only pause all IO if the pool becomes unreadable.
So really, when it goes to 100% IO wait because ZFS paused the pool, it just means ZFS essentially saved your data from getting corrupted.

I bet @Pentium100 can remember what the ZFS property, feature or whatever it is called is.

Oh, I think it might be this one:
https://openzfs.github.io/openzfs-docs/man/8/zfs-wait.8.html

nope…

zpool get all
on the command line will give you the list of pool properties.

zpool get failmode
zpool set failmode

as described in the zpool get output.
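For example, a minimal sketch with tank as a placeholder pool name:

zpool get failmode tank              # default is "wait": suspend IO until the device comes back
zpool set failmode=continue tank     # "continue" returns EIO to new write requests instead of blocking the pool
# the third value, failmode=panic, makes the host panic on catastrophic pool failure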

Yes please. I want my data to get corrupted. That is the whole idea here. A corruption of one node is better than all nodes offline for days.

You cannot corrupt data on ZFS without disabling a few features :smiley:
I found the way to disable it and added it to the comment I previously made…

You will need to do a zpool set failmode=continue;
then you can corrupt your pools, because it will run the data into the ground.

You really should recover the data instead of forcing it to be active/live…

But I would be very interested to hear if it works.

Let me know how it goes. I haven't had a reason to tinker with that myself… it has saved my pool a great many times… like every time I “accidentally” pull one drive too many from my raidz pools.

It looks to me like the bad drive is freezing or crashing the SATA controller or backplane. Normally,

  1. It should just time out after a while (see the note on the kernel-side timeout after this list).
  2. It should not affect the ability to access other drives on the system. The system should not freeze, unless the bad drive is also the one used for root or swap. At worst, ZFS may freeze, but if root is not on ZFS, the rest should just continue working.
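On the timeout point: besides the drive's own SCT ERC value, the kernel keeps a per-device command timeout in sysfs, which should stay above the drive's recovery time so the drive gets a chance to report the error itself. A quick check, with sdX as a placeholder:

cat /sys/block/sdX/device/timeout         # kernel-side command timeout in seconds (commonly 30 by default)
echo 60 > /sys/block/sdX/device/timeout   # as root: raise (or lower) it if needed; keep it above the SCT ERC time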

I had a similar problem in my file server, where a bad drive did something weird that would cause the SATA HBA driver to crash, though I don't remember if it caused the server to just freeze (even though the system drives are SCSI) or get a kernel panic. Replacing the drive fixed the problem; I was thinking that something had gone wrong with the motherboard or such, since the server would work fine after a reboot, until it tried to access some specific sectors on the bad drive.


Don't have anything to offer for the ZFS issue, but to help spot the problem drive, could you have another SSH session connected that is just tailing dmesg? Then, when you get locked in the IO wait, hopefully the last message(s) seen there will give you a clue to the problem drive? Assuming the message gets written and transmitted before the system locks up completely. See: linux - How can I see dmesg output as it changes? - Unix & Linux Stack Exchange
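For reference, a couple of ways to keep a live view of the kernel log in such a session (assuming a reasonably recent util-linux and systemd):

dmesg -w          # follow new kernel messages as they arrive
journalctl -kf    # kernel messages only, follow mode, via the systemd journal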


I tested different failmodes and I see no difference. What I can see is that even with the default failmode I am now able to open an SSH connection. → This is a bug and not a feature. It is getting fixed slowly in the background. I will keep failmode=continue and hope that this behavior gets better with additional fixes in the future.

The ZFS failmode is there to make sure your data doesn't get corrupted.
But I guess running with it on continue doesn't make you any worse off than with most regular setups…

Have you tried installing netdata? It will give you a ton of useful information on disk performance. Just because a disk works doesn't mean it performs flawlessly or that it doesn't create high latency… I've been dealing with those issues a lot since I started adding a lot of used HDDs to my old server… most of the time when I pull a drive it's not because it makes errors, but because it creates latency and slows everything down, if not just freezing the entire system for short periods.
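On Debian-based systems such as OMV, netdata is usually available straight from the package repositories; and even without it, iostat from the sysstat package gives a quick per-disk latency view (package names assumed as on Debian):

apt install netdata sysstat    # as root; netdata then serves its dashboard on port 19999 by default
iostat -x 5                    # extended device stats every 5 seconds; watch the await and %util columns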

This is particularly problematic because I've got live streaming running on the server, so stalls / high disk latency create a lot of grief.
But it also makes it very easy to actually see how often this seems to happen on older drives.
It can be very random, usually related to disk workload ofc; the more workload, the worse it usually gets, and then afterwards the errors start…
Sometimes it might not happen for a month, and other times it may make the system nearly unusable if it's a really bad disk.

Temperature also plays a major factor… it must be not too hot and not too cold.
