Another round of disk issues

Since I finally got one of my issues solved, everything has been running much smoother, but now, without even a reboot, I’ve been seeing an uptick in iowait, mainly because of one drive acting up…

I’ve been trying to make it behave for weeks, if not a couple of months by now…

My problem with this particular drive, which is an ata-HGST_HUS726060ALA640,
is that it causes high latency; no real big problems otherwise…

I’ve tinkered and tested with different controllers, different cables and whatever else I could think of, short of taking the PCB off the drive to check for a bad connection…

I’m having an issue with it doing load cycles… a lot of load cycles… lots and lots…

Sat 29 Aug 2020 09:22:05 AM CEST
193 Load_Cycle_Count 0x0012 001 001 000 Old_age Always - 440365

Sat 29 Aug 2020 09:22:16 AM CEST
193 Load_Cycle_Count 0x0012 001 001 000 Old_age Always - 440375
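
To put a number on it: the two readings above are 11 seconds apart. A rough Python sketch of that arithmetic follows. The helper names `raw_value` and `cycles_per_minute` are made up for illustration, and it assumes the raw value is the last column of the smartctl line and that the weekday and timezone have been stripped from the timestamps by hand.

```python
from datetime import datetime

def raw_value(smart_line: str) -> int:
    """Return the raw value, assumed to be the last column of the attribute line."""
    return int(smart_line.split()[-1])

def cycles_per_minute(t0: str, line0: str, t1: str, line1: str) -> float:
    """Load cycles per minute between two timestamped samples."""
    fmt = "%d %b %Y %I:%M:%S %p"  # weekday and timezone removed beforehand
    seconds = (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()
    return (raw_value(line1) - raw_value(line0)) / seconds * 60

first = "193 Load_Cycle_Count 0x0012 001 001 000 Old_age Always - 440365"
second = "193 Load_Cycle_Count 0x0012 001 001 000 Old_age Always - 440375"

rate = cycles_per_minute("29 Aug 2020 09:22:05 AM", first,
                         "29 Aug 2020 09:22:16 AM", second)
print(f"{rate:.1f} load cycles per minute")  # 10 cycles in 11 s -> 54.5
```

That is only one 11-second window, so the long-run rate may well be different.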

I tried using hdparm to mess with the APM, but it seems to have little to no effect.
I tried hdparm -B
with 128, 254 and 255, and switched around between them… 255 was supposed to completely disable load cycles… but the HDD just keeps going…
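
For reference, here is a small Python sketch of what the -B values mean as far as the hdparm man page describes them: 1 to 127 permit spin-down, 128 to 254 do not, and 255 asks the drive to disable APM altogether (not every drive supports that). The function name `describe_apm` is made up for illustration. Note the man page talks about spin-down, not head parking specifically, so how a given level affects load cycles is up to the drive's firmware.

```python
def describe_apm(level: int) -> str:
    """Describe an hdparm -B value, following the ranges in the hdparm man page."""
    if not 1 <= level <= 255:
        raise ValueError("hdparm -B expects a value from 1 to 255")
    if level == 255:
        return "APM disabled altogether (if the drive supports disabling it)"
    if level >= 128:
        return "APM on, spin-down not permitted"
    return "APM on, spin-down permitted"

# The three values tried in this thread
for level in (128, 254, 255):
    print(level, "->", describe_apm(level))
```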

I guess I should try a reboot, but it’s not like that fixed it last time, when I changed it to 254, which was also supposed to turn it off…
I kinda suspect that maybe it has a bad connection, and that it’s not actually power management but something like power loss or motor failure that makes the head reset, which would explain why the APM settings don’t seem to affect anything…

So I just wanted to hear if anyone had some other good ideas before I start trying to take the PCB off it to clean the connections.

I moved the disk to a better-cooled location, went with hdparm -B 255 to turn off APM,
and then had to reboot or power down the machine / HDD completely.

Anyway, it seems to have fixed the issue, and the disk is running better than ever… Initially, when trying to fix the issue, I had it set to hdparm -B 254 and tried that for well over a month… because I was warned in more than one place that turning off APM could be detrimental / damaging, for some reason that was unclear…

But now that I was at the point where this particular drive would be dead in a week if I didn’t find a solution, there wasn’t really much risk…

Dunno what settings or such related to the HDD could cause this… I’ve got 5 of those drives and none of the others have presented with this particular problem… I’ve checked the hdparm settings and compared the SMART data, and the drive has been moved around in multiple bays and on multiple controllers…

Not sure that I can say the problem is fully fixed after 11 hours, but the drive used to go from 250 ms peaks when running well to 1.5 s+ under not too much load… now it seems to peak at about 60 ms… so that’s a great sign if nothing else… for now…

Oh yeah, and it stopped increasing the Load_Cycle_Count and the Power-Off_Retract_Count, which in this drive’s case show the exact same number of cycles…

192 Power-Off_Retract_Count 0x0032   001   001   000    Old_age   Always       -       442095
193 Load_Cycle_Count        0x0012   001   001   000    Old_age   Always       -       442095

Sat 29 Aug 2020 06:27:22 PM CEST
193 Load_Cycle_Count        0x0012   001   001   000    Old_age   Always       -       442095
194 Temperature_Celsius     0x0002   240   240   000    Old_age   Always       -       25 (Min/Max 11/41)

Sat 29 Aug 2020 09:13:37 PM CEST
193 Load_Cycle_Count        0x0012   001   001   000    Old_age   Always       -       442095
194 Temperature_Celsius     0x0002   240   240   000    Old_age   Always       -       25 (Min/Max 11/41)

It took a while… but the damn thing still does some APM stuff…

192 Power-Off_Retract_Count 0x0032   001   001   000    Old_age   Always       -       471614
193 Load_Cycle_Count        0x0012   001   001   000    Old_age   Always       -       471614
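
A quick Python sketch comparing this snapshot with the earlier one where both counters sat at 442095. The helper name `parse_attrs` is made up for illustration, and it assumes the raw value is the last column of each line. If the two counters climb by exactly the same amount, that would at least fit the power loss theory better than ordinary APM parking, though I can’t say for sure how this firmware counts them.

```python
def parse_attrs(text: str) -> dict:
    """Map attribute name -> raw value (assumed to be the last column)."""
    attrs = {}
    for line in text.strip().splitlines():
        cols = line.split()
        attrs[cols[1]] = int(cols[-1])
    return attrs

before = parse_attrs("""
192 Power-Off_Retract_Count 0x0032   001   001   000    Old_age   Always       -       442095
193 Load_Cycle_Count        0x0012   001   001   000    Old_age   Always       -       442095
""")
after = parse_attrs("""
192 Power-Off_Retract_Count 0x0032   001   001   000    Old_age   Always       -       471614
193 Load_Cycle_Count        0x0012   001   001   000    Old_age   Always       -       471614
""")

# Increase of each counter between the two snapshots
deltas = {name: after[name] - before[name] for name in before}
lockstep = len(set(deltas.values())) == 1
print(deltas)
print("counters move in lockstep:", lockstep)
```

Both counters went up by 29519 between the two snapshots.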

It seems like after running for a while it still does APM stuff, even though hdparm reports it as off:

/dev/sdb:
 APM_level      = off

Seems to me that it has to be a bad connection somewhere, since it takes a good while before it starts to act up… almost like it improves when turned off, and then after a bit of run time and vibration the connection craps out, or the workload gets high enough that it pulls too much power, which causes a voltage drop, and the thing turns off and then spins up again…

So I’m going to take it apart if nobody has a better idea…