What's the best course of action if I start seeing HDD errors?

I'm running a few storage nodes via Docker on my Synology.

I noticed that one of my nodes restarts every few minutes. Looking at the logs, I found these errors:

ERROR contact:service ping satellite failed … “ping satellite: check-in ratelimit: node rate limited by id”

ERROR piecestore:cache error getting current used space: {“process”: “storagenode”, “error”: “filewalker: readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/bw: input/output error”…

ERROR services unexpected shutdown of a runner {“process”: “storagenode”, “name”: “piecestore:cache”, “error”: “filewalker: readdirent config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/bw: input/output error”

INFO exited: storagenode (exit status 1; not expected)

WARN received SIGQUIT indicating exit request

The drive is an 8 TB Seagate Barracuda made in 2020.

I checked SMART health:

  • 5 Reallocated Sectors Count: 0
  • 197 Current Pending Sector Count: 72
  • 198 Offline Uncorrectable: 72

I'm not sure what to make of this SMART data, but I suspect the drive is dying. Then again, maybe it can keep working for quite some time…
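For other readers: attribute 197 (pending) and 198 (offline uncorrectable) above zero while 5 (reallocated) stays at zero usually means the drive has found sectors it cannot read but has not yet remapped them. A quick way to pull just those three attributes out of `smartctl -A` output; the sample below is a mock-up built from the numbers in this post, not real output from this drive:

```shell
# Mock `smartctl -A /dev/sdX` lines (values taken from the post above;
# this is NOT captured from the actual drive).
SAMPLE='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       72
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       72'

# On a real system, pipe `sudo smartctl -A /dev/sdX` instead of $SAMPLE.
# Print the attribute ID and its raw value (last field) for 5, 197 and 198.
echo "$SAMPLE" | awk '$1==5 || $1==197 || $1==198 {print $1": "$NF}'
```

This prints `5: 0`, `197: 72`, `198: 72` for the sample, matching the values reported above.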

So I disabled the filewalker, and the node now seems to run without restarts.
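For anyone wanting to do the same: one common way to disable the startup piece scan (the "filewalker") on a Docker node is the `storage2.piece-scan-on-startup` option appended to the run command. A sketch only; the image tag and all the usual mounts/ports are elided:

```shell
# Append the flag after the image name in your existing `docker run` command
# (all other options omitted here for brevity):
docker run ... storjlabs/storagenode:latest --storage2.piece-scan-on-startup=false
```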

Should I keep it running with the filewalker disabled for as long as it lasts, or should I start a Graceful Exit?

Right now you need to stop and remove the container, then check and repair the filesystem.

After that you can start the container again.
You may also migrate this node to a healthy disk if you have one.
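The usual migration pattern is to rsync repeatedly while the node is still running, then do one final pass with the node stopped. A sketch; the paths are placeholders for your current and new storage locations:

```shell
# Repeat this pass until it completes quickly (only a little new data each run):
rsync -aP /volume1/storj/ /volume2/storj/

# Then stop the node (give it time to shut down cleanly)
# and run a final pass that also removes files deleted in the meantime:
docker stop -t 300 storagenode
rsync -aP --delete /volume1/storj/ /volume2/storj/
```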

You may try to run a Graceful Exit, but only after the filesystem is fixed. Be aware that GE can end in disqualification if more than 10% of your node's transfers fail.
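If you do decide on Graceful Exit, it is started from inside the container. A sketch, assuming the default container name and the paths from the standard Docker setup:

```shell
# Interactive: you will be asked which satellite(s) to exit.
docker exec -it storagenode /app/storagenode exit-satellite \
  --config-dir /app/config --identity-dir /app/identity
```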


That HDD is attached to the Synology via USB, so low-level access to the drive is limited. It worked fine while it worked, but troubleshooting over USB is restricted. That's OK; I knew about that limitation.

By the way, here is my uptime log for that node:

[uptime chart screenshot]

YELLOW: 1 failed check (no response)
RED: 2 or more failed checks (no response)

We'll see if it gets greener now without the filewalker.

If it's NTFS, you can stop the node, safely eject the drive, and connect it to a Windows machine to check the filesystem. In that case, please also consider migrating to ext4. Unfortunately that's not possible in-place, so you would need to back up all the data first, then format the drive to ext4.

As far as I know, you can also enable SSH on the Synology and log in to a shell where you have sudo. With the disk unmounted, you can then check and repair the filesystem directly on the NAS.
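A sketch of that route, assuming the USB volume is ext4 and shows up as /dev/sdq1 (run `sudo fdisk -l` first; your device name and mount point will differ):

```shell
# Stop the container so nothing holds the filesystem open:
docker stop -t 300 storagenode

# Unmount the USB volume (the mount point varies per DSM setup):
sudo umount /volumeUSB1/usbshare

# Force a full check and automatic repair on the unmounted partition:
sudo e2fsck -fvy /dev/sdq1
```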

Check this to see what's coming…

https://forum.storj.io/t/so-long-tnx-for-all-the-fish/23218

So it's an SMR drive, and USB-connected: two bad choices.
The smart move here would be:

  • stop the node and check the drive on a PC if the Synology gives you a headache. Even if it's ext4, you can enable support for it in Windows.
  • get a good enterprise-grade drive of at least 8 TB, CMR.
  • move the data to it.
    Given the state of the drive, which may start degrading quickly or die, you should use the fastest route: put both drives in a PC with internal cabling and clone it, after checking and repairing the filesystem and marking the bad sectors. But this depends on which filesystem you are using now, which one you will use on the new drive, and where you intend to put it.
    Keep in mind that, from what we have seen on this forum, NTFS struggles with big drives used for storage nodes (SN); the best filesystem in our case is ext4, and no RAID. If you go for a new drive, look at the Seagate Exos 16 TB and above and the Toshiba MG10. If you have a UPS, the Seagate is much quicker; if you don't, the Toshiba is better because it can survive sudden shutdowns, but it is slower.
    Also consider no longer using the external HDD case, and do all the checking and moving from a PC with internal SATA cabling. Maybe the external case is dying too.
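For the cloning step: on a drive with pending sectors, plain `dd` will abort at the first read error, so GNU ddrescue is the usual tool. A sketch, with /dev/sdX as the failing drive and /dev/sdY as the new one (placeholders; triple-check the device names, ddrescue overwrites the target):

```shell
# First pass: copy everything readable, skip bad areas, record progress in a map file.
sudo ddrescue -f -n /dev/sdX /dev/sdY rescue.map

# Second pass: go back and retry only the bad areas, up to 3 times each.
sudo ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map

# Then check and repair the filesystem on the clone (assuming ext4 on partition 1):
sudo e2fsck -fy /dev/sdY1
```

The map file lets you interrupt and resume the clone without redoing finished areas.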

It doesn't matter whether it's enterprise-grade or not, but one final thing to look for: CMR. It will work perfectly until it dies.

…under Linux

The other recommendations do not apply to a Synology NAS (sorry, @snorkel).