I have a 4 TB node. The Seagate drive started throwing timeouts in the logs, so I promptly moved to another drive.
(… this also makes for a 100% failure rate across the Seagate desktop drives I’ve used for 24/7 operation so far, 8 in total. Some smaller ones for day-to-day use are still fine.)
I now have ~3+ million files copied off to a new drive, and the node runs again. However, 5 of the .sj1 data blobs couldn’t be copied off. I use btrfs on this node, so the checksums tell me the files that did copy over are good.
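For reference, here’s a minimal Python sketch of how one could enumerate exactly which blobs are unreadable before retiring the old drive; the mount point is a made-up example, not my actual path:

```python
#!/usr/bin/env python3
"""Enumerate .sj1 blobs that can no longer be read off a dying disk.

SRC is a hypothetical mount point; point it at the old node's blobs
directory. Any file that raises an I/O error while being read end-to-end
is reported, so you know exactly which pieces were lost.
"""
import pathlib

SRC = pathlib.Path("/mnt/old-disk/storage/blobs")  # hypothetical path

unreadable = []
for blob in SRC.rglob("*.sj1"):
    try:
        with open(blob, "rb") as f:
            while f.read(1 << 20):  # read in 1 MiB chunks to force real disk access
                pass
    except OSError as err:
        unreadable.append((blob, err))

for blob, err in unreadable:
    print(f"LOST: {blob} ({err})")
print(f"{len(unreadable)} unreadable blob(s)")
```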
Should I toss the node based on the ~99.9998% availability of the other objects (1 - 5/3,000,000)? Or should I wait to have it DQ’ed when an audit hits exactly those files? That’s unlikely to happen soon, but it might eventually.
The files’ owners will get their data back thanks to the erasure coding, so the node isn’t a huge liability to the network; but it’s not a perfect node anymore.
Should the project have an object-level resilvering function, where a node operator could request, at his own expense, to have those files repaired back onto his node to bring it back to a 100% state? This assumes the operator knows which files to ask the satellite to repair.
Or should nodes have a built-in ‘resilvering’ function to pay for repair back onto their own disks, based on the satellite knowing which files should be there? (A rough sketch of what such a request could look like is below.)
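To make that concrete, here’s a rough Python sketch of what such an operator-initiated repair request could look like. To be clear: the endpoint, field names, and billing flag are all invented for illustration; no such satellite API exists today.

```python
#!/usr/bin/env python3
"""Hypothetical 'resilver my node' request -- no such satellite API exists.

The endpoint, node ID, piece IDs, and billing flag below are all made up:
the node submits the pieces it knows it lost, offers to pay the repair
cost, and the satellite rebuilds them from the erasure-coded pieces held
elsewhere and ships them back.
"""
import json
import urllib.request

SATELLITE = "https://satellite.example:7777"  # hypothetical satellite URL
NODE_ID = "1a2b3c-example-node-id"
LOST_PIECES = ["piece-id-1", "piece-id-2"]    # the blobs that failed to copy

req = urllib.request.Request(
    f"{SATELLITE}/api/v0/repair-to-node",     # hypothetical endpoint
    data=json.dumps({
        "node_id": NODE_ID,
        "pieces": LOST_PIECES,
        "billing": "charge-node-operator",    # operator pays, per the proposal
    }).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```

Since the satellite already tracks which pieces each node is supposed to hold, both variants above end up as the same lookup on the satellite side; the only difference is who supplies the list.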
(Also - what would you do in my situation?)