Piecestore monitor: timed out after 1m0s while verifying writability of storage directory

At some point I started to get these messages, and the node went offline after them.
It looked ridiculous, because everything had worked fine before and there were no other problems with the node.
There were no penalties to the audit score; only the online score went down a little because the node was offline from time to time.

Only when I decided to change the HDD and started copying files did I realize that the file copy was going wrong. The HDD was 100% active, but no files were moving for long stretches; only from time to time did files move normally. In 4 days I copied only around 200 GB. During this copy I saw around 160 file copy errors - 100MB… The HDD started to show Current Pending Sectors 1600+.

I checked the connections and power, and changed the SATA and power ports.

Now I have several questions.
1) Why didn't the audits discover that such a big amount of data was unreadable? It seems we may have a problem with the audit system.
2) Is there anything I can do to help make it better?

I’ve seen it several times already that a hard drive lasts a long time while barely used, then fails only after a lot of load is put on it in a short time span. Given that modern HDDs are also rated on the amount of data read/written, I suspect it’s not the fault of the storage surface/platters, but that some part of the actuator mechanism, maybe the heads themselves, simply has a limited workload rating. Maybe it can last a long time if lightly used, but quickly break apart shortly after starting some heavy operation, like your file copy.

This is just a hypothesis, and I’m not knowledgeable enough to prove or disprove it. However, it would explain why the audits were fine.

If this hypothesis is true, there’s not much you can do, except hope that a recovery tool like ddrescue can still work on your drive. If the errors are indeed random, then re-reading faulty sectors has a chance of returning correct data at some point. Not sure if this is worth the effort for a Storj node, though it’s probably also a nice testbed for trying out recovery tools on low-value data.
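If you want to try that route, a typical two-pass ddrescue run looks roughly like this (a sketch only; /dev/sdX as the failing source and /dev/sdY as the replacement are placeholders for your actual devices):

```
# Pass 1: grab everything that reads cleanly, skip the bad areas, record them in a map file
ddrescue -f -n /dev/sdX /dev/sdY rescue.map

# Pass 2: come back and retry only the areas marked bad, a few times each
ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map
```

The map file lets you stop and resume; later passes only re-read the sectors still marked bad, which is exactly the retry-and-hope approach described above.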

Sorry mate, but your hard disk is broken. So it’s probably not worth the labour, unless you have repaired drives before or see it as a fun project. In either case, this node has probably died. The data loss probably exceeds 5%, so the node will get disqualified in time.

See also https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/.

Especially the pending sector count is ominous…
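If you want to check those attributes yourself, smartmontools will print them (a sketch, assuming the drive shows up as /dev/sdX; replace with your actual device):

```
# Full SMART report for the drive
smartctl -a /dev/sdX

# Just the attribute table; look at 5 (Reallocated_Sector_Ct),
# 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable)
smartctl -A /dev/sdX
```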

You can always try to repair or format the drive and see whether you can write to and read back the whole drive. But it is probably a hardware failure.
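One way to do that whole-drive write-and-read-back check on Linux is badblocks in destructive write mode (a sketch only; this wipes every byte on the drive, and /dev/sdX is a placeholder for your device):

```
# DESTRUCTIVE: writes test patterns over the entire drive and reads them back,
# reporting any blocks that do not verify
badblocks -wsv -b 4096 /dev/sdX
```

Check the SMART attributes again afterwards; if the pending/reallocated sector counts keep climbing, the drive is not coming back.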

Audits don’t check all the data, only randomly selected pieces of randomly selected segments. They exist to help the satellite decide whether it should trust your node or not.
However, if it detects that more than 4% of pieces are not readable, your node will be disqualified; see a summary:

I can only assume that some of the unreadable pieces could still be read after 3 attempts with a 5-minute timeout, otherwise your audit score would have been affected.
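To illustrate why a damaged fraction can go unnoticed for a while (a rough back-of-envelope sketch only, not Storj's actual scoring formula; the 5% loss and 50 audits below are made-up numbers): if a fraction p of pieces is unreadable and each audit samples one random piece, the chance that n audits all miss the damage is (1 - p)^n.

```
# Rough illustration with hypothetical numbers, not Storj's real audit scoring:
# probability that at least one of n single-piece audits hits a bad piece
awk -v p=0.05 -v n=50 'BEGIN { printf "P(at least one failed audit) = %.2f\n", 1 - (1 - p)^n }'
```

With a small loss fraction and relatively few audits, the chance of tripping over a bad piece stays modest, which matches why the scores could look fine for a while.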