Node suddenly failing

Alexey · June 25, 2024, 4:44am

Hello @lookitsbenfb,
Welcome back!

Please note that these checks are tests, and increasing their timeouts just means the checks become less effective at detecting a real problem: a slow or stalling (dying) disk.

Increasing any of these timeouts raises the risk of undetected hangs or hardware failures, and the node could be disqualified for failing audits.
For example, suppose you were forced to increase the readability check timeout to 5 minutes to stop the crashes. That also means your node may be unable to provide a piece to a customer or to an auditor for those same 5 minutes. And if the node is unable to provide a piece for an audit 3 times, with a 5-minute timeout each, that audit is considered failed.
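To put a number on that, here is a minimal sketch of the arithmetic, taking the figures above at face value (3 attempts, 5 minutes each). The function name `worst_case_audit_wait` is made up for illustration and is not part of the node software.

```python
from datetime import timedelta


def worst_case_audit_wait(timeout: timedelta, attempts: int = 3) -> timedelta:
    """Total time spent waiting if every attempt runs into the full timeout."""
    return timeout * attempts


# With a 5-minute timeout and 3 attempts, the wait adds up to 15 minutes
# before the audit is counted as failed.
print(worst_case_audit_wait(timedelta(minutes=5)))  # 0:15:00
```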

If you were forced to specify a higher timeout for the writability check, it means the node cannot accept pieces from customers fast enough either, so the success rate will be low, and the node will have lower usage and a lower payout.

So I wouldn’t recommend changing these timeouts too much. Increase them in 30s steps until the node no longer stops. However, if you reach 5 minutes for any of them, your disk likely has bigger issues than just the node crashing.

It is not expected that a disk cannot write a small file even after a minute.
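If you want a rough feel for how long a small write takes on the data disk, something like the sketch below could help. It is not the node's actual writability check, only an approximation: it writes a small temporary file, forces it to disk, and reports the elapsed time. The function name, the file prefix and the 512-byte size are all assumptions chosen for illustration.

```python
import os
import tempfile
import time


def time_small_write(directory: str, size: int = 512) -> float:
    """Write and fsync a small temp file in `directory`; return elapsed seconds."""
    payload = os.urandom(size)
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=directory, prefix="write-check-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the disk
        elapsed = time.monotonic() - start
    finally:
        os.remove(path)  # clean up the temporary file
    return elapsed


# Example run against the system temp directory; point it at the
# storage directory instead to measure the disk the node uses.
print(f"small write took {time_small_write(tempfile.gettempdir()):.3f}s")
```

On a healthy disk this should finish in well under a second. If it regularly takes tens of seconds while the node is running, raising the timeout only hides the problem.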

It’s better not to keep them too high, as explained above. So, when the defragmentation is finished, you may try commenting them out, saving the config and restarting the node, then monitoring it.
You may also tune the filesystem a little bit more:

and