Do access denied errors (due to filesystem permissions) fall under “missing or corrupted data”? I had a bunch pop up after a recent storage system update. Luckily I was looking at the log right after the restart.
It would also be nice if the SN detected a “data inaccessible” situation, to prevent disqualification due to accidents (data directory not mounted, etc.).
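To make the “data inaccessible” idea concrete, here is a rough sketch of a startup check. It is only an illustration, not how the SN actually works: the function name `data_dir_status` and the marker file `.storage-dir-verification` are made up for the example. The assumption is that a marker file is written into the data directory once, so an unmounted mount point (which just looks like an empty directory) can be told apart from a healthy one.

```python
import os

# Hypothetical marker file, written once when the data directory is first set up.
SENTINEL_NAME = ".storage-dir-verification"


def data_dir_status(data_dir, sentinel_name=SENTINEL_NAME):
    """Classify the data directory as 'ok', 'inaccessible' or 'missing-sentinel'."""
    # The path does not exist at all, or is not a directory.
    if not os.path.isdir(data_dir):
        return "inaccessible"
    # The directory exists but the current user cannot read, write or enter it
    # (for example after a permissions change during a storage update).
    if not os.access(data_dir, os.R_OK | os.W_OK | os.X_OK):
        return "inaccessible"
    # The directory is reachable but the marker is gone: most likely the real
    # volume is not mounted and we are looking at the bare mount point.
    if not os.path.isfile(os.path.join(data_dir, sentinel_name)):
        return "missing-sentinel"
    return "ok"
```

A node could refuse to start, or stop answering audits, on anything other than `"ok"`, so the outage shows up as downtime rather than as lost data.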
Instead of trying to specifically catch errors where the disk is inaccessible versus real corruption, wouldn't it be easier to not differentiate between audit failures at all and count all of them towards suspension? That way it doesn't matter whether you have real audit failures or transient ones. Either way the node is suspended and the user can try to fix it. The difference between real and transient audit failures shows up when the user tries to fix the problem and lift the suspension: the user won't be able to fix real audit failures, and the node will get DQed in a couple of days.
I see the above approach as much simpler to implement and manage - no branching logic, and easy to understand.
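Roughly what I have in mind, as a sketch only. The threshold and grace period are placeholder numbers, the names (`NodeState`, `record_audit`) are invented for the example, and lifting the suspension on the first successful audit is a simplification. Nothing here reflects the actual satellite code.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder values, chosen only for illustration.
FAILURE_THRESHOLD = 3                  # failed audits in a row before suspension
GRACE_PERIOD_SECONDS = 2 * 24 * 3600   # "a couple of days" to fix the node


@dataclass
class NodeState:
    failures: int = 0
    suspended_at: Optional[float] = None
    disqualified: bool = False


def record_audit(state, success, now):
    """Update the node state after one audit; every failure is treated the same."""
    if state.disqualified:
        return state  # disqualification is final
    if success:
        # The operator fixed the problem (or it was transient): lift suspension.
        state.failures = 0
        state.suspended_at = None
        return state
    state.failures += 1
    if state.suspended_at is None:
        # Not suspended yet: suspend once enough failures have piled up.
        if state.failures >= FAILURE_THRESHOLD:
            state.suspended_at = now
    elif now - state.suspended_at >= GRACE_PERIOD_SECONDS:
        # Still failing after the grace period: the data is really gone.
        state.disqualified = True
    return state
```

There is no check for why an audit failed. A transient problem clears itself with the next successful audit, and real data loss simply runs out the clock.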
It’d be great if you could configure the node to shut down in case of such errors. Some operators might have temporary storage problems that are more like “node unavailable” than “node unhealthy”. It would be nice to have something like max.audit.errors.to.shutdown in the configuration, to protect the node temporarily in case of some kind of failure.
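Something along these lines, purely as a sketch. `AuditErrorGuard` and its methods are hypothetical names, `max_errors` stands in for the proposed max.audit.errors.to.shutdown setting (which does not exist as far as I know), and I am assuming the counter resets after any successful audit.

```python
class AuditErrorGuard:
    """Call a shutdown hook after too many consecutive local audit errors."""

    def __init__(self, max_errors, shutdown):
        if max_errors < 1:
            raise ValueError("max_errors must be at least 1")
        self.max_errors = max_errors   # stand-in for max.audit.errors.to.shutdown
        self.shutdown = shutdown       # callback that stops the node
        self.consecutive = 0
        self.tripped = False

    def record(self, ok):
        """Record the outcome of one audit as seen by the node itself."""
        if ok:
            # One good audit means storage is reachable again.
            self.consecutive = 0
            return
        self.consecutive += 1
        # Trigger the shutdown only once, when the limit is first reached.
        if self.consecutive >= self.max_errors and not self.tripped:
            self.tripped = True
            self.shutdown()
```

With `max_errors` set to a small number, the node would go offline after a short burst of errors and take downtime instead of continuing to fail audits.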