Put the node into an offline/suspended state when audits are failing

Alexey · July 18, 2022, 6:42am

Since your node is disqualified on all satellites, you can remove its data and identity and start from scratch: generate a new identity, sign it, and run a new node with clean storage (check permissions anyway).

hashbackup · July 18, 2022, 3:06pm

On Linux, it might be a good idea to use the ionice command when starting the node software.

With ionice, there are 3 scheduling classes: realtime, best-effort, and idle. The default is the best-effort class at priority 4 (0 is highest, 7 is lowest). The node software could be started with something like:

$ ionice -c2 -n0 <node executable path> arg1 ... argn

I’d avoid the realtime class, -c1, as it could lock up your system if something goes haywire in the node software.
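As a rough sketch of the idea above, a small wrapper function can apply the best-effort class at priority 0 whenever ionice is usable and otherwise fall back to running the command unchanged. The function name run_high_io_priority is made up for illustration, and the node path in the usage comment is the same placeholder as in the command above. Keep in mind that ionice priorities only matter if the kernel I/O scheduler in use honors them (historically CFQ, BFQ on newer kernels), so treat this as a starting point rather than a guarantee.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the best-effort class (-c2)
# at the highest best-effort priority (-n0).
run_high_io_priority() {
    # Probe first: ionice may be missing, or setting the priority may
    # not be permitted (for example inside some containers).
    if command -v ionice >/dev/null 2>&1 && ionice -c2 -n0 true 2>/dev/null; then
        ionice -c2 -n0 "$@"
    else
        # Fall back to running the command at the default I/O priority.
        "$@"
    fi
}

# Usage (placeholder path and arguments, as in the post):
# run_high_io_priority <node executable path> arg1 ... argn
```

Once the node is running, ionice -p followed by its process ID reports the class and priority actually in effect.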

As an alternative, it would also be possible to use ionice with a lower priority (higher -n) when doing maintenance commands, like your manual copy operation. I’d avoid using -c3, the idle class, because a process only gets serviced when a disk has been truly idle for a grace period. Instead of getting finished overnight, your copy command would likely take days with -c3.
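For the maintenance case, the same pattern works in the other direction. The sketch below keeps the copy in the best-effort class but at the lowest priority (-n7) instead of using the idle class. The function name run_low_io_priority and both directory paths are placeholders for illustration, not anything from the node software.

```shell
#!/bin/sh
# Hypothetical helper: run a maintenance command in the best-effort
# class (-c2) at the lowest best-effort priority (-n7).
run_low_io_priority() {
    # Probe first so the command still runs where ionice is unavailable.
    if command -v ionice >/dev/null 2>&1 && ionice -c2 -n7 true 2>/dev/null; then
        ionice -c2 -n7 "$@"
    else
        "$@"
    fi
}

# Usage: copy node data to a new disk while leaving disk time for the node.
# Both paths are placeholders.
# run_low_io_priority cp -a /mnt/old-disk/storagenode/. /mnt/new-disk/storagenode/
```

If a copy was already started without the prefix, ionice -c2 -n7 -p followed by its process ID lowers the priority of the running process.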

Since it would be easy to forget to add the ionice prefix on a maintenance command to lower its priority, it makes more sense to use ionice when starting the node software to raise its priority.

SGC · August 2, 2022, 6:24pm

HDDs and data cables are feeble and unstable things, which in general cannot be relied upon for 24/7 operation. i’ve already seen so many disk errors i would have to count them in the thousands.

this is why stuff like SAS has redundancy for just about everything, and even then it’s run in RAID to further reduce the chance of failures, and even then … it’s not safe without good backup practices.

granted, we cannot do backups in the case of storagenodes, but we can reduce the chance of failure for older nodes. of course that also comes with reduced profitability, added complexity and more work.

but sad to say, that’s just how it is… redundancy is paramount long term… it’s not for fun that enterprise grade gear often has redundancy on almost every level, or at least has the ability to do so, even if one doesn’t use it for performance or cost reasons.

that being said, it would be nice to see Storj Labs introduce a feature that gives people the chance to fix hardware issues, if that’s possible.