Nodes autorepair

I truly think Nodes should fix themselves whenever possible.

Let’s take database corruption as an example: it is a common problem and is likely to keep happening to many SNOs, since a simple power outage can cause it.

There is a known procedure (https://support.storj.io/hc/en-us/articles/360029309111?_ga=2.6254707.1330931111.1590396912-5329027.1585062953) for fixing corrupted/malformed databases.

The Node could run a pragma check on its database files regularly (or right after it starts following a reboot or an update), or whenever something goes wrong with it.
If a database file proves to be broken, the Node could pause (disconnect from the satellites) and take the time to check and repair what can be repaired.
If the repair fails, then the SNO should be warned and the node suspended, I guess.
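
To illustrate the check step only, here is a minimal sketch in Python, assuming the node's databases are SQLite files ending in `.db` inside a single directory. The function names `check_database` and `find_broken_databases` are made up for this example and are not part of the storagenode software; the repair itself would still follow the linked procedure.

```python
import sqlite3
from pathlib import Path


def check_database(path):
    """Return True if SQLite's integrity check reports the file as intact."""
    # Open read-only so the check itself can never modify the file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check;").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Raised for "database disk image is malformed",
        # "file is not a database", unreadable files, and similar errors.
        return False
    # A healthy database yields exactly one row containing "ok".
    return rows == [("ok",)]


def find_broken_databases(directory):
    """Return the sorted names of *.db files that fail the integrity check."""
    return sorted(
        p.name for p in Path(directory).glob("*.db") if not check_database(p)
    )
```

A node could call something like `find_broken_databases(...)` at startup and only go online if the returned list is empty; otherwise it would pause, attempt the repair, and warn the SNO if the repair fails.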


More generally, I think that whenever a problem has a known solution that the Node software could apply automatically, this should be implemented.
That would greatly improve SNOs’ peace of mind, and also the quality and reliability of the STORJ network.

This is somewhat related to the thread “Ability to recover part of lost data / restore the node from backup”.

+1, this is just what I was thinking. It is one of the bigger problems right now: a few days ago I lost a 10 TB node due to an error in the DB.

Also, the held amount is used to repair all of the node’s files when actually only a part is damaged …

Cheers


You can lose the DB without losing the node:


You lost the node because of too many audit failures, i.e. missing or inaccessible customers’ data.
Lost data can’t be recovered by the node itself.

I also think this is a great idea.

I have lost many, many TBs of data stored on old nodes due to database and other random errors, and to my own impatient, failed attempts to fix them.

Why can’t we use the repair function on a node that has lost data?

I would rather pay for all the repair traffic (out of my escrow) than lose 100% of the escrow.

Losing the database only means losing the stats, not the customers’ data. It seems you have a mass disruption problem there. And if you have had several failed attempts, I would suggest checking your hardware more carefully; perhaps it is no longer suitable for handling data.

This is not possible for a simple reason: