Storagenode Recovery Mode

BrightSilence · May 17, 2020, 7:21am

Lots of mixed discussion going on here. Let’s try to separate a few things out, starting with database corruption, which has been mentioned a couple of times.

As far as I’m aware, the databases hold only non-vital information. That data is not necessary for your node to function. Currently the only way to fail audits because of the databases is when they are corrupt and can’t be accessed at all. These errors would fall under the unknown audit failure category and wouldn’t lead to disqualification, but to suspension instead. Suspension already gives the SNO time to recover from this error by repairing the db, restoring a working db backup, or starting over with an empty db with the right schema. So there is already a way out of db corruption that doesn’t involve any lost data, and your node can be recovered. If there still is some db that is currently vital for audits (which I don’t believe is the case to begin with), then THAT should be fixed. We don’t need an additional method to recover from that.

That leaves two scenarios: either the node has lost part of its data, or it has lost everything.

Suppose the node has lost part of its data, say 1%. It would be nice if it could continue with the 99% that is still there. But I really don’t understand why everyone is so bent on restoring the missing data. The only way to restore that data is through repair, and most of those pieces would not have hit the repair threshold. That means wasting money on downloading 29 pieces to restore each single piece on your node. It’s incredibly wasteful, and the obvious option is much simpler: the node reports to the satellite which data is lost, the satellite marks those pieces as lost, and everyone moves on. The node is punished by losing data it used to have and the income related to that data. The tradeoff is that it no longer gets audited for that data and gets to live on. Restoring the lost data shouldn’t even be considered in this scenario, as it is of literally no benefit to the network. Repair will trigger automatically for pieces that need it, through the existing systems.

Now this leads to a few challenges:

How does the node determine which pieces were lost, without super expensive checks from the satellite?
How do we prevent nodes from using this system to cover up actual weaknesses in the hardware? If an HDD starts failing, the node can scan the HDD and keep reporting new pieces as lost, while in that scenario the entire node should no longer be trusted.
How do you prevent cheating with this system by the node simply reporting all files that are audited as lost?
How do we pay for the repair that eventually needs to happen because of this? Held amount? Do we build up part of that held amount again?
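
To make the report-and-mark-lost idea above a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Satellite` class, `report_lost` and `auditable` are names made up for illustration and are not part of any real storagenode or satellite API. It only shows the bookkeeping (reported pieces stop being auditable) and deliberately leaves the challenges listed above unsolved.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Satellite:
    """Toy stand-in for a satellite's piece bookkeeping (hypothetical)."""

    # node_id -> piece IDs the satellite believes the node still holds
    held: dict[str, set[str]] = field(default_factory=dict)
    # node_id -> piece IDs the node has reported as lost
    lost: dict[str, set[str]] = field(default_factory=dict)

    def report_lost(self, node_id: str, piece_ids: set[str]) -> set[str]:
        """Mark reported pieces as lost and return the ones accepted."""
        known = self.held.setdefault(node_id, set())
        accepted = piece_ids & known  # ignore pieces the node never held
        known -= accepted             # lost pieces are no longer held
        self.lost.setdefault(node_id, set()).update(accepted)
        return accepted

    def auditable(self, node_id: str) -> set[str]:
        """Pieces that can still be picked for an audit of this node."""
        return set(self.held.get(node_id, set()))


sat = Satellite(held={"node-a": {"p1", "p2", "p3"}})
accepted = sat.report_lost("node-a", {"p2", "p9"})
print(sorted(accepted))                 # ['p2']
print(sorted(sat.auditable("node-a")))  # ['p1', 'p3']
```

Note that nothing here stops a node from reporting a piece the moment it is audited, or from reporting a steady trickle of losses from a dying disk. Those are exactly the open questions above.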

The last scenario is that the node has lost all files. Restoring them from the network runs into the same problem again: it’s highly costly. It means downloading 29 pieces to recover 1, so the previous $450 per TB wouldn’t even get close to covering the cost. Sure, you can use that repair to restore more pieces, but a lot of these segments would likely still have more than enough pieces on the network to not be even close to needing repair. So there is no remotely affordable way to restore the data. This node has also shown itself not to be reliable. It can claim that the problem was fixed, but we can’t simply trust that, so it needs to be vetted again. The data was also lost, so we need to use the held amount for repairs. So what would we need to do to keep this node in the network? Empty it, vet it again, use the held amount and build up a new held amount. So… basically start a new node. The choice would then be: use the same identity and have that failure hanging over your head, or just start over. I’d say SNOs are better off just starting over.
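
For a rough feel of the numbers, here is a small back-of-the-envelope sketch in Python. It assumes all pieces are the same size, so that rebuilding 1 TB of lost pieces means downloading 29 TB. The 29 and the $450 per TB come from the post; the function names are made up for illustration, and the actual price of repair traffic is not stated here, so the sketch only computes the break-even point.

```python
PIECES_PER_REPAIR = 29  # pieces downloaded to rebuild one piece (figure from the post)


def repair_download_tb(lost_tb: float, pieces_per_repair: int = PIECES_PER_REPAIR) -> float:
    """Data that has to be downloaded to rebuild `lost_tb` of pieces.

    Assumes equally sized pieces and no reuse of a download for other pieces.
    """
    return lost_tb * pieces_per_repair


def break_even_price_per_tb(budget_per_lost_tb: float,
                            pieces_per_repair: int = PIECES_PER_REPAIR) -> float:
    """Highest price per TB of repair traffic that the budget would still cover."""
    return budget_per_lost_tb / pieces_per_repair


print(repair_download_tb(1.0))                    # 29.0
print(round(break_even_price_per_tb(450.0), 2))   # 15.52
```

In other words, under these assumptions $450 per lost TB only covers a full restore if repair traffic costs less than roughly $15.52 per TB downloaded. The claim above amounts to saying the real cost is well beyond that.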