Feedback regarding recovery process

I just had a power cut on what could best be called a ‘basic node’ made up of a Raspberry Pi 4 and a single 12TB external drive that had just hit 11TB of stored data. The power cut caused “database disk image is malformed” errors on restart.

I understand that the nature of the system means we cannot have a database engine with all the data protection found in full SQL engines, as the I/O cost would badly impact overall performance, but:

  • Can the local node not be designed to take snapshots of the database files once a day and allow a recovery back to that snapshot?

  • The current recovery document could be a lot clearer if it:

    • did not mix Linux and Windows commands into the same ‘flow’ of steps.
    • provided the listed steps and all the recommended checks as a single script (a sketch of such a script follows this list).
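
For illustration, here is a minimal sketch of what such a ‘check everything’ script could look like, in Python: it simply runs SQLite’s integrity check against every database file in the node’s storage directory. The directory path and the *.db pattern are assumptions about a typical setup, not values taken from the official documentation.

#!/usr/bin/env python3
"""Rough sketch of a single 'check everything' script for the recovery
steps: run SQLite's integrity check against every node database.
The directory below is an example path, not a documented default."""
import sqlite3
from pathlib import Path

DB_DIR = Path("/mnt/storagenode/storage")   # adjust to your node's config


def check(db_path: Path) -> str:
    """Return 'ok' for a healthy database, otherwise the first problem found."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            # PRAGMA integrity_check returns a single 'ok' row when the
            # database is healthy, or one row per detected problem.
            row = conn.execute("PRAGMA integrity_check;").fetchone()
            return row[0]
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        return f"error: {exc}"


for db in sorted(DB_DIR.glob("*.db")):
    print(f"{db.name}: {check(db)}")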

The overall system design means the network is well protected against the loss of nodes, but that does not really help an individual node operator. At the moment there are a lot of ‘worst case’ plans in place across at least the EU that anticipate a lot of possible power cuts. Currently, I get the feeling that many node operators would not be able to recover from this type of failure with the tools available.

The change in energy costs also means that a node operator is unlikely to restart from zero if they lose their active node, once they calculate the cost of the electricity used to run it. In my situation, my costs have increased by over 3x since I first created my node, and the running cost is now about 35% of my daily average income from running a node with 11TB of data storage. Starting again would not make much sense.


As far as I know, losing the databases today does not affect node health, since all metadata is stored in the data files themselves. So in case of a DB error you can simply delete the databases and the node will recreate them on restart.
You will only lose statistics; after the restart the node will also recalculate the stored data amount after some time.
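
If you go that route, a minimal sketch of the ‘move the databases aside and let the node recreate them’ step might look like the following (Python, to stay platform-neutral). The directory path is an example, and this assumes the node has already been stopped.

#!/usr/bin/env python3
"""Minimal sketch of the 'start without databases' approach described
above: with the node stopped, move the damaged databases (and their
WAL/SHM journals) out of the way so the node recreates them on the next
start. The directory is an example path, not a documented default."""
import shutil
import time
from pathlib import Path

DB_DIR = Path("/mnt/storagenode/storage")              # adjust to your setup
QUARANTINE = DB_DIR / f"db-quarantine-{int(time.time())}"
QUARANTINE.mkdir(exist_ok=True)

# Move rather than delete, so the old files can still be inspected later.
for pattern in ("*.db", "*.db-wal", "*.db-shm"):
    for f in DB_DIR.glob(pattern):
        shutil.move(str(f), str(QUARANTINE / f.name))
        print(f"moved {f.name}")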

That is good news if it is the case; it is a shame it is not clearly stated. On powering up, the Docker-based app just errors out, leaving the end user needing to know how to view the logs, and the article walks the reader through the recovery process without clearly stating the pros and cons until a footnote at the end.

A node operator is not an end user; a node operator is paid for operating and maintaining a node. That includes learning how to do it. The end user is the client who pays money for a high-quality storage service.


With the way SQLite is used in storage nodes (specifically, using the write-ahead log), you can pretty much copy the database files as a backup while they’re open. Any standalone copy like that has a pretty high probability of being recoverable, so if you made such a copy every hour and rotated the last four copies, that would be more than enough to recover the data while losing at most a few hours.
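
As a sketch of that copy-and-rotate idea, the following Python script uses sqlite3’s online backup API to take a consistent copy of each database and keeps only the newest four snapshots. The paths are assumptions about a typical layout; it could be run from cron (or a scheduled task) once an hour with the node left running.

#!/usr/bin/env python3
"""Sketch of the hourly copy-and-rotate idea, assuming the node's
databases live in DB_DIR and copies go under BACKUP_ROOT (both example
paths). Uses sqlite3's online backup API, which produces a consistent
copy even while the node has the database open in WAL mode."""
import shutil
import sqlite3
from datetime import datetime
from pathlib import Path

DB_DIR = Path("/mnt/storagenode/storage")          # adjust to your setup
BACKUP_ROOT = Path("/mnt/backup/storagenode-db")
KEEP = 4                                           # rotate the last four copies

dest = BACKUP_ROOT / datetime.now().strftime("%Y%m%d-%H%M%S")
dest.mkdir(parents=True, exist_ok=True)

for db in sorted(DB_DIR.glob("*.db")):
    src = sqlite3.connect(db)
    dst = sqlite3.connect(dest / db.name)
    src.backup(dst)              # consistent page-by-page copy
    dst.close()
    src.close()
    print(f"backed up {db.name}")

# Drop everything except the newest KEEP timestamped directories.
for old in sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir())[:-KEEP]:
    shutil.rmtree(old)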

I’m doing daily backups. I haven’t needed to recover from them yet, but a test recovery (i.e. copying the files from the backup and just opening the database with the command-line sqlite3 program) was successful.

That said, as Vadim said, these databases don’t really store information that is critical for correct functioning. The two most frequently updated uses are bandwidth statistics and piece expiration data. Bandwidth statistics are just informational. When piece expiration data is missing, the node will assume each piece is stored indefinitely, waiting for garbage collection instead of deleting the piece exactly when the customer desired, so the worst that happens is slightly increased disk use by such pieces.


I would just add to @Toyoo’s recommendation: he means backing up only the databases. A backup of the whole node is useless; as soon as you restore it and bring it online, the node will be disqualified for losing the customers’ data stored since the backup.
But a databases-only backup is a different story: you can start without databases at all and you will lose only statistics, not the node. So recovering the databases from a backup will not affect your node’s reputation.
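
To make the distinction concrete, here is a rough sketch of a databases-only restore, again in Python and with example paths: it copies just the backed-up .db files over the damaged ones while the node is stopped, leaving all the stored pieces untouched.

#!/usr/bin/env python3
"""Sketch of a databases-only restore (never restore the whole node's
data directory from backup). With the node stopped, copy the backed-up
.db files over the damaged ones, then start the node again. Both paths
are examples."""
import shutil
from pathlib import Path

BACKUP_DIR = Path("/mnt/backup/storagenode-db/20240101-000000")  # chosen snapshot
DB_DIR = Path("/mnt/storagenode/storage")                        # live database dir

for db in sorted(BACKUP_DIR.glob("*.db")):
    # Remove stale WAL/SHM journals left over from the damaged database,
    # so SQLite does not try to replay them against the restored copy.
    for journal in (DB_DIR / f"{db.name}-wal", DB_DIR / f"{db.name}-shm"):
        journal.unlink(missing_ok=True)
    shutil.copy2(db, DB_DIR / db.name)
    print(f"restored {db.name}")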
