More robust databases

konsou · September 28, 2021, 6:59pm

Database corruption seems to be very common:

Clearly the current implementation is not too robust. Crash (or possibly even a reboot) can corrupt the databases and bring down a node, requiring manual repair.

I would suggest investing time and effort on making the database architecture more robust. This would both lessen the support burden here on forums and keep more nodes online.

littleskunk · September 28, 2021, 8:21pm

Which database architecture do you have in mind? I am not aware of any database architecture that would survie data corruption.

Have we ruled out that the operators configured their system wrong? I would recommend a file system with journaling and disable hard drive write cache. That way the database doesn’t get corrupted in the first place.

Toyoo · September 28, 2021, 8:23pm

Yeah, I agree. Node operators who cannot provide even a basic level of storage-level reliability should be wed out from the network.

More data for us, the people who are capable of using the forum search tool.

konsou · September 28, 2021, 9:23pm

I’m no database expert so I unfortunately don’t have an answer for that. All I know is that database corruption seems to be a very recurring problem. I’ve had it on my nodes at least twice.

If those are the recommendations they should be in the node setup docs. I just skimmed it all and didn’t find any mention about journaling (well, I guess basically every modern filesystem does that) or disabling hard drive write cache.

It feels unfair to claim that SNOs have “configured their system wrong” if they aren’t given the information about the right configuration in the official docs.

My main point is: lots of people have problems with database corruption. Something should be done to fix this.

deathlessdd · September 29, 2021, 1:09am

I just recently had an error on my oldest node around 30months old now had its very first error “this is not a database”
So what anyone does it I transferred all my database files to windows and repaired them Firstly I tested to see which database was bad on the node itself, Was just easier to repair on windows. This has never happened to my nodes and I cannot explain why it happened but it did. It was working great then I ran a few scripts to check to see how my node is doing which is something I do every single month. I run the successful script to see how well my node has performed for that month to keep track. The issue has never happened before so I cannot explain why it happened.
If I wasnt experienced this would of been much more annoying to fix my node and get back running, But you can pretty much delete every single database file and restart your node to get it back running again.

So on that note it can happen for anyone for unknown reasons. cause this node is on a UPS and has been running for 89days before this happened to me.

Alexey · September 30, 2021, 5:39am

Please check configuration for your disk - the write cache should be disabled.
It’s enabled on Windows by default

You should disable it:

In most cases it’s not necessary, however, Windows has a very aggressive update policy and can restart in any time. Sometimes it could corrupt databases when they are used.

Pentium100 · September 30, 2021, 8:16am

While only backups would help in case the hard drive corrupted its data, some databases are less prone to data corruption after, say, an unexpected reboot. For example, InnoDB is much better for this than MyISAM. I do not know how it is with SQLite or the way the node uses SQLite, but it appears to be a high number of corrupted node databases, but, after repairing the databases, the nodes do not fail audits, so it looks like just the database somehow got corrupted.

littleskunk · September 30, 2021, 9:42am

There are a few ways to cause data corruption on SQLite database. Unexpected reboots are not the root cause. SQLite would still leave its database in a valid state. Most issues are located in the system the database is running on. Using the wrong filesystem can have a huge impact. Even some hard drives lie when it comes to SYNC requests and return success while the data was not written on the disk yet. I am not aware that there is a database that would run without corruption on the worst possible setup.

deathlessdd · September 30, 2021, 9:50am

Not sure if you were replying to me or not but my nodes don’t run on windows there all running in Ubuntu server so write cache isn’t the issue in this case. This node has been running this way for 30 months now without a database error.

littleskunk · September 30, 2021, 9:56am

I think his message was more for a broader audience ideally all windows nodes that read his message.

konsou · September 30, 2021, 3:48pm

One idea about a potential fix:

Make a backup of the db files every now and then - perhaps once a day
Restore the backup if there’s an error with a db file
Mail the operator that there was an error, backup was restored and they may have lost a bit of non-essential statistics data

This way a node will almost never go offline because of a db corruption - and since the db files are non-essential it’s not a big deal if some data in them is lost.

Alexey · October 1, 2021, 5:33am

You can do that right now with some scripting.
Of course it would be better to have a pull request though.

striker43 · October 1, 2021, 6:08am

Why don’t implement a functionality that backups databases automatically to the STORJ DCS network directly into the node code? Then the node can automatically pull the backup in case it runs into an issue. And we would have more traffic for the network

SGC · October 1, 2021, 7:22am

copy on write is pretty great for data integrity, i never have database issues even tho being fairly mean to my setup from time to time.

most vm’s today is recommended running on copy on write, not sure if that is also in docker or such… but should be on it’s way.

thus switching to such a solution would require very little in actual changes since it’s basically just moving to a different sort of image / virtual partition / filesystem.

littleskunk · October 1, 2021, 8:37am

SQLite has a WAL file. Isn’t that a copy on write file? It doesn’t help if you run this on a system that just lies and claims to have written it on disk while it is still in the cache. DB and WAL file can get currupted at the same time.

SGC · October 1, 2021, 10:04am

i honestly has no clue, just saying if it isn’t copy on write, it could be very advantageous to make it so.

i duno the exacts of how much this is true, but i haven’t seen any issues with it… not sure if my zfs disables my disk cache, it might…

another option could be having multiple copies of the database, so each 5 minutes it simply makes a full copy of a working database and then if the first one breaks it rolls back to the backup.

would save a lot of trouble, and i doubt 5 minutes of lost database time is worse than a broken database.

for a long time i ran with a normal (non Power Loss Protection) SSD as a ZFS Slog device, that alone kept any error from sneaking in after 70 random reboots over a 3 day period.

so having the db on an ssd while using any copy on write solution i think the odds of anything going wrong would be minimal.

maybe that could be another possible path.

littleskunk · October 1, 2021, 10:19am

Same here. ZFS is designed to deal with the fallout of a powerloss and SQLite will make use of that by just sending the designated requests to the OS to make sure every write operation gets written into WAL and DB file in a way that SQLite can always recover after a power loss. This hole topic is more about setups that don’t use a good file system to the point that SQLite and also any other solution has no way to prevent the data loss. Think about the worst possible setup here. Let’s take a windows PC with write cache enabled. For high speed performance all write requests including the copy on write will just kept in memory for several seconds before windows will write them to disk. I would bet on data loss on almost all power loss situations and even with software that never had an issue in your or my ZFS setup.

Edit: I am not sure if the windows write cache is that evil. There are plenty of other bad setups. So it is just a placeholer here.

SGC · October 1, 2021, 10:41am

yeah sync writes should be pretty immediate most of the time, its those async writes that can take minutes

littleskunk · October 1, 2021, 10:56am

Seems like that would be the only solution so far. I would add that we need at least 2 backups. On the worst possible setup you could still have data corruption right at the moment the backup is created. So we have to assume the DB is corrupt and the latest back is corrupted as well. In that situation just take the previous backup and continue.

@clement what would be the next steps? Create a github ticket?

clement · October 1, 2021, 11:20am

Yes. We should have this as a GitHub issue. This is an important topic and I don’t want us to lose it.

@konsou, do you mind opening this as an issue on our GitHub repo and adding your potential fix here: Sign in to GitHub · GitHub ?