More robust databases

Database corruption seems to be very common:

kuva

Clearly the current implementation is not too robust. Crash (or possibly even a reboot) can corrupt the databases and bring down a node, requiring manual repair.

I would suggest investing time and effort on making the database architecture more robust. This would both lessen the support burden here on forums and keep more nodes online.

3 Likes

Which database architecture do you have in mind? I am not aware of any database architecture that would survie data corruption.

Have we ruled out that the operators configured their system wrong? I would recommend a file system with journaling and disable hard drive write cache. That way the database doesn’t get corrupted in the first place.

1 Like

Yeah, I agree. Node operators who cannot provide even a basic level of storage-level reliability should be wed out from the network.

More data for us, the people who are capable of using the forum search tool.

I’m no database expert so I unfortunately don’t have an answer for that. All I know is that database corruption seems to be a very recurring problem. I’ve had it on my nodes at least twice.

If those are the recommendations they should be in the node setup docs. I just skimmed it all and didn’t find any mention about journaling (well, I guess basically every modern filesystem does that) or disabling hard drive write cache.

It feels unfair to claim that SNOs have “configured their system wrong” if they aren’t given the information about the right configuration in the official docs.

My main point is: lots of people have problems with database corruption. Something should be done to fix this.

2 Likes

I just recently had an error on my oldest node around 30months old now had its very first error “this is not a database”
So what anyone does it I transferred all my database files to windows and repaired them Firstly I tested to see which database was bad on the node itself, Was just easier to repair on windows. This has never happened to my nodes and I cannot explain why it happened but it did. It was working great then I ran a few scripts to check to see how my node is doing which is something I do every single month. I run the successful script to see how well my node has performed for that month to keep track. The issue has never happened before so I cannot explain why it happened.
If I wasnt experienced this would of been much more annoying to fix my node and get back running, But you can pretty much delete every single database file and restart your node to get it back running again.

So on that note it can happen for anyone for unknown reasons. cause this node is on a UPS and has been running for 89days before this happened to me.

Please check configuration for your disk - the write cache should be disabled.
It’s enabled on Windows by default


You should disable it:

In most cases it’s not necessary, however, Windows has a very aggressive update policy and can restart in any time. Sometimes it could corrupt databases when they are used.

While only backups would help in case the hard drive corrupted its data, some databases are less prone to data corruption after, say, an unexpected reboot. For example, InnoDB is much better for this than MyISAM. I do not know how it is with SQLite or the way the node uses SQLite, but it appears to be a high number of corrupted node databases, but, after repairing the databases, the nodes do not fail audits, so it looks like just the database somehow got corrupted.

There are a few ways to cause data corruption on SQLite database. Unexpected reboots are not the root cause. SQLite would still leave its database in a valid state. Most issues are located in the system the database is running on. Using the wrong filesystem can have a huge impact. Even some hard drives lie when it comes to SYNC requests and return success while the data was not written on the disk yet. I am not aware that there is a database that would run without corruption on the worst possible setup.

Not sure if you were replying to me or not but my nodes don’t run on windows there all running in Ubuntu server so write cache isn’t the issue in this case. This node has been running this way for 30 months now without a database error.

I think his message was more for a broader audience ideally all windows nodes that read his message.

1 Like

One idea about a potential fix:

  • Make a backup of the db files every now and then - perhaps once a day
  • Restore the backup if there’s an error with a db file
  • Mail the operator that there was an error, backup was restored and they may have lost a bit of non-essential statistics data

This way a node will almost never go offline because of a db corruption - and since the db files are non-essential it’s not a big deal if some data in them is lost.

You can do that right now with some scripting.
Of course it would be better to have a pull request though.

Why don’t implement a functionality that backups databases automatically to the STORJ DCS network directly into the node code? Then the node can automatically pull the backup in case it runs into an issue. And we would have more traffic for the network :wink:

copy on write is pretty great for data integrity, i never have database issues even tho being fairly mean to my setup from time to time.

most vm’s today is recommended running on copy on write, not sure if that is also in docker or such… but should be on it’s way.

thus switching to such a solution would require very little in actual changes since it’s basically just moving to a different sort of image / virtual partition / filesystem.

SQLite has a WAL file. Isn’t that a copy on write file? It doesn’t help if you run this on a system that just lies and claims to have written it on disk while it is still in the cache. DB and WAL file can get currupted at the same time.

1 Like

SQLite is very good engine, it already has mechanisms in place to deal with lost writes - I don’t think we need the SNO code to be bloated with “fix my node” routines for things that can be scripted - it puts blame to much on Storj, as then if anything went wrong we would all point finger at them, which not helpful.

There are things you can do to help - make sure you on correct kernel - 20.04.03 LTS HWE update will put you on 5.11 (there lots of disk related updates in here, but more aimed at USB3) although 5.4 is very stable and has critical issues backported. If you on pi3/pi4 then there are some quirks in 64bit kernel at moment, that have to be worked around - specifically around kernel memory allocation (it’s good enough for most things, but really suffer on heavy disk I/O, or complex data access patterns)

Write cache on linux is what the internet tells you to look at - there are many kernel parameters to control the disk subsystem - also remember that disks now lie, and can’t be relied upon to have 100% written your data to disk - just because you told a disk to write a block, doesn’t mean the controller hasn’t re-prioritised this, or that the write was executed but failed to update the platter - disks do have failures and without making the disk, write, sync, flush, read for every operation we can’t be certain.

If you have single disk, then maybe as other suggest write script to copy the databases every xx times a day - if you feeling fancy then there very good project to enable master/slave replication for SQLite to another host.

If you have redundant disks, see if your filesystem support metadata checksum ( this slow down IOPS but makes sure no bit rot)

Also look into Filesystems like ZFS (I know very little about this), or storage clusters like Gluster / Ceph / S2D

You should be very happy, 30 months with no errors is very good !

1 Like

i honestly has no clue, just saying if it isn’t copy on write, it could be very advantageous to make it so.

i duno the exacts of how much this is true, but i haven’t seen any issues with it… not sure if my zfs disables my disk cache, it might…

another option could be having multiple copies of the database, so each 5 minutes it simply makes a full copy of a working database and then if the first one breaks it rolls back to the backup.

would save a lot of trouble, and i doubt 5 minutes of lost database time is worse than a broken database.

for a long time i ran with a normal (non Power Loss Protection) SSD as a ZFS Slog device, that alone kept any error from sneaking in after 70 random reboots over a 3 day period.

so having the db on an ssd while using any copy on write solution i think the odds of anything going wrong would be minimal.

maybe that could be another possible path.

Same here. ZFS is designed to deal with the fallout of a powerloss and SQLite will make use of that by just sending the designated requests to the OS to make sure every write operation gets written into WAL and DB file in a way that SQLite can always recover after a power loss. This hole topic is more about setups that don’t use a good file system to the point that SQLite and also any other solution has no way to prevent the data loss. Think about the worst possible setup here. Let’s take a windows PC with write cache enabled. For high speed performance all write requests including the copy on write will just kept in memory for several seconds before windows will write them to disk. I would bet on data loss on almost all power loss situations and even with software that never had an issue in your or my ZFS setup.

Edit: I am not sure if the windows write cache is that evil. There are plenty of other bad setups. So it is just a placeholer here.

1 Like

yeah sync writes should be pretty immediate most of the time, its those async writes that can take minutes

Seems like that would be the only solution so far. I would add that we need at least 2 backups. On the worst possible setup you could still have data corruption right at the moment the backup is created. So we have to assume the DB is corrupt and the latest back is corrupted as well. In that situation just take the previous backup and continue.

@clement what would be the next steps? Create a github ticket?

4 Likes