Autocheck Database files on node restart with sqlite3

Is it possible to implement an automatic check of the databases with sqlite3 "PRAGMA integrity_check" on each node restart and display the result in the dashboard, like “Databases check OK” or “Databases check ERROR”? This would be very helpful for spotting problems such as a malformed database much earlier than the current situation, where you first notice scores dropping, maybe after a couple of days, then check the logs for errors, and only then stop the node and run the sqlite3 check. The dashboard gets looked at far more often than the logs. The situations that usually corrupt the databases also make the node restart, manually or automatically, like a USB connection failure or a power outage, so a startup check would catch the errors from the beginning. The dashboard signal could also trigger an alarm in the monitoring software that some more advanced SNOs use. Maybe integrating third-party software into the open-source Storj code isn’t a good idea, or isn’t possible because of licensing etc., but we all use sqlite3 for free anyway, and the latest available version isn’t required for a database check. So, if there is any possibility, it would be very helpful.

You may script this, for example with something like the sketch below.
I’m against the idea of including any system tool in the storagenode software. This is the OS’s job, not the storagenode’s.
The check itself is implemented: your node will crash if a database has become malformed or is no longer a database at all.
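A minimal sketch of such a script in Python (the path is just an example for a typical docker layout, and the OK/ERROR wording mirrors the suggestion above; adjust both to your setup, and run it before the node starts):

```python
#!/usr/bin/env python3
"""Minimal sketch: run PRAGMA integrity_check against every storagenode
database before the node starts. The path and the OK/ERROR wording are
assumptions -- adjust them to your own setup."""

import glob
import sqlite3
import sys

# Hypothetical location of the node databases; change to match your node.
DB_GLOB = "/mnt/storj/storagenode/storage/*.db"

def check(path: str) -> bool:
    """Return True if the database opens and passes PRAGMA integrity_check."""
    try:
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            return conn.execute("PRAGMA integrity_check;").fetchone()[0] == "ok"
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # e.g. "file is not a database"
        return False

failed = [p for p in sorted(glob.glob(DB_GLOB)) if not check(p)]
if failed:
    print("Databases check ERROR:", ", ".join(failed))
    sys.exit(1)  # a non-zero exit code can raise an alarm in monitoring software
print("Databases check OK")
```

You could run it from cron or a systemd unit before starting the node and feed the exit code into whatever monitoring you already use.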

I would definitely be interested in something like this, but I agree that it is not the responsibility of the node container itself to handle it. I have been having a lot of issues with malformed or corrupted databases on my nodes this summer, which resulted in quite a bit of manual work. I will see when I get time to script this up, to at least automate the analysis and the repair of the databases. If I ever get to it, I can surely share it with you @snorkel :+1: Roughly, I imagine something like the sketch below.
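A rough sketch in Python (the path is an assumption, the node must be stopped first, and badly damaged files may still need the manual repair procedure from the documentation):

```python
#!/usr/bin/env python3
"""Rough sketch of the analysis/repair automation mentioned above: check every
database and attempt a dump-and-reload for the ones that fail. The path is an
assumption, the node must be stopped first, and badly damaged files may still
need the manual repair procedure from the documentation."""

import glob
import os
import sqlite3

DB_GLOB = "/mnt/storj/storagenode/storage/*.db"  # assumption: adjust to your node

def is_ok(path: str) -> bool:
    """True if the database opens and passes PRAGMA integrity_check."""
    try:
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            return conn.execute("PRAGMA integrity_check;").fetchone()[0] == "ok"
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        return False

def dump_and_reload(path: str) -> bool:
    """Copy whatever is still readable into a fresh database file."""
    tmp = path + ".rebuilt"
    if os.path.exists(tmp):
        os.remove(tmp)  # discard leftovers from a previous attempt
    src = sqlite3.connect(path)
    dst = sqlite3.connect(tmp)
    try:
        dst.executescript("\n".join(src.iterdump()))  # may raise on unreadable pages
        ok = True
    except sqlite3.DatabaseError:
        ok = False
    finally:
        src.close()
        dst.close()
    if not ok:
        os.remove(tmp)
        return False
    os.replace(path, path + ".corrupt")  # keep the damaged file for inspection
    os.replace(tmp, path)
    return True

for db in sorted(glob.glob(DB_GLOB)):
    if is_ok(db):
        continue
    print(f"{db}: integrity check failed, attempting dump-and-reload")
    print("  repaired" if dump_and_reload(db) else "  could not repair automatically")
```

The dump-and-reload only keeps whatever is still readable, so rows may be lost, and if your databases use WAL mode you may want to run PRAGMA journal_mode=WAL on the rebuilt files afterwards.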

There is an idea circulating to be able to turn the databases off, all or some of them, for those who don’t want the history. You would get rid of malformed databases and the I/O associated with them.

It would be better to fix the root cause of the corruption, like using more reliable power or storage.

Yes, I agree @Alexey, but it is a mystery to me. These multiple events happened on different machines (different builds) in different locations, without any power outage. The drives are all enterprise Seagate HDDs, all directly connected via SATA (no USB), and the nodes are running in docker with the recommended --mount option. Some cases in the past were failing drives, but on these occasions the SMART values look normal. Any idea what I could try to find out more about the root cause, @Alexey?

@Snorkel I still think it is useful to have more automation; when this does happen, it is less time-consuming :).

What is the filesystem on these disks?
Note that network filesystems are not supported; they may work until they suddenly stop. The only network storage protocol that works is iSCSI.

I’m against adding such disruptive automation to the storage node. It should do only what it is designed for: store and retrieve pieces of segments of the customers’ files.
But you may write a script to automate whatever you want.

These are ext4 filesystems, directly connected to a PCIe SATA card.
When I look at the system’s uptime, the last reboot was months ago, and there haven’t been any power outages. However, looking at the dmesg logs, I found some entries indicating problems with the filesystem:

[13017115.758757] EXT4-fs: 10 callbacks suppressed
[13017115.758775] EXT4-fs (sdb1): Delayed block allocation failed for inode 353971742 at logical offset 0 with max blocks 8 with error 117
[13017115.758805] EXT4-fs (sdb1): This should not happen!! Data will be lost
[13017115.758912] EXT4-fs (sdb1): Delayed block allocation failed for inode 353971746 at logical offset 0 with max blocks 8 with error 117
[13017115.758930] EXT4-fs (sdb1): This should not happen!! Data will be lost

I will try to rescan the filesystem with fsck and see if it can identify any issues. What boggles my mind is that these are 4 brand-new drives that were fully checked before being put into operation less than a year ago, and all SMART values look great. It is strange to see multiple drives showing issues like this. Any ideas where this could come from, @Alexey?

On the automation of recovery:
I agree that the storagenode container should not have any scope beyond storing and retrieving pieces. For now, having an (external) script/tool that automates the recovery would be beneficial, but only because I have not yet found the root cause of the issue.

This is a clear sign that hardware is failing. It might be something trivial, like a SATA cable not being fully plugged in, or failing hardware, like a bad SATA cable, a failing HDD, or an overheating SATA card. It might be a leftover from a bad system shutdown; if so, fsck would help. It might also be a memory error, which is actually the simplest thing to test: just run your local copy of memtest86+.

Once I found a bad USB cable by running badblocks—it turned out the cable was flipping a bit every ~6 GB of transferred data. :person_shrugging:
