Storage node preflight database error: file is not a database

After a disk unmounting incident, I’m currently facing a similar issue with 3 of my nodes: their database files error with the following message when checking their PRAGMA integrity:

Error: file is not a database
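
A loop along these lines is enough to check them all (the paths are just my node 4’s layout, as an example):

for db in /home/pi/storj/mounts/disk_3/storj_node_4/storage/*.db; do
    echo "Checking $db"
    sqlite3 "$db" "PRAGMA integrity_check;"   # healthy files answer "ok"
done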

I followed https://support.storj.io/hc/en-us/articles/360029309111-How-to-fix-a-database-disk-image-is-malformed- but it does not solve the problem, as the generated “dump_all” files all contain the following:

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
/**** ERROR: (26) file is not a database *****/
ROLLBACK; -- due to errors
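
For reference, the article’s repair procedure boils down to something like this for each damaged file (the path and file name here are just examples):

# dump whatever sqlite can still read out of the damaged file
sqlite3 /path/to/storage/bandwidth.db ".dump" > dump_all.sql
# drop the transaction statements so the surviving rows can be re-imported
cat dump_all.sql | grep -v TRANSACTION | grep -v ROLLBACK | grep -v COMMIT > dump_all_notrans.sql
# replace the damaged file with a fresh database built from the dump
rm /path/to/storage/bandwidth.db
sqlite3 /path/to/storage/bandwidth.db ".read dump_all_notrans.sql"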

And so, the final step, which is supposed to fix the file, just creates an empty file, as @bar1 faced previously:

In my case we’re not talking about just 1 file though; most of them are in this state:

/!\ The following database files are KO:
/home/pi/storj/mounts/disk_3/storj_node_4/revocations.db
/home/pi/storj/mounts/disk_3/storj_node_4/storage/heldamount.db
/home/pi/storj/mounts/disk_3/storj_node_4/storage/notifications.db
/home/pi/storj/mounts/disk_3/storj_node_4/storage/piece_expiration.db
/home/pi/storj/mounts/disk_3/storj_node_4/storage/pieceinfo.db
/home/pi/storj/mounts/disk_3/storj_node_4/storage/piece_spaced_used.db
/home/pi/storj/mounts/disk_3/storj_node_4/storage/pricing.db
/home/pi/storj/mounts/disk_3/storj_node_4/storage/reputation.db
/home/pi/storj/mounts/disk_3/storj_node_4/storage/satellites.db
/home/pi/storj/mounts/disk_3/storj_node_4/storage/storage_usage.db

So, yeah… Am I doomed? ^^’

What happens if we delete all *.db files? Statistics will surely be lost for the current month, but will the node software recreate them in a clean way, so it can start working again?

the storagenode can be recovered without the db files, but you should try to avoid it, as it’s not a great solution and has some possible detrimental effects…

i really don’t know what the 0kb files mean…
doubtful that it’s a good thing tho…
are you sure the system actually can access the location where the databases are stored?
it’s not just some sort of shadow, or files storj created in the mount point folder because it was empty due to the mount not having connected?

can you see the blob files? and do they actually take up space?

also, when configuring linux or moving disks around, the /dev/sdX naming is quite loose … so if you used this to define which disks you are using then it can drift and other disks can take the name / definition instead…

so like /dev/sda is now /dev/sdc

i personally used /dev/disk/by-id/… instead

that’s based on the disk serial number and type of disk, and thus will never change even if you move the disk
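
something like this, instead of the raw /dev/sdX name (the id and mount point below are just placeholders):

# list the stable names the kernel gives your disks
ls -l /dev/disk/by-id/

# then refer to that name (or the partition UUID) in /etc/fstab, e.g.:
/dev/disk/by-id/ata-EXAMPLE_MODEL_SERIAL1234-part1  /mnt/storj_disk  ext4  defaults  0  2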

1 Like

Hey @SGC,

The system can access the location of blob files, yes. ls does report them as taking up space.
I see them, and they are in the right directory.
The disk seems correctly mounted, it’s listed in df -H -t ext4 and targets the right folder.
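
For the record, the checks were roughly (paths match my setup):

df -H -t ext4                                                     # the disk shows up, mounted on the expected folder
du -sh /home/pi/storj/mounts/disk_3/storj_node_4/storage/blobs    # the blobs really take up space on that disk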

All my disks are statically mounted via fstab, and I do use UUID too, if that’s what you mean.

Is that so? How can that be done? What are the drawbacks?

“File is not a database” errors can’t be repaired. The revocations.db isn’t an sqlite file, so you can ignore that one. It’s likely fine. As for the others, with that much damage you might as well start with a new set of databases. If you remove them all the node will create a new empty set. You’ll be missing a lot of stats and a small amount of payout because of unsent orders. But you’ll live to SNO another day.
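
Roughly, that looks like this; adjust the container name and paths to your setup, and moving the files aside is a bit safer than deleting them outright:

docker stop -t 300 storagenode          # give the node time to shut down cleanly
mkdir /home/pi/storj/mounts/disk_3/storj_node_4/storage/db_backup
mv /home/pi/storj/mounts/disk_3/storj_node_4/storage/*.db \
   /home/pi/storj/mounts/disk_3/storj_node_4/storage/db_backup/
docker start storagenode                # a fresh, empty set of databases is created on startup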

Also… Turn off write caching in RAM.

1 Like

you might get some extra lost data hiding in your node… but maybe that will be fixed some day in the future… the important thing is that the node survives…

as to how it’s done… i barely got a clue :smiley:

rebuilding the storagenode databases is the kind of stuff @Alexey and @BrightSilence are very familiar with

yeah uuid is fine… might be the same thing… not sure, there are like 16 different names for each device in linux it seems lol

I just removed all *.db files (except for the revocations.db one).

The node starts again!
Awesome! :heart_eyes:
Thanks @BrightSilence & @SGC.

Questions:

hdd caching is in 99% of all cases good to have turned on… only if you know a specific reason why it should be off is it worth turning off… tested it with my zfs pools doing storj … didn’t help me at least…

the article talks about turning it off to get increased data security… which might be true… but the performance loss is also quite significant… optimizations such as NCQ will not work because the disk has no clue about what it’s doing…

also there are in general two types of writes to a disk… async and sync… the sync writes will wait for acknowledgement that the writes are stored on disk, to ensure integrity for things like databases…
and most database writes should be sync writes… so i doubt there is much security gained from it…
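
a quick way to feel the difference on a disk (the test path is just an example):

# async: dd returns as soon as the data is sitting in RAM / the disk's cache
dd if=/dev/zero of=/mnt/test/ddtest bs=4K count=25000

# sync: every block waits until the disk confirms it is on stable storage
dd if=/dev/zero of=/mnt/test/ddtest bs=4K count=25000 oflag=sync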

it’s a very difficult topic and could in theory vary between manufacturers or disk classes…

the async writes can wait in memory for a good while and then later end up in the disk’s cache waiting to be written… this ofc exposes them to degradation because the data is only stored in one location, without parity bits… with at least some higher degree of redundancy a bit flip would most likely be corrected, but you can have like thousands of those over a few weeks or months… so eventually a couple will land just right and an error creeps through…

i would say if you experience enough power disruptions that you need to consider this then you should consider a small ups to ensure a graceful shutdown instead…

ensuring complete data integrity is very difficult and requires multiple layers of redundancy…

ECC ram is a good place to start, but the issues with normal RAM aren’t too bad, it will only damage a file from time to time… and the same goes for disk storage… a hard disk makes mistakes… quite often actually… and stuff like this will eventually damage critical stuff… and then depending on the programming or the data’s redundancy one might see degradation… mostly on our regular computers, dealing with mostly media, a wrong bit here and there isn’t going to make much of a difference…

but for storj a damaged file is most likely a damaged piece and will fail an audit… even if it’s just a single bit that’s off…

thus one will need something like checksums or raid6 to ensure the hardware can locate the errors on the storage solution.

1 Like

@SGC Thanks for all these details.

Would be worth having @BrightSilence’s and @Alexey’s take on this, because I’ve seen them suggesting to turn off write caching quite a few times throughout the forum. There must be a reason… ?

The UPS is obviously an excellent idea/solution. I did not plan on putting money into this though. Maybe later, if my nodes earn me enough to pay for one, I’ll consider it :slight_smile:

turning write caching off is basically never a good idea… i don’t care who says it…

the performance drop is massive… it’s in some cases not very noticeable, but in others it may cause performance to creep down to like 1/4 or less when working with high IO workloads.

i would be pretty confident that they must be talking about some other form of caching… there are many many layers of caching in computers… just your cpu alone will have 3+ levels of cache

That article talks about the disks cache itself, which likely wouldn’t have been an issue in your scenario, since you didn’t unplug the drive. But with a power outage it carries the same risk. My guess is that your disk writes were cached in system RAM as well. No database system can protect against sudden disruption when that’s the case. But turning it off will also have an impact on performance. So it’s up to you. Both RAM and disk cache are volatile, when they lose power the data in it is lost.
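
If you do want to look at the drive’s own write cache, hdparm can show and toggle it (the device name is just an example):

sudo hdparm -W /dev/sda     # show whether the drive's write cache is enabled
sudo hdparm -W0 /dev/sda    # disable it (costs write performance)
sudo hdparm -W1 /dev/sda    # re-enable it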

Here’s the upside though. While there is a small chance this will have impacted a piece or two on your node as well, you’ll almost certainly never find out. Even if you have 10 corrupted pieces that should be on your node right now, the chance of those getting audited is so low, it likely never happens. And even if it does, it’ll be a single failed audit and then nothing but successes again. Your node will definitely survive that.

As for the revocations.db, I think you can safely remove it, but I’m not entirely sure. I would leave it be, the chances that file is damaged are minimal. Doesn’t seem worth the risk.

1 Like

Alright, well then maybe it’s not that important to disable write caching. Power outages are not that frequent, and disk disconnections are simply not supposed to happen… I got unlucky with the new disk I plugged into my RPi 4B. We’ll see if this happens again in the future; it’s been stable for 4 days now.

And now that I know that a node can recover even without database files (which by the way is an incredible feature, kudos to the developers!), and as you mentioned that losing one or two pieces is not a big deal, maybe it’s better to keep the write caching for performance reasons. Especially for SMR drives… I would guess?

By the way I was wondering, when an audit fails, I guess the corrupted piece is removed from the node, right? I mean, it cannot cause other failed audits in the future?

SMR drives have particularly large caches and thus are more susceptible to short term data loss when disconnected, but they have a large cache for a reason…

one possible way to protect the databases is to place them on a couple of mirrored SSD’s with PLP (see the config sketch at the end of this post)
but 1 SSD with PLP, or even without it, would greatly limit the write time, and thus limit the possible data lost on a power failure… it’s partly that and CoW (Copy on Write) that’s my power loss protection.
then even if data is lost, the database would simply be a few milliseconds out of date.

which i would gamble that 9 times out of 10 shouldn’t matter… copy on write does come with a bit of extra overhead tho… but it does solve a big problem in data management and in regard to disconnections and power loss…

like say you have a database entry that is being rewritten… then it will copy / rewrite the data in a new location, then it will start to change the pointers, which are essentially the location of the data for the file…
it then goes through a sequence of these pointers that get corrected until it’s corrected the top tier one, and when it flips that bit, the file instantly goes from being the old version when read to being the new version… due to the pointer being changed… if power is lost during the process the pointer will remain unchanged and point to the old version of the file…

in the old system a rewrite of a file basically just overwrites it, and thus if it goes wrong during the process you get a half updated file and half the old file… which usually leads to a file not making sense to the computer.

Copy on Write file systems will most likely be what everybody uses at some point… unless something better comes along… i will never go back, that’s for sure…

PLP for SSD’s is Power Loss Protection… basically it means the SSD has a tiny capacitor with enough power to allow the SSD to do a graceful shutdown and write its volatile cache to its non-volatile memory.

that also mitigates a big issue with data loss on power outages… then next there is ram… which you can actually get with batteries so the memory contents can be kept and stored correctly after a power loss… dunno much about that version tho…

interesting stuff tho… it’s basically another version of the PCIe IO accelerator technology, which can increase cpu performance and also, to my understanding, with the right setup provide some additional levels of power loss protection for “ram” almost… not quite sure how exactly that’s supposed to work tho…

by far the easiest solution is a tiny tiny UPS that has like minutes worth of power and then just shuts down the system on power loss…

but CoW is basically a software solution to a hardware problem :smiley: or it helps a lot
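
going back to the databases-on-SSD idea above: if i remember right the node has a config option to move the sqlite databases off the data disk, something along these lines in config.yaml (option name quoted from memory, double check the docs before relying on it):

# move the node's sqlite databases to a separate (e.g. SSD) location
storage2.database-dir: /mnt/ssd/storj_node_4/db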

1 Like

@SGC I really appreciate your insights and the time you spend for sharing your knowledge :slight_smile:

I’ll be honest though, there are many concepts you’re talking about that are way beyond what I understand in the field of system administration ^^

My goal when starting the Storj adventure was to participate in a great idea/project, and make a bit of money on the side, but I honestly thought it would be set-and-forget software, running on its own on a modest pi somewhere in my living-room :wink:

It’s been a long way since my early days in the adventure. I have now written a 500+ line shell script for managing my nodes, set up cron jobs and a bit of monitoring, and learned a lot about disks, shell scripts and the linux system, the RPi 4B platform and its limitations, and probably many other things that don’t come to my mind right now…

But STILL, even though I’m glad to learn all these advanced concepts exist, I think a regular SNO shouldn’t have to worry about these incredibly complicated considerations professional hosting services have to handle :slight_smile:

That’s why I think even if it doesn’t solve everything, your conclusion is pretty neat:

Yes! That would have saved me a few sweats :sweat_smile:

1 Like

Ooh, yes probably a bad idea to disable on those. Wouldn’t be surprised if that slowed it to a crawl.

You would think so, but no. Audits are only there to determine whether your node can be trusted. Because such a tiny fraction of data is ever audited it makes no sense to actually fix the issues it finds, because for any missing or corrupted piece it finds there are statistically thousands of others.

it’s always like that with new tech… you fix one problem to discover a new one or the solution creates a new problem… but yeah i totally agree it should be set and forget… sadly i’m not sure there exists an option for that… but we are getting closer every day…

yeah i didn’t know how much i didn’t know about storage before i ran into the storj project lol
took me maybe 18 months to experiment and settle on what kind of a setup i wanted to go with… and that didn’t keep me from taking it all apart and remaking it like 4 times in the first 5 months lol

think i’ve copied 100tb during the last 6 months… weeks of pain … lol
just today i had to shut it down and try to figure out what there was wrong with one of my hdd’s, but sadly looks like it will be dying in the near future…

i focused mostly on the hardware side thus far… and expect to be running a couple of nodes maybe 3 sizable ones… just to have a couple of spares …

will also be rebuilding my data storage pool before or after xmas… depends on how long i can make it last before i’m forced to get more storage… no plan ever survives contact with the enemy … xD

I don’t see why the node should keep a corrupted piece eating up space uselessly? Even if it’s just a few MB. That’s very weird. If the piece was only partially written, because of a power outage for example, it could be invalid even though the disk surface isn’t faulty. Once the sat’ spotted this piece is wrong, it should take action to replace it somewhere else if necessary, and remove it from this node as it’s invalid. And decrease the node’s reputation a little bit, sure. But as you said:

Still not a reason to leave garbage on the node would be my take :slight_smile:
I mean, what is the satellite expecting? This corrupted piece is probably never going to fix itself…

Yeah, this is where humans tend to fail at statistics. It intuitively sounds weird. But this month on my node I’ve had 25000 audits. In that same period of time I’ve received 800000 uploads. And obviously there are many more pieces on the node from previous months. Satellites will never audit a large enough fraction of your data for it to be useful to fix the problems it finds. All it does is determine statistically whether your node can be trusted enough.
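
A quick back-of-the-envelope, with made up but plausible numbers:

# say 10 corrupted pieces out of ~1,000,000 stored, and 25,000 random audits per month
echo "scale=4; 25000 * 10 / 1000000" | bc    # ~0.25 expected audits hitting a bad piece per month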

The moment the file size of corrupted pieces starts to matter even a little bit is way beyond the moment your node should be disqualified. And besides, your node not having the piece is your fault or your hardware’s fault. And despite that, you’ll still be paid for it.

I can already predict the response that this still doesn’t make sense. But statistically speaking, it does.

1 Like

@BrightSilence
Blasphemy! Who doesn’t check all their data on a regular basis :smiley:

pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 1 days 03:00:11 with 0 errors on Sun Aug 16 00:46:21 2020
config:

        NAME                                             STATE     READ WRITE CKSUM
        tank                                             ONLINE       0     0     0
          raidz1-0                                       ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR11021EH2JDXB      ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR11021EH21JAB      ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR31021EH1P62C      ONLINE       0     0     0
          raidz1-2                                       ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_531RH5DGS             ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_99PGNAYCS             ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_Z252JW8AS             ONLINE       0     0     0
          raidz1-3                                       ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR31051EJS7UEJ      ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR31051EJSAY0J      ONLINE       0     0     0
            ata-TOSHIBA_DT01ACA300_99QJHASCS             ONLINE       0     0     0
        logs
          ata-OCZ-AGILITY3_OCZ-B8LCS0WQ7Z7Q89B6-part5    ONLINE       0     0     0
        cache
          ata-Crucial_CT750MX300SSD1_161613125282-part1  ONLINE       0     0     0

errors: No known data errors
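
for anyone who wants to run the same check on their pool, it’s basically just:

zpool scrub tank       # start a full checksum verification of the pool
zpool status tank      # check progress and results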

took a long time this time, most likely because of the hdd i got that is dying…

seems to have settled back to normal operations now that my scrub is done…
still think i’ll have to replace it tho… but i guess i can be happy that i only have to shuffle some disks around and still don’t need to go buy any…

i think a scrub has usually been 8 - 12 hours; broke the record last time with 8 hours, i think because the system had been online during the previous scrub and thus all the metadata was still in the l2arc 14 days later :smiley:

You really need your own topic to talk about your setup. This wasn’t a response to anything.