Node Failing to come online - bandwidth.db file broken

7tigers · March 12, 2023, 6:37pm

How is this fixed?
My node is dead in the water because this isn’t a self-correcting error, and renaming the file and restarting Storj doesn’t fix it either.

2023-03-12T14:28:49.991-0400	FATAL	Unrecoverable error	{"error": "Error starting master database on storagenode: database: bandwidth opening file \"E:\\\\Storj\\\\bandwidth.db\" failed: file is not a database\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:331\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:308\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:283\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:193\n\tstorj.io/private/process.cleanup.func1.4:377\n\tstorj.io/private/process.cleanup.func1:395\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfigAndLogger:92\n\tstorj.io/private/process.ExecWithCustomConfig:74\n\tstorj.io/private/process.Exec:64\n\tmain.(*service).Execute.func1:61\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75", "errorVerbose": "Error starting master database on storagenode: database: bandwidth opening file \"E:\\\\Storj\\\\bandwidth.db\" failed: file is not a database\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:331\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:308\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:283\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:193\n\tstorj.io/private/process.cleanup.func1.4:377\n\tstorj.io/private/process.cleanup.func1:395\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfigAndLogger:92\n\tstorj.io/private/process.ExecWithCustomConfig:74\n\tstorj.io/private/process.Exec:64\n\tmain.(*service).Execute.func1:61\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75\n\tmain.cmdRun:195\n\tstorj.io/private/process.cleanup.func1.4:377\n\tstorj.io/private/process.cleanup.func1:395\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfigAndLogger:92\n\tstorj.io/private/process.ExecWithCustomConfig:74\n\tstorj.io/private/process.Exec:64\n\tmain.(*service).Execute.func1:61\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

SGC · March 12, 2023, 7:25pm

you will most likely need to do something like this.
if you don’t have other ways of fixing it…
stuff like this can happen during migrations, then running rsync again usually fixes it…
but i doubt you are in that group…

no worries tho, your node should be recoverable, first off you most likely want to do a
fsck / chkdsk
to ensure that the filesystem or disk isn’t corrupted, if that is good then you should proceed and follow the instructions in the post linked.

and if you want to be totally sure, give it a bit and let @Alexey weigh in

sadly corrupt databases happen, especially when operating using regular filesystems and single HDD setups.

and no need to worry about the downtime, your node can be offline for 12 days without much fuss.

7tigers · March 13, 2023, 12:35am

Good to know about the 12 days. What’s weird is that my node runs from an iSCSI slice on a BTRFS array, so the redundancy and error correction should be significantly better than on a single disk, and I had just finished a chkdsk too.

Definitely going to run backups on those db files now

This KB didn’t help restore the lost data, but its workflow is accurate: scrap all the DB files and start over when you get "Error: in prepare, file is not a database (26)".

This one didn’t help either: https://support.storj.io/hc/en-us/articles/360029309111

Alexey · March 13, 2023, 3:08am

Unfortunately, for an error like this one there is only one solution, and @7tigers is already aware of it.
BTRFS is known to be a poor choice for a storagenode because it is slow. But in this case the cause could be a network blip: since the storage is reached over a network connection, the database file could have become corrupted that way. The same may happen to pieces, so monitor your suspension and audit scores as well.

padso · March 13, 2023, 7:37am

Just delete all *.db files and restart the node. You only lose your local statistics.
I tried to recover the db several times, but with no luck, so I ended up deleting them.
If you have a lot of free time, try to recover. If you don’t want to spend the time, delete the files.

Alexey · March 13, 2023, 8:47am

It’s better to use the tutorial above so that you don’t lose all your stats. That way you re-create only the corrupted database.
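
To re-create only the corrupted database, it helps to know which files are actually damaged. The following is a minimal sketch, not part of the tutorial mentioned above: it opens every *.db file in a folder read-only and runs SQLite's `PRAGMA integrity_check`. The function names `check_database` and `find_corrupt_databases` are hypothetical, and the sketch assumes the node is stopped while it runs.

```python
import sqlite3
from pathlib import Path


def check_database(path):
    """Return 'ok' if SQLite's integrity check passes, else a short error text."""
    try:
        # mode=ro opens the file read-only, so the check never modifies it.
        uri = f"{Path(path).resolve().as_uri()}?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error as exc:
        return f"cannot open: {exc}"
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # A file with a damaged header fails here ("file is not a database").
        return f"corrupt: {exc}"
    finally:
        conn.close()
    if rows == [("ok",)]:
        return "ok"
    return "corrupt: " + "; ".join(str(row[0]) for row in rows)


def find_corrupt_databases(folder):
    """Map each failing *.db file name in the folder to its error text."""
    results = {}
    for db in sorted(Path(folder).glob("*.db")):
        status = check_database(db)
        if status != "ok":
            results[db.name] = status
    return results
```

Calling `find_corrupt_databases(r"E:\Storj")` (the folder from the log above) would return a dict of the files that fail the check; an empty dict means every database opened and passed.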

7tigers · March 13, 2023, 2:34pm

I didn’t know Storj was so dependent on bandwidth.db to run

Can we get a revision going where it writes to two DB files (local + NAS) and/or a daily rolling file for auto-recovery?

arrogantrabbit · March 13, 2023, 3:50pm

It is not the node’s job to back up databases, and doing so would not solve the root cause of the failure: unstable storage.

I would recommend addressing that part first.

If your client is Windows using the built-in initiator, the issue is, in my experience, most likely a shoddy, unstable network adapter. I would recommend buying a used Intel-based server adapter from eBay for your client, in case you are using the built-in adapter now.

Next, if you are using NTFS as a filesystem — run periodic checks. Or, better yet, switch to another filesystem, depending on how you run the node. You don’t need all that complexity for the storage node.

Or run it directly on a storage server, completely avoiding the iSCSI and network.

7tigers · March 16, 2023, 12:34pm

It’s a Windows VM and has been running very well for 2 years until now, and I had just finished a chkdsk, but everything has fallen apart since.

Windows VMXNET > iSCSI > vSwitch > Real switch > Synology BTRFS - 4x6TB disks

I built it that way for flexibility, recovery, speed, and expansion

Alexey · March 19, 2023, 2:26am

Your Synology can likely run the node directly. Did you try that?

cpare · March 19, 2023, 2:43am

This is my biggest fear as a node operator. Can we set up a cron job to copy these files every few hours as a backup, or would that increase the risk of corruption?

SGC · March 19, 2023, 9:31am

i think the storj team changed it about a year ago or so, so that the most important databases run with triplicate backup…
most of the other databases are just for stats, calculating earnings, and maybe to work as a parity against the databases of the satellites.

i’m sure the satellites would overrule it in most cases, but it just makes sense to keep track of stuff locally also… stuff like that is what’s used to estimate earnings, even tho one can do that with fairly simple math

so yeah long story short, i think there was a time where backing up databases could save a node, today i think at worst you will lose stats.

all the databases are in the same storj folder, so if one wants to bother doing backups of them, it should be pretty straightforward.

copying a live database however… not sure how well that would go, can’t imagine that would always go well… and if it doesn’t, what’s the point… just wasting more resources on backups that only save local node stats and records.

the satellites keep track of most things, your node should be able to be recovered even with a full loss of all databases.

Alexey · March 20, 2023, 4:00am

No, the information needed about a piece is stored inside the piece file itself. All databases are now used only for statistics, not as a parity for the satellite database.

SGC · March 20, 2023, 9:41am

was referring to this… tho not sure if that ever was implemented, did seem like a good idea to avoid / decrease corruption issues.
yeah looks like something was done in that regard.

i didn’t mean that the satellites used the databases from the nodes, but that it functions like a receipt.

like storagenode logs are sort of a parity / receipt of the satellite logs, which can in case of issues or inconsistencies be compared to track it down…

not saying this happens automatically.

and the point was to attempt to describe why most if not all of the databases on the storagenode can be replaced / deleted without the node dying.

thus making

this pointless…

Alexey · March 21, 2023, 3:07am

This issue is still open, so it’s unlikely that it has been implemented.

github.com/storj/storj

Database files get corrupted too often

opened 01:25PM - 06 Oct 21 UTC

konsou

SNO

Reference forum thread: https://forum.storj.io/t/more-robust-databases/15483 …

Database images get corrupted too often:

![kuva](https://user-images.githubusercontent.com/27852389/136210571-999ce1d7-5d3a-4a35-ba2d-b35c8cb23075.png)

This takes nodes offline and causes user frustration.

One possible fix:
- Make a backup of the db files every now and then - perhaps once a day
- Restore the backup if there’s an error with a db file
- Mail the operator that there was an error, backup was restored and they may have lost a bit of non-essential statistics data

This way a node will almost never go offline because of a db corruption - and since the db files are non-essential it’s not a big deal if some data in them is lost.

A good point about backing up Sqlite by hashbackup:

> One other gotcha about SQLITE: if it has an active journal or wal, you cannot make a copy of the db file with OS commands. These two files go together and with OS commands you can’t atomically make a copy of both. You have to use the .backup command in the sqlite3 CLI or the SQLite backup API to get a good copy if there is an active journal or wal.
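
The point about the SQLite backup API can be illustrated with Python's standard `sqlite3` module, whose `Connection.backup` method wraps that API. This is only a sketch of an operator-side backup script, not something the storagenode does: the function names `backup_database` and `backup_all` and both folder arguments are hypothetical. Each copy is written to a temporary file first, so a database that can no longer be read does not overwrite the previous good backup.

```python
import sqlite3
from pathlib import Path


def backup_database(src_path, dest_path):
    """Copy one SQLite database with the backup API, not an OS-level file copy."""
    src = sqlite3.connect(str(src_path))
    dest = sqlite3.connect(str(dest_path))
    try:
        # Connection.backup copies a consistent snapshot of the source database.
        src.backup(dest)
    finally:
        dest.close()
        src.close()


def backup_all(db_folder, backup_folder):
    """Back up every *.db file; return (copied, failed) lists of file names."""
    backup_folder = Path(backup_folder)
    backup_folder.mkdir(parents=True, exist_ok=True)
    copied, failed = [], []
    for db in sorted(Path(db_folder).glob("*.db")):
        tmp = backup_folder / (db.name + ".tmp")
        tmp.unlink(missing_ok=True)
        try:
            backup_database(db, tmp)
        except sqlite3.Error:
            # Keep the previous good backup if the source cannot be read.
            tmp.unlink(missing_ok=True)
            failed.append(db.name)
            continue
        # Swap the finished copy into place only after it succeeded.
        tmp.replace(backup_folder / db.name)
        copied.append(db.name)
    return copied, failed
```

A scheduled task or cron job could call `backup_all` once a day, as the issue suggests, and alert the operator whenever the `failed` list is not empty.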