Node Failing to come online - bandwidth.db file broken

How is this fixed?
My node is dead in the water: this isn’t a self-correcting error, and renaming the file and restarting Storj doesn’t fix it either.

2023-03-12T14:28:49.991-0400	FATAL	Unrecoverable error	{"error": "Error starting master database on storagenode: database: bandwidth opening file \"E:\\\\Storj\\\\bandwidth.db\" failed: file is not a database\n\*DB).openDatabase:331\n\*DB).openExistingDatabase:308\n\*DB).openDatabases:283\n\\n\tmain.cmdRun:193\n\\n\\n\*Command).execute:852\n\*Command).ExecuteC:960\n\*Command).Execute:897\n\\n\\n\\n\tmain.(*service).Execute.func1:61\n\*Group).Go.func1:75", "errorVerbose": "Error starting master database on storagenode: database: bandwidth opening file \"E:\\\\Storj\\\\bandwidth.db\" failed: file is not a database\n\*DB).openDatabase:331\n\*DB).openExistingDatabase:308\n\*DB).openDatabases:283\n\\n\tmain.cmdRun:193\n\\n\\n\*Command).execute:852\n\*Command).ExecuteC:960\n\*Command).Execute:897\n\\n\\n\\n\tmain.(*service).Execute.func1:61\n\*Group).Go.func1:75\n\tmain.cmdRun:195\n\\n\\n\*Command).execute:852\n\*Command).ExecuteC:960\n\*Command).Execute:897\n\\n\\n\\n\tmain.(*service).Execute.func1:61\n\*Group).Go.func1:75"}

You will most likely need to do something like this, if you don’t have other ways of fixing it…
Stuff like this can happen during migrations, in which case running rsync again usually fixes it… but I doubt you are in that group…

No worries though, your node should be recoverable. First off, you most likely want to run a
fsck / chkdsk
to ensure that the filesystem or disk isn’t corrupted. If that comes back clean, proceed and follow the instructions in the post linked.
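Once the disk itself checks out, a quick way to see which database files are actually damaged is SQLite’s own integrity check. Here is a minimal stdlib-Python sketch (the function name and the `E:\Storj` path are just examples; point it at your own storage location):

```python
import sqlite3
from pathlib import Path

def check_db(path: Path) -> str:
    """Return 'ok', or SQLite's complaint (e.g. 'file is not a database')."""
    try:
        con = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            return con.execute("PRAGMA integrity_check;").fetchone()[0]
        finally:
            con.close()
    except sqlite3.DatabaseError as exc:
        return str(exc)

# Illustrative path: scan every .db file in the storage directory.
storage = Path(r"E:\Storj")
if storage.exists():
    for db in sorted(storage.glob("*.db")):
        print(db.name, "->", check_db(db))
```

Anything that does not report `ok` is a candidate for the recovery steps in the linked post; the healthy databases can be left alone.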

And if you want to be totally sure, wait a bit and let @Alexey weigh in.

Sadly, corrupt databases happen, especially when operating on regular filesystems and single-HDD setups.

And no need to worry about the downtime: your node can be offline for 12 days without much fuss.

Good to know about the 12 days. What’s weird is that my node runs from an iSCSI slice on a BTRFS array, so the redundancy and error-correction factor is significantly higher than a single disk, and I had just finished a chkdsk too.

Definitely going to run backups on those db files now

This KB didn’t help restore the lost data, but its workflow is accurate: scrap all DB files and start over when you get Error: in prepare, file is not a database (26)

Neither did this one:

Unfortunately, for an error like

there is only one solution,

and @7tigers is already aware of it.
BTRFS is known to be a poor choice for a storagenode because of its slowness. But in this case it could have been a network blip: since you use a network connection, the database file became corrupted. The same may happen to pieces, so monitor your suspension and audit scores as well.

Just delete all *.db files and restart the node. You only lose your local statistics.
I tried to recover the DBs several times, but with no luck, so I ended up deleting them.
If you have a lot of free time, try to recover. If you don’t want to spend it, just delete the files :wink:

It’s better to use the tutorial above so you do not lose all stats. That way you re-create only the corrupted database.
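Along those lines, a selective approach can be sketched in stdlib Python: move aside only the databases that fail SQLite’s integrity check, so the healthy ones (and their stats) stay in place. Stop the node first; the function names and the `corrupted` folder are my own invention, not anything official:

```python
import sqlite3
from pathlib import Path

def is_healthy(path: Path) -> bool:
    """True if SQLite can open the file and integrity_check reports 'ok'."""
    try:
        con = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            return con.execute("PRAGMA integrity_check;").fetchone()[0] == "ok"
        finally:
            con.close()
    except sqlite3.DatabaseError:
        return False

def quarantine_corrupted(storage_dir: Path) -> list[Path]:
    """Move only the *.db files that fail the check into a 'corrupted'
    subfolder; moving instead of deleting keeps a recovery attempt possible."""
    bad_dir = storage_dir / "corrupted"
    moved = []
    for db in sorted(storage_dir.glob("*.db")):
        if not is_healthy(db):
            bad_dir.mkdir(exist_ok=True)
            target = bad_dir / db.name
            db.rename(target)
            moved.append(target)
    return moved
```

After running it, follow the tutorial above to re-create just the quarantined databases, then start the node again.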

I didn’t know Storj was so dependent on bandwidth.db to run

Can we get a revision going where it writes to two DB files (local + NAS) and/or a daily rolling file for auto-recovery?

It is not the node’s job to back up databases, and doing so would not solve the root cause of the failure: unstable storage.

I would recommend addressing that part first.

If your client is Windows using the built-in iSCSI initiator, the issue is, in my experience, most likely shoddy, unstable network adapters. I would recommend buying a used Intel-based server adapter from eBay for your client, in case you are using a built-in adapter now.

Next, if you are using NTFS as a filesystem — run periodic checks. Or, better yet, switch to another filesystem, depending on how you run the node. You don’t need all that complexity for the storage node.

Or run it directly on a storage server, completely avoiding the iSCSI and network.


It’s a Windows VM and has been running very well for 2 years until now, and I had just finished a chkdsk, but everything has fallen apart since.

Windows VMXNET > iSCSI > vSwitch > Real switch > Synology BTRFS - 4x6TB disks

I built it that way for flexibility, recovery, speed, and expansion

Your Synology likely can run the node directly, did you try?

This is my biggest fear as a node operator: can we set up a cron job to copy these files every few hours as a backup, or does this increase the risk of corruption?

I think the Storj team, about a year ago or so, changed it so that the most important databases run with triplicate backup…
Most of the other databases are just for stats, calculating earnings, and maybe to work as a parity against the databases of the satellites.

I’m sure the satellites would overrule them in most cases, but it just makes sense to keep track of stuff locally too… stuff like that is what’s used to estimate earnings, even though one can do that with fairly simple math :smiley:

So yeah, long story short: I think there was a time when backing up databases could save a node; today, at worst you will lose stats.

All the databases are in the same Storj folder, so if one wants to bother doing backups of them, it should be pretty straightforward.

Copying a live database, however… not sure how well that would go; I can’t imagine it would always go well… and if it doesn’t, what’s the point… just wasting more resources on backups that only save local node stats and records.

The satellites keep track of most things; your node should be recoverable even with a full loss of all databases.
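On the live-copy concern: a plain file copy of a database the node is actively writing to can indeed capture a torn, inconsistent snapshot. SQLite, however, has an online backup API that produces a consistent copy even mid-write. A minimal stdlib-Python sketch of what a periodic cron / Task Scheduler job could call (function name and paths are my own, not anything Storj ships):

```python
import sqlite3
from pathlib import Path

def backup_dbs(storage_dir: Path, backup_dir: Path) -> int:
    """Copy every *.db file using SQLite's online backup API, which
    yields a consistent snapshot even while the database is in use."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for db in sorted(storage_dir.glob("*.db")):
        src = sqlite3.connect(db)
        dst = sqlite3.connect(backup_dir / db.name)
        try:
            src.backup(dst)  # page-by-page copy; restarts if a writer intervenes
            count += 1
        finally:
            dst.close()
            src.close()
    return count
```

Whether the backups are worth the I/O is the open question from the posts above, since at worst the databases only hold local stats.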

No, the needed information about a piece is stored inside the piece file itself. All databases are used for statistics now, not as parity for the satellite database.

I was referring to this… though I’m not sure if that was ever implemented; it did seem like a good idea to avoid or decrease corruption issues.
Yeah, it looks like something was done in that regard.

I didn’t mean that the satellites use the databases from the nodes, but that they function like a receipt.

Storagenode logs are a sort of parity / receipt of the satellite logs; in case of issues or inconsistencies, the two can be compared to track things down…

Not saying this happens automatically.

And the point was to describe why most, if not all, of the databases on the storagenode can be replaced or deleted without the node dying.

thus making

this pointless…

This issue is still open, so it’s unlikely to have been implemented.
