Hi, I woke up this morning to this error. The docker container was in a continuous reboot loop and I see this in the log. Can anyone point me in the right direction here? Thanks
Error: Error starting master database on storagenode: database: file is not a database
storj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:350
storj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:336
storj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:313
storj.io/storj/storagenode/storagenodedb.Open:245
main.cmdRun:152
storj.io/private/process.cleanup.func1.4:362
storj.io/private/process.cleanup.func1:380
github.com/spf13/cobra.(*Command).execute:842
github.com/spf13/cobra.(*Command).ExecuteC:950
github.com/spf13/cobra.(*Command).Execute:887
storj.io/private/process.ExecWithCustomConfig:88
storj.io/private/process.ExecCustomDebug:70
main.main:336
runtime.main:204
After a reboot I now get this:
Error: Error starting master database on storagenode: group:
— stat config/storage/blobs: bad message
— stat config/storage/temp: bad message
— stat config/storage/garbage: bad message
— stat config/storage/trash: bad message
It appears that other operators may be experiencing the same issue.
I’ll see if we can get someone to investigate.
I found something interesting, will create a new topic soon. (soon = today or tomorrow)
I have resigned myself to a rebuild; I have only had this node up a few weeks. I will rebuild the array and start from scratch so it’s clean. I tried it all but couldn’t figure it out. It seems to be some sort of issue with the db files being corrupt; the autoupdate could have killed it, not really sure. Only had 50 GB of the 14 TB populated.
Please check your databases and post the results here.
Thanks, I took a look at this; however, I was unable to figure out how to do this simply from the Linux command prompt. I guess next time I will have to go through it more thoroughly if this happens again after the rebuild.
I am having the exact same issue after this upgrade. It appears something went really wrong with it. Ugh, serves me right for having automatic upgrades turned on.
I began working through the linked article, but I have several corrupted databases. If I’m doing this for a dollar or two, I’m gonna quit and not return. If multiple people had this issue, someone needs to write a script to fix it.
Here is the output from the diagnostics test in step 5.
find . -iname "*.db" -maxdepth 1 -print0 -exec sqlite3 '{}' 'PRAGMA integrity_check;' ';'
./info.dbok
./bandwidth.dbok
./orders.dbok
./piece_expiration.db*** in database main ***
Main freelist: freelist leaf count too big on page 6
Main freelist: invalid page number 218103808
On tree page 2 cell 0: invalid page number 9
On tree page 2 cell 0: invalid page number 8
On tree page 3 cell 0: invalid page number 11
On tree page 3 cell 0: invalid page number 10
Page 5 is never used
Error: database disk image is malformed
./pieceinfo.dbok
./piece_spaced_used.dbok
./reputation.dbok
./storage_usage.dbok
./used_serial.dbok
./satellites.dbError: file is not a database
./notifications.dbok
./heldamount.dbok
./pricing.dbNULL value in pieceinfo_.piece_size
NULL value in pieceinfo_.order_limit
NULL value in pieceinfo_.uplink_piece_hash
NULL value in pieceinfo_.uplink_cert_id
NULL value in pieceinfo_.piece_creation
Error: database disk image is malformed
./orders.dbError: file is not a database
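Because of the `-print0` in the command from step 5, the filename and the check result run together in the output (e.g. `./info.dbok`). The same check can be written as a small loop that labels each result on its own line. This is just a sketch; it assumes the `sqlite3` CLI is installed and that you point it at your storage directory.

```shell
# check_dbs: run PRAGMA integrity_check on every .db file in a directory,
# printing "filename: first line of result" so corrupt databases stand out.
# Assumes the sqlite3 CLI is installed.
check_dbs() {
    for db in "$1"/*.db; do
        [ -e "$db" ] || continue   # directory contains no .db files
        printf '%s: %s\n' "$db" "$(sqlite3 "$db" 'PRAGMA integrity_check;' 2>&1 | head -n 1)"
    done
}

# Example (use your own path):
# check_dbs /mnt/storj/storagenode/storage
```

A healthy database prints `ok`; a damaged one prints the first error line, as in the output above.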
Thanks for the help!
Yes, I had auto update as well. It must have blown it up.
I did the full rebuild, but now I get the following error:
Error: Error starting master database on storagenode: group:
— stat config/storage/blobs: bad message
— stat config/storage/temp: bad message
— stat config/storage/garbage: bad message
— stat config/storage/trash: bad message
If by “full rebuild” you mean that you have removed the data and identity, then you should remove everything from the data location before starting a new node with a new identity.
If you didn’t touch the customers’ data and identity, then you should run fsck on that disk and fix any errors first.
It would be helpful if you could describe the steps included in your “full rebuild”.
Hello @ThePiGuy,
Welcome to the forum!
How is your disk connected?
I would recommend checking your disk with fsck for other errors first. Stop and remove the storagenode container before the check and unmount the disk. Run fsck on it with error correction, then mount the disk back.
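As a concrete sequence, the check might look like the sketch below. The device name `/dev/sdb1` and the mount point `/mnt/storj` are assumptions; substitute your own, and never run fsck on a mounted filesystem.

```shell
# Stop and remove the container first (a long timeout lets the node shut down cleanly)
docker stop -t 300 storagenode
docker rm storagenode

# Unmount the data disk -- fsck must not run on a mounted filesystem
sudo umount /mnt/storj          # assumed mount point

# Force a check and fix errors (assumed device name -- check with lsblk first)
sudo fsck -f /dev/sdb1

# Mount the disk back before restarting the node
sudo mount /mnt/storj
```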
To re-create several databases:
- Stop the storagenode.
- Remove piece_expiration.db, satellites.db, pricing.db, and orders.db from the data location.
- Move the remaining databases to another folder, backup for example (please use your own paths!):
mkdir /mnt/storj/storagenode/backup
mv /mnt/storj/storagenode/storage/*.db /mnt/storj/storagenode/backup/
- Move the config file to the backup too (please correct the path!):
mv /mnt/storj/storagenode/config.yaml /mnt/storj/storagenode/backup/
- Execute this command (please correct the path! Please note, there is only one mount and no other parameters; this is important):
docker pull storjlabs/storagenode:latest
docker run -it --rm -v /mnt/storj/storagenode:/app/config storjlabs/storagenode:latest
It will throw an error and exit. This is expected.
- Move all saved databases and the config back from the backup (please correct the path!):
mv /mnt/storj/storagenode/backup/config.yaml /mnt/storj/storagenode/
mv /mnt/storj/storagenode/backup/*.db /mnt/storj/storagenode/storage/
- Start the storagenode.
- Check your logs.
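Put together, the steps above can be sketched as one script. The `DATA` path is the example from the post; adjust it to your own layout, and read each step before running anything.

```shell
#!/bin/sh
set -e
DATA=/mnt/storj/storagenode        # example path -- use your own!

# Stop and remove the running container
docker stop -t 300 storagenode && docker rm storagenode

# Remove the corrupted databases
rm "$DATA"/storage/piece_expiration.db \
   "$DATA"/storage/satellites.db \
   "$DATA"/storage/pricing.db \
   "$DATA"/storage/orders.db

# Move the healthy databases and the config out of the way
mkdir -p "$DATA"/backup
mv "$DATA"/storage/*.db "$DATA"/backup/
mv "$DATA"/config.yaml "$DATA"/backup/

# Let the node recreate a fresh set of databases;
# it will throw an error and exit, which is expected
docker pull storjlabs/storagenode:latest
docker run -it --rm -v "$DATA":/app/config storjlabs/storagenode:latest || true

# Restore the saved databases and config over the fresh ones
mv "$DATA"/backup/config.yaml "$DATA"/
mv "$DATA"/backup/*.db "$DATA"/storage/
```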
Also, I would like to ask you to update your docker run command to the latest version: https://documentation.storj.io/setup/cli/storage-node#running-the-storage-node; it seems yours doesn’t have a timeout included.
The same goes for the watchtower:
- Stop and remove the watchtower
docker stop watchtower
docker rm --force $(docker ps -aqf ancestor=storjlabs/watchtower)
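After removing the old container, a fresh watchtower can be started roughly like this, per the Storj documentation of the time (verify the exact flags against the current docs before relying on it):

```shell
# Start watchtower to auto-update the storagenode container;
# the long stop timeout gives the node time to shut down cleanly
docker run -d --restart=always --name watchtower \
  -v /var/run/docker.sock:/var/run/docker.sock \
  storjlabs/watchtower storagenode watchtower --stop-timeout 300s
```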
Sure:
- Stopped the node
- Wiped and rebuilt the array
- Triple-checked it for errors
- Restarted the docker node
All seems OK so far.
I am sure my node has lost some credibility; however, it had only been live a few weeks, so it was still in the validation phase. What I have done differently this time is leave autoupdate off.
Such errors suggest checking the disk, because your filesystem is corrupted.
What type of RAID is it (how did you set it up)?
What is the filesystem?
Is it a network-attached drive?
Yes, a disk in the array had failed, which significantly impacted read times and caused corruption. A full rebuild was necessary, but since this happened at a very similar time to the autoupdate, I turned that off after rebuilding. It’s an older PC, so it may be a coincidence; however, in my experience that is a very rare thing. Thanks to everyone who spent time on this.
You can take a look at this thread for why we do not recommend using RAID with today’s disks:
Hmm, looks like the node has now been disqualified after the rebuild; I just got a mail. What are the next steps? Do I need to go through the validation process again and get a new ID? Thanks
Of course. If you deleted customers’ data, your node will be disqualified.
You should create a new identity, receive a new authorization token, sign the identity, and start with clean storage.
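The identity steps look roughly like this with the `identity` CLI from the Storj documentation; `user@example.com:token` is a placeholder for the real authorization token you receive:

```shell
# Create a new node identity (this can take many hours of CPU time)
identity create storagenode

# Sign the identity with your new authorization token (placeholder shown)
identity authorize storagenode user@example.com:token
```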
Got the same error:
Error: rpc: dial tcp 127.0.0.1:7778: connect: connection refused
Just one node here. I don’t know what to do.