Hi, I woke up this morning to this error. The docker container was in a continuous reboot loop and I see this in the log. Can anyone point me in the right direction here? Thanks
Error: Error starting master database on storagenode: database: file is not a database
storj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:350
storj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:336
storj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:313
storj.io/storj/storagenode/storagenodedb.Open:245
main.cmdRun:152
storj.io/private/process.cleanup.func1.4:362
storj.io/private/process.cleanup.func1:380
github.com/spf13/cobra.(*Command).execute:842
github.com/spf13/cobra.(*Command).ExecuteC:950
github.com/spf13/cobra.(*Command).Execute:887
storj.io/private/process.ExecWithCustomConfig:88
storj.io/private/process.ExecCustomDebug:70
main.main:336
runtime.main:204
After a reboot I now get this:
Error: Error starting master database on storagenode: group:
— stat config/storage/blobs: bad message
— stat config/storage/temp: bad message
— stat config/storage/garbage: bad message
— stat config/storage/trash: bad message
It appears that other operators may be experiencing the same issue.
I’ll see if we can get someone to investigate.
I found something interesting, will create a new topic soon. (soon = today or tomorrow)
I have resigned myself to a rebuild; I have only had this node up a few weeks. I will rebuild the array and start from scratch so it’s clean. I tried it all but couldn’t figure it out. It seems to be some sort of issue with the db files being corrupt; the autoupdate could have killed it, not really sure. Only had 50 GB of the 14 TB populated.
Please check your databases and post the results here.
Thanks, I took a look at this; however, I was unable to figure out how to do this simply from the Linux command prompt. I guess next time I will have to go through it more thoroughly if this happens again after the rebuild.
I am having the exact same issue after this upgrade. It appears something went really wrong with it. Ugh, serves me right for having automatic upgrades turned on.
I began working through the linked article, but I have several corrupted databases. If I’m doing this for a dollar or two, I’m gonna quit and not return. If multiple people had this issue, someone needs to write a script to fix it.
Here is the output from the diagnostics test in step 5.
find . -iname "*.db" -maxdepth 1 -print0 -exec sqlite3 '{}' 'PRAGMA integrity_check;' ';'
./info.dbok
./bandwidth.dbok
./orders.dbok
./piece_expiration.db*** in database main ***
Main freelist: freelist leaf count too big on page 6
Main freelist: invalid page number 218103808
On tree page 2 cell 0: invalid page number 9
On tree page 2 cell 0: invalid page number 8
On tree page 3 cell 0: invalid page number 11
On tree page 3 cell 0: invalid page number 10
Page 5 is never used
Error: database disk image is malformed
./pieceinfo.dbok
./piece_spaced_used.dbok
./reputation.dbok
./storage_usage.dbok
./used_serial.dbok
./satellites.dbError: file is not a database
./notifications.dbok
./heldamount.dbok
./pricing.dbNULL value in pieceinfo_.piece_size
NULL value in pieceinfo_.order_limit
NULL value in pieceinfo_.uplink_piece_hash
NULL value in pieceinfo_.uplink_cert_id
NULL value in pieceinfo_.piece_creation
Error: database disk image is malformed
./orders.dbError: file is not a database
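Because of the `-print0` in the command from step 5, the filename and the check result run together in the output (e.g. `./info.dbok`). The same check can be written as a small loop that labels each result on its own line. This is just a sketch; it assumes the `sqlite3` CLI is installed and that you point it at your storage directory.

```shell
# check_dbs: run PRAGMA integrity_check on every .db file in a directory,
# printing "filename: first line of result" so corrupt databases stand out.
# Assumes the sqlite3 CLI is installed.
check_dbs() {
    for db in "$1"/*.db; do
        [ -e "$db" ] || continue   # directory contains no .db files
        printf '%s: %s\n' "$db" "$(sqlite3 "$db" 'PRAGMA integrity_check;' 2>&1 | head -n 1)"
    done
}

# Example (use your own path):
# check_dbs /mnt/storj/storagenode/storage
```

A healthy database prints `ok`; a damaged one prints the first error line, as in the output above.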
Thanks for the help!
Yes, I had auto update as well. It must have blown it up.
I did the full rebuild, but now I get the following error:
Error: Error starting master database on storagenode: group:
— stat config/storage/blobs: bad message
— stat config/storage/temp: bad message
— stat config/storage/garbage: bad message
— stat config/storage/trash: bad message
If by “full rebuild” you mean that you have removed the data and identity, then you should remove everything from the data location before starting a new node with a new identity.
If you didn’t touch the customers’ data and identity, then you should run fsck on that disk and fix any errors first.
It would be helpful if you could describe the steps included in your “full rebuild”.
Hello @ThePiGuy,
Welcome to the forum!
How is your disk connected?
I would recommend checking your disk with fsck for other errors first. Stop and remove the storagenode container before the check and unmount the disk. Run fsck on it with error correction, then mount the disk back.
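As a concrete sequence, the check might look like the sketch below. The device name `/dev/sdb1` and the mount point `/mnt/storj` are assumptions; substitute your own, and never run fsck on a mounted filesystem.

```shell
# Stop and remove the container first (a long timeout lets the node shut down cleanly)
docker stop -t 300 storagenode
docker rm storagenode

# Unmount the data disk -- fsck must not run on a mounted filesystem
sudo umount /mnt/storj          # assumed mount point

# Force a check and fix errors (assumed device name -- check with lsblk first)
sudo fsck -f /dev/sdb1

# Mount the disk back before restarting the node
sudo mount /mnt/storj
```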
To re-create several databases:
- Stop the storagenode.
- Remove piece_expiration.db, satellites.db, pricing.db, and orders.db from the data location.
- Move the remaining databases to another folder, backup for example (please use your own paths!):
mkdir /mnt/storj/storagenode/backup
mv /mnt/storj/storagenode/storage/*.db /mnt/storj/storagenode/backup/
- Move the config file to the backup too (please correct the path!):
mv /mnt/storj/storagenode/config.yaml /mnt/storj/storagenode/backup/
- Execute this command (please correct the path! Please note, there is only one mount and no other parameters; this is important):
docker pull storjlabs/storagenode:latest
docker run -it --rm -v /mnt/storj/storagenode:/app/config storjlabs/storagenode:latest
It will throw an error and exit. This is expected.
- Move all saved databases and the config back from the backup (please correct the path!):
mv /mnt/storj/storagenode/backup/config.yaml /mnt/storj/storagenode/
mv /mnt/storj/storagenode/backup/*.db /mnt/storj/storagenode/storage/
- Start the storagenode.
- Check your logs.
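Put together, the steps above can be sketched as one script. The `DATA` path is the example from the post; adjust it to your own layout, and read each step before running anything.

```shell
#!/bin/sh
set -e
DATA=/mnt/storj/storagenode        # example path -- use your own!

# Stop and remove the running container
docker stop -t 300 storagenode && docker rm storagenode

# Remove the corrupted databases
rm "$DATA"/storage/piece_expiration.db \
   "$DATA"/storage/satellites.db \
   "$DATA"/storage/pricing.db \
   "$DATA"/storage/orders.db

# Move the healthy databases and the config out of the way
mkdir -p "$DATA"/backup
mv "$DATA"/storage/*.db "$DATA"/backup/
mv "$DATA"/config.yaml "$DATA"/backup/

# Let the node recreate a fresh set of databases;
# it will throw an error and exit, which is expected
docker pull storjlabs/storagenode:latest
docker run -it --rm -v "$DATA":/app/config storjlabs/storagenode:latest || true

# Restore the saved databases and config over the fresh ones
mv "$DATA"/backup/config.yaml "$DATA"/
mv "$DATA"/backup/*.db "$DATA"/storage/
```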
Also, I would like to ask you to update your docker run command to the latest version: https://documentation.storj.io/setup/cli/storage-node#running-the-storage-node; it seems yours doesn’t have a timeout included.
The same goes for the watchtower:
- Stop and remove the watchtower
docker stop watchtower
docker rm --force $(docker ps -aqf ancestor=storjlabs/watchtower)
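After removing the old container, a fresh watchtower can be started roughly like this, per the Storj documentation of the time (verify the exact flags against the current docs before relying on it):

```shell
# Start watchtower to auto-update the storagenode container;
# the long stop timeout gives the node time to shut down cleanly
docker run -d --restart=always --name watchtower \
  -v /var/run/docker.sock:/var/run/docker.sock \
  storjlabs/watchtower storagenode watchtower --stop-timeout 300s
```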
Sure:
- Stopped the node
- Wiped and rebuilt the array
- Triple-checked it for errors
- Restarted the docker node
All seems OK so far.
I am sure my node has lost some credibility; however, it had only been live a few weeks, so it was still in the validation phase. What I have done differently this time is leave autoupdate off.
Such errors suggest checking the disk, because your filesystem is corrupted.
What type of RAID is it (how did you set it up)?
What is the filesystem?
Is it a network-attached drive?
Yes, a disk in the array had failed, which significantly impacted read times and caused corruption. A full rebuild was necessary, but since this happened at a very similar time to the autoupdate, I turned that off after rebuilding. It’s an older PC, so it may be a coincidence; however, in my experience that is a very rare thing. Thanks to everyone who spent time on this.
You can take a look at this thread for why we do not recommend using RAID with today’s disks:
Hmm, looks like the node has now been disqualified after the rebuild; I just got a mail. What are the next steps? Do I need to go through the validation process again and get a new ID? Thanks
Of course. If you deleted customers’ data, your node will be disqualified.
You should create a new identity, receive a new authorization token, sign the identity, and start with clean storage.
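The identity steps look roughly like this with the `identity` CLI from the Storj documentation; `user@example.com:token` is a placeholder for the real authorization token you receive:

```shell
# Create a new node identity (this can take many hours of CPU time)
identity create storagenode

# Sign the identity with your new authorization token (placeholder shown)
identity authorize storagenode user@example.com:token
```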
Got the same error:
Error: rpc: dial tcp 127.0.0.1:7778: connect: connection refused
Just one node here. I don’t know what to do.