Unrecoverable error {"error": "Error starting master database on storagenode: database: file is not a database\n\tstorj.io/storj/storagenode/storagenodedb

Hello

I just came back from 2 weeks vacation and found out that my 2nd node is not online. In the log I see this error:

2021-08-01T10:26:34.533Z FATAL Unrecoverable error {“error”: “Error starting master database on storagenode: database: file is not a database\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:323\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:305\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:281\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:248\n\tmain.cmdRun:160\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:385\n\truntime.main:204”, “errorVerbose”: “Error starting master database on storagenode: database: file is not a database\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:323\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:305\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:281\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:248\n\tmain.cmdRun:160\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:385\n\truntime.main:204\n\tmain.cmdRun:162\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:385\n\truntime.main:204”}

Can anyone please help me resolve this?

Thank you in advance!

hope you had a nice time :slight_smile:

Sounds like it can’t read / access / validate some of the node Database files, so as a starter.

1 - run a file system check on the nodes data folder, including the databases;

Powershell - Repair-Volume -DriveLetter XXXXXX -OfflineScanAndFix

Linux - fsck <mount point or drive>

2 - try this How to fix a "database disk image is malformed" – Storj

I’m agree with @CutieePie - you need to check your disk for errors first and fix them.

Unfortunately this article will not help to solve the problem when the database file is heavily corrupted (“not a database”).

@zeroheat Please, use this article to recover your storagenode: How to fix database: file is not a database error – Storj

Thank you guys, I have started an fsck on the disk and let’s see what it shows.

Hello
Here is the output:

[root@localhost ~]# fsck /mnt/storenode1
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/sdb2: clean, 1302969/39387136 files, 79600732/157543430 blocks
[root@localhost ~]#

It completes in seconds. It does not provide alot of feedback. Not very user friendly. Have I executed it the right way?

You should unmount the disk first, thus need to stop the storagenode and unmount the disk, then check.
If all clean - mount it back and continue with database recover.

I did that and I pasted the output. The command doesn’t run at all if the disk is mounted.

As for the DBs, can you tell me which should I check? I mean are they all located under the same folder? For example,

Then you should specify the device, not the mountpoint.
If you unmounted disk and checked the mountpoint - it’s actually checked the root filesystem, not the disk.
All databases need to be checked, to figure out which are corrupted. You can use scripts from How to fix a "database disk image is malformed" – Storj to check them all.

Here is what i got:

[root@localhost storage]# docker run --rm -it --mount type=bind,source=/mnt/storenode1/storage,destination=/data sstc/sqlite3 find . -iname “*.db” -maxdepth 1 -print0 -exec sqlite3 ‘{}’ ‘PRAGMA integrity_check;’ ‘;’
Unable to find image ‘sstc/sqlite3:latest’ locally
latest: Pulling from sstc/sqlite3
9aae54b2144e: Pull complete
5d1b1e1c175c: Pull complete
839b0660aa01: Pull complete
Digest: sha256:c94be9deb995da9b0bc8f5fe9aa6c023bf75e510a30ae20be84b804f0d79638d
Status: Downloaded newer image for sstc/sqlite3:latest
./piece_spaced_used.dbok
./notifications.dbok
./used_serial.dbok
./pieceinfo.dbok
find: ./notifications.db-shm: No such file or directory
./piece_expiration.dbok
./storage_usage.dbok
./secret.dbok
./satellites.dbok
find: ./notifications.db-wal: No such file or directory
./bandwidth.dbok
./notification.dbok
./orders.dbok
./info.dbok
./reputation.dbError: file is not a database
./heldamount.dbok
./pricing.dbok

So if I understand correctly the problematic ones are:

notifications.db
reputation.db

right?

No, only this one is corrupted:

The *.db-wal and *.db-shm should not be exists if storagenode stopped normally. You do not need to remove them, they will be processed automatically when the node would be started.

Can you help me here? Why doesn’t it run?

[root@localhost ~]# fsck sdb2
fsck from util-linux 2.23.2
Usage: fsck.ext4 [-panyrcdfvtDFV] [-b superblock] [-B blocksize]
[-I inode_buffer_blocks] [-P process_inode_size]
[-l|-L bad_blocks_file] [-C fd] [-j external_journal]
[-E extended-options] device

Emergency help:
-p Automatic repair (no questions)
-n Make no changes to the filesystem
-y Assume “yes” to all questions
-c Check for bad blocks and add them to the badblock list
-f Force checking even if filesystem is marked clean
-v Be verbose
-b superblock Use alternative superblock
-B blocksize Force blocksize when looking for superblock
-j external_journal Set location of the external journal
-l bad_blocks_file Add to badblocks list
-L bad_blocks_file Set badblocks list
[root@localhost ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 20G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 19G 0 part
├─centos-root 253:0 0 17G 0 lvm /
└─centos-swap 253:1 0 2G 0 lvm [SWAP]
sdb 8:16 0 698.7G 0 disk
├─sdb1 8:17 0 97.7G 0 part
└─sdb2 8:18 0 601G 0 part
sr0 11:0 1 1024M 0 rom

I think I managed to make it to give some output:

[root@localhost ~]# fsck -pvf /mnt/storenode1
fsck from util-linux 2.23.2

 1302966 inodes used (3.31%, out of 39387136)
   36104 non-contiguous files (2.8%)
    6528 non-contiguous directories (0.5%)
         # of inodes with ind/dind/tind blocks: 0/0/0
         Extent depth histogram: 1299293/3664
79600545 blocks used (50.53%, out of 157543430)
       0 bad blocks
       1 large file

 1290647 regular files
   12310 directories
       0 character device files
       0 block device files
       0 fifos
       0 links
       0 symbolic links (0 fast symbolic links)
       0 sockets

 1302957 files

So the prebolem should be with only this 1 DB right? reputation.db ?

Hi

I re - created the db and the errors changed to:

2021-08-02T12:32:38.465Z INFO Telemetry enabled {“instance ID”: “1fDbYw3z9DxdBGiASZqzukh6v9uYXqUVuKEz25MpPsW6ak8TVK”}
2021-08-02T12:32:38.492Z INFO db.migration Database Version {“version”: 52}
2021-08-02T12:32:40.154Z INFO preflight:localtime start checking local system clock with trusted satellites’ system clock.
2021-08-02T12:32:41.023Z INFO preflight:localtime local system clock is in sync with trusted satellites’ system clock.
2021-08-02T12:32:41.023Z INFO bandwidth Performing bandwidth usage rollups
2021-08-02T12:32:41.024Z INFO Node 1fDbYw3z9DxdBGiASZqzukh6v9uYXqUVuKEz25MpPsW6ak8TVK started
2021-08-02T12:32:41.024Z INFO Public server started on [::]:28967
2021-08-02T12:32:41.024Z INFO Private server started on 127.0.0.1:7778
2021-08-02T12:32:41.024Z INFO trust Scheduling next refresh {“after”: “5h9m45.051543818s”}
2021-08-02T12:32:41.026Z WARN piecestore:monitor Disk space is less than requested. Allocated space is {“bytes”: 286980734976}
2021-08-02T12:32:41.026Z ERROR piecestore:monitor Total disk space is less than required minimum {“bytes”: 500000000000}
2021-08-02T12:32:41.026Z ERROR services unexpected shutdown of a runner {“name”: “piecestore:monitor”, “error”: “piecestore monitor: disk space requirement not met”, “errorVerbose”: “piecestore monitor: disk space requirement not met\n\tstorj.io/storj/storagenode/monitor.(*Service).Run:123\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:41.027Z ERROR nodestats:cache Get pricing-model/join date failed {“error”: “context canceled”}
2021-08-02T12:32:41.027Z ERROR gracefulexit:blobscleaner couldn’t receive satellite’s GE status {“error”: “context canceled”}
2021-08-02T12:32:41.027Z ERROR gracefulexit:chore error retrieving satellites. {“error”: “satellitesdb: context canceled”, “errorVerbose”: “satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:100\n\tstorj.io/storj/storagenode/gracefulexit.(*service).ListPendingExits:89\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run.func1:53\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:50\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:41.027Z ERROR collector error during collecting pieces: {“error”: “pieceexpirationdb: context canceled”, “errorVerbose”: “pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:39\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpired:521\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:88\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:41.028Z ERROR pieces:trash emptying trash failed {“error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:41.028Z ERROR pieces:trash emptying trash failed {“error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:41.028Z ERROR pieces:trash emptying trash failed {“error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:41.028Z ERROR pieces:trash emptying trash failed {“error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:41.028Z ERROR pieces:trash emptying trash failed {“error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:41.028Z ERROR pieces:trash emptying trash failed {“error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:41.078Z ERROR piecestore:cache error getting current used space: {“error”: “context canceled; context canceled; context canceled; context canceled; context canceled; context canceled”, “errorVerbose”: “group:\n— context canceled\n— context canceled\n— context canceled\n— context canceled\n— context canceled\n— context canceled”}
2021-08-02T12:32:45.345Z FATAL Unrecoverable error {“error”: “piecestore monitor: disk space requirement not met”, “errorVerbose”: “piecestore monitor: disk space requirement not met\n\tstorj.io/storj/storagenode/monitor.(*Service).Run:123\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2021-08-02T12:32:52.457Z INFO Configuration loaded {“Location”: “/app/config/config.yaml”}

The node is still with status offline. Please help.

You need to specify the absolute path to the device, so it should looks like

fsck /dev/sdb2

Could you please show result of the command

df -HT

Here it is:

[root@localhost ~]# df -HT
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 498M 0 498M 0% /dev
tmpfs tmpfs 510M 0 510M 0% /dev/shm
tmpfs tmpfs 510M 8.2M 502M 2% /run
tmpfs tmpfs 510M 0 510M 0% /sys/fs/cgroup
/dev/mapper/centos-root xfs 19G 2.7G 16G 15% /
/dev/sda1 xfs 1.1G 176M 888M 17% /boot
/dev/sdb2 ext4 636G 316G 287G 53% /mnt/storenode1
overlay overlay 19G 2.7G 16G 15% /var/lib/docker/overlay2/126ac1c207d8c796216a5e5fde434b188b050c3787f1e5816911fe907221a068/merged
overlay overlay 19G 2.7G 16G 15% /var/lib/docker/overlay2/5f3d3e198b1aea125ba7e1ec926559136e7909e9a0e199ae93698bb4d205c6b8/merged
tmpfs tmpfs 102M 0 102M 0% /run/user/0

What do you have there?

ls -l /mnt/storenode1
ls -l /mnt/storenode1/storage

[root@localhost ~]# ls -l /mnt/storenode1
total 308
-rw------- 1 root root 8573 Aug 3 16:00 config.yaml
drwx------ 2 root root 4096 Feb 15 12:24 lost+found
-rw-r–r-- 1 root root 243777 Aug 3 16:20 node1.log
drwx------ 4 root root 4096 Jan 6 2021 orders
-rw------- 1 root root 32768 Aug 3 16:19 revocations.db
drwx------ 6 root root 4096 Aug 3 16:20 storage
-rwxr-xr-x 1 root root 9781 May 25 01:55 successrate.sh
-rw------- 1 root root 2730 Aug 3 16:19 trust-cache.json
[root@localhost ~]# ls -l /mnt/storenode1/storage
total 1872
-rw-r–r-- 1 root root 32768 Aug 3 16:20 bandwidth.db
drwx------ 8 root root 4096 Jan 7 2021 blobs
-rw-r–r-- 1 root root 743134 Aug 1 10:15 dump_all2.sql
-rw-r–r-- 1 root root 743107 Aug 1 10:16 dump_all_notrans2.sql
-rw-r–r-- 1 root root 674 Aug 1 10:11 dump_all_notrans.sql
-rw-r–r-- 1 root root 701 Aug 1 10:09 dump_all.sql
drwx------ 2 root root 4096 Jul 22 15:01 garbage
-rw-r–r-- 1 root root 32768 Aug 3 16:20 heldamount.db
-rw-r–r-- 1 root root 16384 Aug 3 16:20 info.db
-rw-r–r-- 1 root root 24576 Aug 3 16:20 notifications.db
-rw-r–r-- 1 root root 32768 Aug 3 16:20 orders.db
-rw-r–r-- 1 root root 36864 Aug 3 16:20 piece_expiration.db
-rw-r–r-- 1 root root 24576 Aug 3 16:20 pieceinfo.db
-rw-r–r-- 1 root root 24576 Aug 1 10:08 pieceinfo.db.bak
-rw-r–r-- 1 root root 24576 Aug 3 16:20 piece_spaced_used.db
-rw-r–r-- 1 root root 24576 Aug 3 16:20 pricing.db
-rw-r–r-- 1 root root 24576 Aug 3 16:20 reputation.db
-rw-r–r-- 1 root root 32768 Aug 3 16:20 satellites.db
-rw-r–r-- 1 root root 24576 Aug 3 16:20 secret.db
-rw-r–r-- 1 root root 32 Jan 7 2021 storage-dir-verification
-rw-r–r-- 1 root root 20480 Aug 3 16:20 storage_usage.db
drwx------ 2 root root 4096 Jul 22 15:02 temp
drwx------ 8 root root 4096 Jan 13 2021 trash
-rw-r–r-- 1 root root 20480 Aug 3 16:20 used_serial.db

Is there anything suspicious?

I’m guessing your node currently isn’t aware of how much data is already being used by the node. It will need to run the filewalker to find out again, but never gets the chance, because the free space on the disk is too low for the node to start to begin with. You can temporarily edit the config.yaml file to lower the minimum node size to 200 GB.

You would need to edit the following line:

# storage2.monitor.minimum-disk-space: 500.0 GB

to

storage2.monitor.minimum-disk-space: 200.0 GB

Note the removed # at the start as well.

The node should start after that and you can finish the db repair steps if you were still working on that. After your node has been running successfully again for a bit (Give the file walker time to finish. A day should be more than enough to be sure.), you can change the setting back and restart the node. It should then no longer complain about the minimum size. Though it isn’t really necessary to change the setting back, I prefer to run on as few custom settings as possible usually.

1 Like

It managed to start but something is terribly wrong:


image

It is normal for the “Online” status. However, I had over 200GB of used space on that node. 200GB data. why does it show only lessthan 100MB now?

It seems it started now as a totally new node. I will stop it until we get to the bottom of this.