1 of 3 nodes went offline and refuses to come back

arjones85 · November 14, 2023, 9:41pm

Keep getting the following error when starting the docker node:

2023-11-14T21:35:52Z INFO Configuration loaded {“process”: “storagenode”, “Location”: “/app/config/config.yaml”}
2023-11-14T21:35:52Z INFO Anonymized tracing enabled {“process”: “storagenode”}
2023-11-14T21:35:52Z INFO Operator email {“process”: “storagenode”, “Address”: “”}
2023-11-14T21:35:52Z INFO Operator wallet {“process”: “storagenode”, “Address”: “0xEDA46769ac169097A0A90F355a92BB4098B04334”}
2023-11-14T21:35:52Z INFO server kernel support for server-side tcp fast open remains disabled. {“process”: “storagenode”}
2023-11-14T21:35:52Z INFO server enable with: sysctl -w net.ipv4.tcp_fastopen=3 {“process”: “storagenode”}
2023-11-14T21:35:52Z INFO Telemetry enabled {“process”: “storagenode”, “instance ID”: “12h7iT8pxEJiQkkTRaQFvUomifVD7mRWosYuNwV4TkB7d4JZvAr”}
2023-11-14T21:35:52Z INFO Event collection enabled {“process”: “storagenode”, “instance ID”: “12h7iT8pxEJiQkkTRaQFvUomifVD7mRWosYuNwV4TkB7d4JZvAr”}
2023-11-14T21:35:52Z FATAL Unrecoverable error {“process”: “storagenode”, “error”: “Error creating tables for master database on storagenode: migrate: database: info opening file "config/storage/info.db" failed: unable to open database file: no such file or directory\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:347\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Migration.func1:657\n\tstorj.io/storj/private/migrate.(*Migration).Run:164\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:370\n\tmain.cmdRun:95\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:267”, “errorVerbose”: “Error creating tables for master database on storagenode: migrate: database: info opening file "config/storage/info.db" failed: unable to open database file: no such file or directory\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:347\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Migration.func1:657\n\tstorj.io/storj/private/migrate.(*Migration).Run:164\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:370\n\tmain.cmdRun:95\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:267\n\tmain.cmdRun:97\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:267”}

All 3 nodes I am running are identical configuration wise aside from ports, and their mount points. The mount point of this node didn’t experience any errors, and I am able to read and write from the drive without an issue.

The file it’s complaining about, info.db, is there:

[root@jonesmedia storage]# pwd
/mnt/StorjData/storage/storage
[root@jonesmedia storage]# ls -al info.db
-rwxr-xr-x 1 root root 16384 Nov 14 01:17 info.db

Anyone else have this issue before?

JWvdV · November 14, 2023, 10:11pm

Are you sure, your drives are connected and accessible for the docker process?
It seems like your files aren’t there.

Do you run your docker as root (assuming you run docker)?
And are you really sure, your mount points you’re able to access are the same mount points the storagenode is using? So, are you using the right arguments to the docker process (assung you run docker)?
What file system are you using? And in what way are they connected (USB, SATA, NVMe…?).

Perhaps you should show the script you’re using.

Further, you could search for “Error creating tables for master database on storagenode” in the forum. There are plenty topics like this one.

arjones85 · November 14, 2023, 11:46pm

Thanks!

Yes, the mount point is accessible. That’s why this is so damn confusing to me. Here’s my compose file:

[root@jonesmedia compose]# cat storj.yaml

version: “2.1”
services:
storagenode:
image: storjlabs/storagenode
container_name: storagenode
environment:
- WALLET=
- EMAIL=
- ADDRESS=
- STORAGE=3.95TB
- SETUP=false
volumes:
- /mnt/StorjData/identity:/app/identity
- /mnt/StorjData/storage:/app/config
- /var/log/storj:/log
ports:
- 8383:14002
- 28967:28967/tcp
- 28967:28967/udp
restart: unless-stopped

[root@jonesmedia compose]# ll /mnt/StorjData/storage
total 48
-rwxr-xr-x 1 root root 8094 Oct 9 17:15 config.yaml
drwx------ 4 root root 47 Sep 14 2020 orders
-rwxr-xr-x 1 root root 32768 Nov 14 15:35 revocations.db
drwxr-xr-x 6 root root 4096 Nov 14 15:35 storage
-rw------- 1 root root 2289 Nov 14 01:12 trust-cache.json

Yes, I am running it as root, and the filesystem is XFS.

I’ll do some more searching on the forum. This issue makes no damn sense. Nothing changed before it went down last night. It just… went down and now refuses to start back up saying it can’t find the file.

arjones85 · November 14, 2023, 11:48pm

Matter of fact, I managed to get into the container before it terminated. It can reach the mount and the data…

root@20a59c0b9141:/app# c[root@jonesmedia compose]# docker exec -it storagenode bash
root@20a59c0b9141:/app# cd config
root@20a59c0b9141:/app/config# ls -al
total 52
drwxr-xr-x 4 root root 124 Nov 14 21:42 .
drwxrwxrwx 1 root root 4096 Nov 14 21:35 …
-rwxr-xr-x 1 root root 8094 Oct 9 22:15 config.yaml
drwx------ 4 root root 47 Sep 14 2020 orders
-rwxr-xr-x 1 root root 32768 Nov 14 23:47 revocations.db
drwxr-xr-x 6 root root 4096 Nov 14 23:47 storage
-rw------- 1 root root 2289 Nov 14 07:12 trust-cache.json
root@20a59c0b9141:/app/config# cd storage
root@20a59c0b9141:/app/config/storage# ls -al
total 678384
drwxr-xr-x 6 root root 4096 Nov 14 23:47 .
drwxr-xr-x 4 root root 124 Nov 14 21:42 …
-rwxr-xr-x 1 root root 76861440 Nov 14 07:17 bandwidth.db
drwxr-xr-x 9 root root 4096 Jan 7 2021 blobs
drwx------ 2 root root 10 Nov 14 07:12 garbage
-rwxr-xr-x 1 root root 135168 Nov 14 07:17 heldamount.db
-rwxr-xr-x 1 root root 16384 Nov 14 07:17 info.db
-rwxr-xr-x 1 root root 24576 Nov 14 07:17 notifications.db
-rwxr-xr-x 1 root root 561786880 Nov 14 07:17 orders.db
-rwxr-xr-x 1 root root 34365440 Nov 14 07:17 piece_expiration.db
-rwxr-xr-x 1 root root 24576 Nov 14 07:17 piece_spaced_used.db
-rwxr-xr-x 1 root root 24576 Nov 14 07:17 pieceinfo.db
-rwxr-xr-x 1 root root 24576 Nov 14 07:17 pricing.db
-rwxr-xr-x 1 root root 36864 Nov 14 07:17 reputation.db
-rwxr-xr-x 1 root root 32768 Nov 14 07:17 satellites.db
-rw-r–r-- 1 root root 24576 Nov 14 07:17 secret.db
-rw-r–r-- 1 root root 32 Dec 9 2020 storage-dir-verification
-rwxr-xr-x 1 root root 1662976 Nov 14 07:17 storage_usage.db
drwxr-xr-x 2 root root 12288 Nov 14 07:13 temp
drwxr-xr-x 8 root root 4096 Jan 12 2021 trash
-rwxr-xr-x 1 root root 19595264 Nov 14 07:17 used_serial.db
root@20a59c0b9141:/app/config/storage#

Alexey · November 15, 2023, 3:41am

Hello @arjones85,
Welcome to the forum!

Please try to check your filesystem on errors and fix them.
Then check databases:

You may also try to recreate this database file:

I would also suggest to check, does it created a volume instead of binding?

docker volume ls

and configure your docker-compose.yaml to use binding to avoid creation of volumes if the source is not available:

    volumes:
      - type: bind
        source: /mnt/StorjData/identity
        target: /app/identity
      - type: bind
        source: /mnt/StorjData/storage
        target: /app/config

arjones85 · November 15, 2023, 5:24am

Thanks Alexey!

The filesystem/mount is clean. I am able to read/write to it without an issue, and it’s XFS so there’s not really a check you can run on it.

I made the requested change in the compose file but it didn’t make a difference. There is one change though, it’s complaining about the secrets.db file not being there now. Same “File not found” error.

However, if I enter the container, I can copy the same dang file it’s saying isn’t there!

[root@jonesmedia ~]# docker exec -it storagenode bash
root@82df72f8ccd5:/app# cp -v /app/config/storage/secret.db /tmp
‘/app/config/storage/secret.db’ → '/tmp/secret.db

Why the heck would it say the file isn’t there, if I can even copy it? This makes no sense to me. To reiterate, I have two other storj nodes running on this same physical hardware, same exact config and compose files except for the mountpoints, working just fine.

2023-11-15T05:21:49Z FATAL Unrecoverable error {“process”: “storagenode”, “error”: "Error starting master database on storagenode: database: secret opening file "config/storage/secret.db" failed: unable to open database file: no such file or directory

Eugene · November 15, 2023, 6:02am

I got something similar yesterday with one of my nodes. All of a sudden the node went offline on UptimeRobot.
I checked the log and it was complaining about not being able to open revocations.db
Looks like my mount point became somehow read-only. I stopped the node, unmounted the drive, mounted it back with rw permissions and started the node. Node started normally.
Hope it helps.

Alexey · November 15, 2023, 8:29am

Actually you can check and fix the xfs filesystem:

Is it possible that you checks a wrong logfile, if you redirected it to the same folder as two other nodes?

arjones85 · November 15, 2023, 3:00pm

Thank you! I didn’t realize xfs_repair would actually check things. I have it running now. I don’t expect it to find anything wrong as I haven’t seen anything in the syslogs indicated a filesystem issue, and I can access the files inside the container, but I’ll report back once it’s done.

I could have been more clear about the config, sorry about that. The logs are writing to their own separate folders, so I am accessing the correct ones.

Are there any debug flags I can enable to be able to see the full path the process is trying to find the files with?

arjones85 · November 15, 2023, 7:33pm

xfs_repair finished and didn’t indicate any errors:

[root@jonesmedia storj]# xfs_repair /dev/sdd
Phase 1 - find and verify superblock…
Phase 2 - using internal log
- zero log…
- scan filesystem freespace and inode maps…
- found root inode chunk
Phase 3 - for each AG…
- scan and clear agi unlinked lists…
- process known inodes and perform inode discovery…
- agno = 0

^A - agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes…
Phase 4 - check for duplicate blocks…
- setting up duplicate extent list…
- check for inodes claiming duplicate blocks…
- agno = 0
- agno = 1
- agno = 2
- agno = 3
Phase 5 - rebuild AG headers and trees…
- reset superblock…
Phase 6 - check inode connectivity…
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem …
- traversal finished …
- moving disconnected inodes to lost+found …
Phase 7 - verify and correct link counts…
done
[root@jonesmedia storj]#

However, starting the node… and it stays up now and can apparently see all the database files now!

What in the world? The file it was complaining about not being able to see, that I could access just fine inside the container manually, apparently can be found now after xfs_repair finished with no errors being found by the utility.

… I feel like I’m in crazy land over here.

daki82 · November 16, 2023, 1:32am

maybe it got locked somehow, and unlocked in the process/time?
aka “have you tried to turn it of and on again?”

Eugene · November 16, 2023, 2:35am

Most likely xfs_repair unmounts file system before repair and re-mounts it after repair.
Same steps (unmount/re-mount) helped me to bring my node online.

JWvdV · November 16, 2023, 5:28am

Well, there’s also a difference between the message complaining about a certain error and the real error.

I would look in lost+found first, before saying there was no error. My experience with XFS, is that it’s a great file system. But not so verbose.

Eugene · November 17, 2023, 9:02am

In my case the same node went offline again.
Here is the error I got in the log.

2023-11-17T08:43:41Z FATAL Unrecoverable error {“process”: “storagenode”, “error”: “Error creating revocation database: revocation database: boltdb: open config/revocations.db: read-only file system\n\tstorj.io/storj/private/kvstore/boltdb.New:42\n\tstorj.io/storj/private/revocation.openDBBolt:52\n\tstorj.io/storj/private/revocation.OpenDB:35\n\tstorj.io/storj/private/revocation.OpenDBFromCfg:23\n\tmain.cmdRun:71\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:267”, “errorVerbose”: “Error creating revocation database: revocation database: boltdb: open config/revocations.db: read-only file system\n\tstorj.io/storj/private/kvstore/boltdb.New:42\n\tstorj.io/storj/private/revocation.openDBBolt:52\n\tstorj.io/storj/private/revocation.OpenDB:35\n\tstorj.io/storj/private/revocation.OpenDBFromCfg:23\n\tmain.cmdRun:71\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:267\n\tmain.cmdRun:73\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:267”}

Checked the file systems mounted read-only

$ grep “[[:space:]]ro[[:space:],]” /proc/mounts
/dev/sdb1 /media/storj/storj02 ext4 ro,nosuid,nodev,relatime 0 0

stopped the node
$ sudo umount /media/storj/storj02

scanned filesystem, found some errors
$ sudo fsck.ext4 -p /dev/sdb1
B509 contains a file system with errors, check forced.
B509: Inode 131747, i_blocks is 10256, should be 10232. FIXED.
B509: Inode 131762, i_blocks is 1728, should be 1720. FIXED.
B509: 6498271/366215168 files (1.6% non-contiguous), 2689840688/2929720832 blocks

mounted filesystem back
$ sudo mount -a

started the node, so far so good.

What can cause rw filesystem to re-mount on its own as ro?

Thanks.

Alexey · November 18, 2023, 3:21am

Eugene · November 19, 2023, 8:20pm

Thanks, Alexey!

Will be digging through dmesg.