Filewalker DB suddenly locked - node stopped working

Hi,
My node suddenly stopped working (no changes to my setup for several months).
I have not been able to recover the node by restarting or reconfiguring it.

My node has also been a bit unstable for a long time. Quite often QUIC shows as Misconfigured even though nothing has changed, but it recovers after a while. So my online statistics are worse than they should be (only a few short maintenance breaks within the last couple of months).

2024-04-17T05:00:59Z ERROR failure during run {"Process": "storagenode", "error": "Error opening database on storagenode: database: garbage_collection_filewalker_progress opening file \"config/storage/garbage_collection_filewalker_progress.db\" failed: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:364\n

Error: Error opening database on storagenode: database: garbage_collection_filewalker_progress opening file "config/storage/garbage_collection_filewalker_progress.db" failed: database is locked

2024-04-17 05:00:59,906 INFO exited: storagenode (exit status 1; not expected)

Any ideas how to get this node working?

Welcome to the forum @JeanS!

Can you move your dbs to an SSD?
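
For example, something along these lines (just a sketch - the SSD path, the /app/dbs mount and the storage2.database-dir value are placeholders, so double-check them against the documentation on moving the databases before you rely on them):

sudo docker stop -t 300 storagenode
mkdir -p /mnt/ssd/storj-dbs
cp /media/storj/TrueNAS/StorJ_Data/storage/*.db /mnt/ssd/storj-dbs/
# then add this mount to your docker run command:
#   --mount type=bind,source="/mnt/ssd/storj-dbs",destination=/app/dbs
# and set this in config.yaml:
#   storage2.database-dir: /app/dbs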

The node itself is running on Proxmox → Ubuntu Desktop 22.04, which sits on an NVMe drive.
The data is stored on TrueNAS, which uses normal 3.5" HDDs.

This is the command I am using to start the node.
sudo docker run -d --restart unless-stopped --stop-timeout 300 \
-p 28967:28967/tcp \
-p 28967:28967/udp \
-p 127.0.0.1:14002:14002 \
-e WALLET="0x83b624C08xxxxxxxxx1b55848" \
-e EMAIL="xxx@gmail.com" \
-e ADDRESS="js-storj.ddns.net:28967" \
-e STORAGE="6TB" \
--user $(id -u):$(id -g) \
--mount type=bind,source="/home/storj/storagenode",destination=/app/identity \
--mount type=bind,source="/media/storj/TrueNAS/StorJ_Data",destination=/app/config \
--name storagenode storjlabs/storagenode:latest

How is the 3.5" drive physically connected?

Might you have an unnoticed "fatal" timeout error?

Both StorJ (Ubuntu) and TrueNAS Scale are running as VMs on the same Proxmox server.
Storj data is accessed via an SMB share on this TrueNAS Scale VM. The TrueNAS storage consists of 10 × 3.5" HDDs (RAIDZ).
Since both VMs are physically in the same server, data is accessed just via the Proxmox virtual bridge.

Read and write speeds should be better than a directly attached hard drive, but latency may be higher → could that cause the issue?

To add: after the restart I can no longer access the StorJ GUI, so the node is not working at all.

The node is in a constant stop & start loop:
2024-04-17 12:53:54,489 INFO success: storagenode entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

2024-04-17 12:53:54,489 INFO success: storagenode-updater entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

2024-04-17T12:54:03Z ERROR failure during run {"Process": "storagenode", "error": "Error opening database on storagenode: database: garbage_collection_filewalker_progress opening file \"config/storage/garbage_collection_filewalker_progress.db\" failed: database is locked\n\tstorj.io/

2024-04-17 12:54:03,401 INFO exited: storagenode (exit status 1; not expected)

2024-04-17 12:54:04,405 INFO spawned: 'storagenode' with pid 40

2024-04-17 12:54:04,405 WARN received SIGQUIT indicating exit request

2024-04-17 12:54:04,405 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die

2024-04-17T12:54:04Z INFO Got a signal from the OS: "terminated" {"Process": "storagenode-updater"}

2024-04-17 12:54:04,408 INFO stopped: storagenode-updater (exit status 0)

2024-04-17 12:54:04,409 INFO stopped: storagenode (terminated by SIGTERM)

2024-04-17 12:54:04,410 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

2024-04-17 12:54:05,180 INFO Set uid to user 0 succeeded

2024-04-17 12:54:05,188 INFO RPC interface 'supervisor' initialized

=> then it starts again

I renamed garbage_collection_filewalker_progress.db, and now the node cannot even read identity.cert:

2024-04-17T13:18:43Z FATAL Error loading identity. {"Process": "storagenode-updater", "error": "file or directory not found: open identity/identity.cert: permission denied",

I think something is corrupted / messed up in my Ubuntu or Docker setup. I'll try a complete reinstallation: delete the current VM, then a fresh Ubuntu installation, etc. But maybe it is too late - the node has now been offline for 12+ hours.

This is your problem. You can’t use Storj with data over SMB/NFS etc. because the SQLite database locks don’t work reliably over network protocols and this could lead to database corruption. Why not run Storj on the TrueNAS Scale VM?

It is not too late. You can always delete all databases; if all databases are missing, the node creates new ones. You will lose the history in the dashboard, but everything else will be restored.
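
Roughly like this, using the paths from your docker run above (a sketch only - double-check the path, and consider moving the files aside instead of deleting them outright):

sudo docker stop -t 300 storagenode
sudo docker rm storagenode
mkdir -p ~/db-backup
mv /media/storj/TrueNAS/StorJ_Data/storage/*.db ~/db-backup/
# then start the node again with your usual docker run command;
# it will create fresh, empty databases on startup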

As others have said, SMB at the Linux / Windows level isn't supported with SQLite…

I don't understand why you have TrueNAS on Proxmox - there are many issues with how the disks are passed through: they have to be raw devices, which can't be done in the GUI. It's like you want to build a slow disk system… Also, Proxmox has native ZFS, so you could have built the pool directly in Proxmox and skipped TrueNAS :confused:

However, you have the ZFS pool in TrueNAS, so a better option is to export access to the pool directly into Proxmox, then create the virtual disks natively.

In Proxmox, under the Datacentre view for Storage, add an external provider… you can add SMB pointing to your TrueNAS share and mark it for VM disk storage. Then, on the Storj machine, you can create a new vdisk and locate it on the share.

In the Storj Linux OS, if you do an lsblk you will see the new disk. Add it to LVM, set up the VG / LV, format it as ext4 and mount it… then use rsync to move the data over.
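
A rough sketch of those steps (assuming the new vdisk shows up as /dev/sdb - check lsblk first; the device name, VG/LV names and mount point are just examples):

lsblk
sudo pvcreate /dev/sdb
sudo vgcreate storj-vg /dev/sdb
sudo lvcreate -l 100%FREE -n storj-lv storj-vg
sudo mkfs.ext4 /dev/storj-vg/storj-lv
sudo mkdir -p /mnt/storj
sudo mount /dev/storj-vg/storj-lv /mnt/storj
# copy the node data, preserving permissions; repeat while the node runs, then stop it and do a final pass
sudo rsync -aHAX --info=progress2 /media/storj/TrueNAS/StorJ_Data/ /mnt/storj/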

SMB isn't well suited to disk images though; it can be slow…

A better option is to export the ZFS pool via ZFS over iSCSI and then map that into the Datacentre Storage, but setting that up is way outside the scope of this forum…

Alternatively, and much simpler: set up iSCSI on the TrueNAS box, export a disk image, then again map that in the Datacentre view using plain iSCSI.
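
On the Proxmox side that mapping can also be done from the CLI, for example (the storage ID, portal IP and target IQN below are made up - use the values from your own TrueNAS iSCSI share):

pvesm add iscsi truenas-iscsi --portal 192.168.1.10 --target iqn.2005-10.org.freenas.ctl:storj
pvesm status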

CP

I created the StorJ VM directly on TrueNAS and created a zvol for this VM - thanks for the tips!

I tried installing TrueNAS directly on bare hardware, but I did not like TrueNAS virtualization (compared to Proxmox), and I had various networking problems and GPU passthrough issues.
Edit: I tried this about 6 months ago. So StorJ was working for about 6 months on this VM + SMB setup, apart from the random QUIC issues.

So I went back to using Proxmox and running TrueNAS as a VM.

The HBA PCIe cards are passed through to TrueNAS, so all hard drives are ‘directly’ connected to TrueNAS.

I hope this setup works better than the previous one.

Network filesystems are not supported.

This will not work. The "database is locked" error means that access to the database is locked because the data location is too slow (this is expected when you use any network filesystem, which is why they are not supported).

It seems you messed something up if it lost access. However, again, since you use a network filesystem, it stopped working.

Please either use a virtual disk for your VM, or passthrough, or at least iSCSI, or even better - move the storagenode out of the VMs and run it as a container directly on your Proxmox host.
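
If you go the "directly on Proxmox" route, it is essentially the same docker run as earlier in this thread, just with the identity and data paths pointing at a local dataset on the host (the /tank/storj paths below are only examples):

sudo docker run -d --restart unless-stopped --stop-timeout 300 \
-p 28967:28967/tcp -p 28967:28967/udp -p 127.0.0.1:14002:14002 \
-e WALLET="0x..." -e EMAIL="you@example.com" -e ADDRESS="js-storj.ddns.net:28967" -e STORAGE="6TB" \
--mount type=bind,source="/tank/storj/identity",destination=/app/identity \
--mount type=bind,source="/tank/storj/data",destination=/app/config \
--name storagenode storjlabs/storagenode:latest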

I’m also having issues with similar error messages. I too was storing my databases over SMB on TrueNAS and have been running that way for years with ZERO issues. Now suddenly I’m having trouble. I’ve tried moving the databases to the host VM running on Proxmox but I still can’t get my node back online.

2024-04-30 02:16:14,045 INFO success: processes-exit-eventlistener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-30 02:16:14,045 INFO success: storagenode entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-30 02:16:14,045 INFO success: storagenode-updater entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-30T02:16:23Z    ERROR   failure during run      {"Process": "storagenode", "error": "Error opening database on storagenode: database: garbage_collection_filewalker_progress opening file \"config/storage/garbage_collection_filewalker_progress.db\" failed: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:364\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:341\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:316\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:281\n\tmain.cmdRun:65\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267", "errorVerbose": "Error opening database on storagenode: database: garbage_collection_filewalker_progress opening file \"config/storage/garbage_collection_filewalker_progress.db\" failed: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:364\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:341\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:316\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:281\n\tmain.cmdRun:65\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:267"}
Error: Error opening database on storagenode: database: garbage_collection_filewalker_progress opening file "config/storage/garbage_collection_filewalker_progress.db" failed: database is locked
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:364
        storj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:341
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:316
        storj.io/storj/storagenode/storagenodedb.OpenExisting:281
        main.cmdRun:65
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:393
        storj.io/common/process.cleanup.func1:411
        github.com/spf13/cobra.(*Command).execute:983
        github.com/spf13/cobra.(*Command).ExecuteC:1115
        github.com/spf13/cobra.(*Command).Execute:1039
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:267

SQLite over SMB is a horrible idea. Don’t do that. Read section 2.1 here: How To Corrupt An SQLite Database File

Storagenode over SMB is not supported either. The only exception is iSCSI.

Run the storagenode directly on TrueNAS.

but it worked for ages?

Well, even a broken clock shows the correct time twice a day. It broke now because traffic picked up and all the concurrency issues got a better chance to manifest themselves.

Move storagenode to TrueNAS.