Node random crashes and took 1-2 hours to work again

Hello there.
I run a storj node since a few months.
I run the node with podman quadlets, and it is setup to restart automatically if it crashs.
The problem I encounter is that sometimes it randomly crash, and each restart try fails with a timeout and no app logs are present, and that occurs in a restart loop during ~1-2 hours, and then suddenly the next restart is a success and no more starting timeout.

Some error logs just before the crash of this morning:

févr. 22 07:35:21 fedora-server storj[9206]: 2025-02-22T06:35:21Z        ERROR        services        unexpected shutdown of a runner        {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of stor>
févr. 22 07:35:21 fedora-server storj[9206]: 2025-02-22T06:35:21Z        ERROR        piecestore        error sending hash and order limit        {"Process": "storagenode", "Piece ID": "HRQ4SPX7ECCX3EXUYOUYULWFAUAWPOAUGQX4FFIQYJR7U5GQWCCQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1nt>
févr. 22 07:35:21 fedora-server storj[9206]: 2025-02-22T06:35:21Z        ERROR        piecestore        download failed        {"Process": "storagenode", "Piece ID": "HRQ4SPX7ECCX3EXUYOUYULWFAUAWPOAUGQX4FFIQYJR7U5GQWCCQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk>
...
févr. 22 07:37:13 fedora-server storj[9206]: 2025-02-22T06:37:13Z        ERROR        failure during run        {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed >
févr. 22 07:37:13 fedora-server storj[9206]: Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory
févr. 22 07:37:13 fedora-server storj[9206]: 2025-02-22 06:37:13,348 WARN exited: storagenode (exit status 1; not expected)
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22 06:37:14,397 INFO spawned: 'storagenode' with pid 3858
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Configuration loaded        {"Process": "storagenode", "Location": "/app/config/config.yaml"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22 06:37:14,516 WARN received SIGQUIT indicating exit request
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22 06:37:14,517 INFO waiting for processes-exit-eventlistener, storagenode, storagenode-updater to die
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Anonymized tracing enabled        {"Process": "storagenode"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Operator email        {"Process": "storagenode", "Address": "XXX"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Operator wallet        {"Process": "storagenode", "Address": "XXX"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Got a signal from the OS: "terminated"        {"Process": "storagenode-updater"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22 06:37:14,583 INFO stopped: storagenode-updater (exit status 0)
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Got a signal from the OS: "terminated"        {"Process": "storagenode"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        ERROR        failure during run        {"Process": "storagenode", "error": "Error opening database on storagenode: database: info opening file \"config/storage/info.db\" failed: context canceled\n\tstorj.io/stor>
févr. 22 07:37:14 fedora-server storj[9206]: Error: Error opening database on storagenode: database: info opening file "config/storage/info.db" failed: context canceled

The restart timeouts during 2 hours:

févr. 22 07:37:15 fedora-server systemd[1499]: storj.service: Consumed 27min 55.252s CPU time, 2.9G memory peak, 56.8M memory swap peak.
févr. 22 07:37:15 fedora-server systemd[1499]: storj.service: Scheduled restart job, restart counter is at 4.
févr. 22 07:37:15 fedora-server systemd[1499]: Starting storj.service - Podman Storj container...
févr. 22 07:37:15 fedora-server podman[724841]: 2025-02-22 07:37:15.714024054 +0100 CET m=+0.082077070 container create 89f411b3c06286f5408a2dab845574bc9a2480203827821cb272f9ce92ecb1c9 (image=docker.io/storjlabs/storagenode:latest, name=storj, PODMAN_SYSTEMD_UNIT=storj.service)
févr. 22 07:37:15 fedora-server podman[724841]: 2025-02-22 07:37:15.668384797 +0100 CET m=+0.036437824 image pull 4c1e65b147c42b70a513cadd66d660871aa7f3d644fb2dde89a5210535d48c9d docker.io/storjlabs/storagenode:latest
févr. 22 07:42:15 fedora-server systemd[1499]: storj.service: start operation timed out. Terminating.
févr. 22 07:42:16 fedora-server systemd[1499]: storj.service: Main process exited, code=exited, status=1/FAILURE
févr. 22 07:42:16 fedora-server podman[726106]: 2025-02-22 07:42:16.328401658 +0100 CET m=+0.258244429 container remove 89f411b3c06286f5408a2dab845574bc9a2480203827821cb272f9ce92ecb1c9 (image=docker.io/storjlabs/storagenode:latest, name=storj, PODMAN_SYSTEMD_UNIT=storj.service)
févr. 22 07:42:16 fedora-server storj[726106]: 89f411b3c06286f5408a2dab845574bc9a2480203827821cb272f9ce92ecb1c9
févr. 22 07:42:16 fedora-server systemd[1499]: storj.service: Failed with result 'timeout'.
févr. 22 07:42:16 fedora-server systemd[1499]: Failed to start storj.service - Podman Storj container.
févr. 22 07:42:16 fedora-server systemd[1499]: storj.service: Consumed 26.778s CPU time, 3.9G memory peak.
févr. 22 07:42:16 fedora-server systemd[1499]: storj.service: Scheduled restart job, restart counter is at 5.
févr. 22 07:42:16 fedora-server systemd[1499]: Starting storj.service - Podman Storj container...
févr. 22 07:42:16 fedora-server podman[726204]: 2025-02-22 07:42:16.697181174 +0100 CET m=+0.070753503 container create b3baa5447a2276b448b6184278468445219b609c6cc986ed1e00aeed6450703a (image=docker.io/storjlabs/storagenode:latest, name=storj, PODMAN_SYSTEMD_UNIT=storj.service)
févr. 22 07:42:16 fedora-server podman[726204]: 2025-02-22 07:42:16.646537597 +0100 CET m=+0.020109926 image pull 4c1e65b147c42b70a513cadd66d660871aa7f3d644fb2dde89a5210535d48c9d docker.io/storjlabs/storagenode:latest
févr. 22 07:47:16 fedora-server systemd[1499]: storj.service: start operation timed out. Terminating.
févr. 22 07:47:16 fedora-server systemd[1499]: storj.service: Main process exited, code=exited, status=1/FAILURE
févr. 22 07:47:17 fedora-server podman[727469]: 2025-02-22 07:47:17.306881096 +0100 CET m=+0.334068440 container remove b3baa5447a2276b448b6184278468445219b609c6cc986ed1e00aeed6450703a (image=docker.io/storjlabs/storagenode:latest, name=storj, PODMAN_SYSTEMD_UNIT=storj.service)
févr. 22 07:47:17 fedora-server storj[727469]: b3baa5447a2276b448b6184278468445219b609c6cc986ed1e00aeed6450703a
févr. 22 07:47:17 fedora-server systemd[1499]: storj.service: Failed with result 'timeout'.
févr. 22 07:47:17 fedora-server systemd[1499]: Failed to start storj.service - Podman Storj container.
févr. 22 07:47:17 fedora-server systemd[1499]: storj.service: Consumed 37.374s CPU time, 4.6G memory peak.

I put the full logs (huge) in the link below, that contains logs before the crash, the 2 hours restart tries, and the logs when the service succeed to get up again:
https://pastebin.com/raw/zk7t7vep

Thanks a lot for your help

Not sure if this is related, but I’m experiencing a similar issue. In my case, it says that it can’t find the info.db at all:

2025-02-22 16:30:40,224 INFO spawned: 'storagenode' with pid 110
2025-02-22T16:30:40Z    INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2025-02-22T16:30:40Z    INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2025-02-22T16:30:40Z    INFO    Operator email  {"Process": "storagenode", "Address": "redacted"}
2025-02-22T16:30:40Z    INFO    Operator wallet {"Process": "storagenode", "Address": "redacted"}
2025-02-22T16:30:40Z    ERROR   failure during run      {"Process": "storagenode", "error": "Error opening database on storagenode: database: info opening file \"config/storage/info.db\" failed: unable to open database file: no such file or directory\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabaseWithStat:390\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:367\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:362\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:337\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:302\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272", "errorVerbose": "Error opening database on storagenode: database: info opening file \"config/storage/info.db\" failed: unable to open database file: no such file or directory\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabaseWithStat:390\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:367\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:362\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:337\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:302\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272"}
Error: Error opening database on storagenode: database: info opening file "config/storage/info.db" failed: unable to open database file: no such file or directory
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabaseWithStat:390
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:367
        storj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:362
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:337
        storj.io/storj/storagenode/storagenodedb.OpenExisting:302
        main.cmdRun:67
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:392
        storj.io/common/process.cleanup.func1:410
        github.com/spf13/cobra.(*Command).execute:985
        github.com/spf13/cobra.(*Command).ExecuteC:1117
        github.com/spf13/cobra.(*Command).Execute:1041
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:272
2025-02-22 16:30:40,317 WARN exited: storagenode (exit status 1; not expected)
2025-02-22 16:30:41,318 INFO gave up: storagenode entered FATAL state, too many start retries too quickly

I docker exec’d into a shell in the container and confirmed that info.db is accessible and readable.

How is storage mounted? Looks like a race between application startup and storage being available. Or just flaky storage.

The storage is btrfs filesystem mounted via fstab (its a proxmox virtual disk)
and then bound to the container when started (/spool/storage/storj).
here is my podman quadlet file if it can help:

[Unit]
Description=Podman Storj container

[Container]
ContainerName=storj
Image=docker.io/storjlabs/storagenode:latest
Volume=/apps/storj/identity/storagenode:/app/identity:Z
Volume=/spool/storage/storj:/app/config:Z
PublishPort=28967:28967/tcp
PublishPort=28967:28967/udp
PublishPort=6008:14002
Environment=WALLET=XXX
Environment=EMAIL=XXX
Environment=ADDRESS=XXX.XX:28967
Environment=STORAGE=10TB

[Service]
Restart=always

[Install]
WantedBy=default.target


Hello @Skysor,
Welcome to the forum!

The reason is

This is mean that your disk is too slow to write a small file to the disk.
See

Unfortunately btrfs is known as too slow for storagenode with a default piecestore.
You may try to improve it, there are several methods:

Hello @Alexey
Thank you for your help,

After reading your links, I just enabled the badger cache and increased the writeable-dir timeout to 3m (and systemd TimeoutStartSec=300)
I will keep you informed to know if the issue if fixed

1 Like