Node random crashes and took 1-2 hours to work again

Skysor · February 22, 2025, 9:56am

Hello there.
I run a storj node since a few months.
I run the node with podman quadlets, and it is setup to restart automatically if it crashs.
The problem I encounter is that sometimes it randomly crash, and each restart try fails with a timeout and no app logs are present, and that occurs in a restart loop during ~1-2 hours, and then suddenly the next restart is a success and no more starting timeout.

Some error logs just before the crash of this morning:

févr. 22 07:35:21 fedora-server storj[9206]: 2025-02-22T06:35:21Z        ERROR        services        unexpected shutdown of a runner        {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of stor>
févr. 22 07:35:21 fedora-server storj[9206]: 2025-02-22T06:35:21Z        ERROR        piecestore        error sending hash and order limit        {"Process": "storagenode", "Piece ID": "HRQ4SPX7ECCX3EXUYOUYULWFAUAWPOAUGQX4FFIQYJR7U5GQWCCQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1nt>
févr. 22 07:35:21 fedora-server storj[9206]: 2025-02-22T06:35:21Z        ERROR        piecestore        download failed        {"Process": "storagenode", "Piece ID": "HRQ4SPX7ECCX3EXUYOUYULWFAUAWPOAUGQX4FFIQYJR7U5GQWCCQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk>
...
févr. 22 07:37:13 fedora-server storj[9206]: 2025-02-22T06:37:13Z        ERROR        failure during run        {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed >
févr. 22 07:37:13 fedora-server storj[9206]: Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory
févr. 22 07:37:13 fedora-server storj[9206]: 2025-02-22 06:37:13,348 WARN exited: storagenode (exit status 1; not expected)
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22 06:37:14,397 INFO spawned: 'storagenode' with pid 3858
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Configuration loaded        {"Process": "storagenode", "Location": "/app/config/config.yaml"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22 06:37:14,516 WARN received SIGQUIT indicating exit request
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22 06:37:14,517 INFO waiting for processes-exit-eventlistener, storagenode, storagenode-updater to die
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Anonymized tracing enabled        {"Process": "storagenode"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Operator email        {"Process": "storagenode", "Address": "XXX"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Operator wallet        {"Process": "storagenode", "Address": "XXX"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Got a signal from the OS: "terminated"        {"Process": "storagenode-updater"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22 06:37:14,583 INFO stopped: storagenode-updater (exit status 0)
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        INFO        Got a signal from the OS: "terminated"        {"Process": "storagenode"}
févr. 22 07:37:14 fedora-server storj[9206]: 2025-02-22T06:37:14Z        ERROR        failure during run        {"Process": "storagenode", "error": "Error opening database on storagenode: database: info opening file \"config/storage/info.db\" failed: context canceled\n\tstorj.io/stor>
févr. 22 07:37:14 fedora-server storj[9206]: Error: Error opening database on storagenode: database: info opening file "config/storage/info.db" failed: context canceled

The restart timeouts during 2 hours:

févr. 22 07:37:15 fedora-server systemd[1499]: storj.service: Consumed 27min 55.252s CPU time, 2.9G memory peak, 56.8M memory swap peak.
févr. 22 07:37:15 fedora-server systemd[1499]: storj.service: Scheduled restart job, restart counter is at 4.
févr. 22 07:37:15 fedora-server systemd[1499]: Starting storj.service - Podman Storj container...
févr. 22 07:37:15 fedora-server podman[724841]: 2025-02-22 07:37:15.714024054 +0100 CET m=+0.082077070 container create 89f411b3c06286f5408a2dab845574bc9a2480203827821cb272f9ce92ecb1c9 (image=docker.io/storjlabs/storagenode:latest, name=storj, PODMAN_SYSTEMD_UNIT=storj.service)
févr. 22 07:37:15 fedora-server podman[724841]: 2025-02-22 07:37:15.668384797 +0100 CET m=+0.036437824 image pull 4c1e65b147c42b70a513cadd66d660871aa7f3d644fb2dde89a5210535d48c9d docker.io/storjlabs/storagenode:latest
févr. 22 07:42:15 fedora-server systemd[1499]: storj.service: start operation timed out. Terminating.
févr. 22 07:42:16 fedora-server systemd[1499]: storj.service: Main process exited, code=exited, status=1/FAILURE
févr. 22 07:42:16 fedora-server podman[726106]: 2025-02-22 07:42:16.328401658 +0100 CET m=+0.258244429 container remove 89f411b3c06286f5408a2dab845574bc9a2480203827821cb272f9ce92ecb1c9 (image=docker.io/storjlabs/storagenode:latest, name=storj, PODMAN_SYSTEMD_UNIT=storj.service)
févr. 22 07:42:16 fedora-server storj[726106]: 89f411b3c06286f5408a2dab845574bc9a2480203827821cb272f9ce92ecb1c9
févr. 22 07:42:16 fedora-server systemd[1499]: storj.service: Failed with result 'timeout'.
févr. 22 07:42:16 fedora-server systemd[1499]: Failed to start storj.service - Podman Storj container.
févr. 22 07:42:16 fedora-server systemd[1499]: storj.service: Consumed 26.778s CPU time, 3.9G memory peak.
févr. 22 07:42:16 fedora-server systemd[1499]: storj.service: Scheduled restart job, restart counter is at 5.
févr. 22 07:42:16 fedora-server systemd[1499]: Starting storj.service - Podman Storj container...
févr. 22 07:42:16 fedora-server podman[726204]: 2025-02-22 07:42:16.697181174 +0100 CET m=+0.070753503 container create b3baa5447a2276b448b6184278468445219b609c6cc986ed1e00aeed6450703a (image=docker.io/storjlabs/storagenode:latest, name=storj, PODMAN_SYSTEMD_UNIT=storj.service)
févr. 22 07:42:16 fedora-server podman[726204]: 2025-02-22 07:42:16.646537597 +0100 CET m=+0.020109926 image pull 4c1e65b147c42b70a513cadd66d660871aa7f3d644fb2dde89a5210535d48c9d docker.io/storjlabs/storagenode:latest
févr. 22 07:47:16 fedora-server systemd[1499]: storj.service: start operation timed out. Terminating.
févr. 22 07:47:16 fedora-server systemd[1499]: storj.service: Main process exited, code=exited, status=1/FAILURE
févr. 22 07:47:17 fedora-server podman[727469]: 2025-02-22 07:47:17.306881096 +0100 CET m=+0.334068440 container remove b3baa5447a2276b448b6184278468445219b609c6cc986ed1e00aeed6450703a (image=docker.io/storjlabs/storagenode:latest, name=storj, PODMAN_SYSTEMD_UNIT=storj.service)
févr. 22 07:47:17 fedora-server storj[727469]: b3baa5447a2276b448b6184278468445219b609c6cc986ed1e00aeed6450703a
févr. 22 07:47:17 fedora-server systemd[1499]: storj.service: Failed with result 'timeout'.
févr. 22 07:47:17 fedora-server systemd[1499]: Failed to start storj.service - Podman Storj container.
févr. 22 07:47:17 fedora-server systemd[1499]: storj.service: Consumed 37.374s CPU time, 4.6G memory peak.

I put the full logs (huge) in the link below, that contains logs before the crash, the 2 hours restart tries, and the logs when the service succeed to get up again:
https://pastebin.com/raw/zk7t7vep

Thanks a lot for your help

mattventura · February 22, 2025, 4:37pm

Not sure if this is related, but I’m experiencing a similar issue. In my case, it says that it can’t find the info.db at all:

2025-02-22 16:30:40,224 INFO spawned: 'storagenode' with pid 110
2025-02-22T16:30:40Z    INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2025-02-22T16:30:40Z    INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2025-02-22T16:30:40Z    INFO    Operator email  {"Process": "storagenode", "Address": "redacted"}
2025-02-22T16:30:40Z    INFO    Operator wallet {"Process": "storagenode", "Address": "redacted"}
2025-02-22T16:30:40Z    ERROR   failure during run      {"Process": "storagenode", "error": "Error opening database on storagenode: database: info opening file \"config/storage/info.db\" failed: unable to open database file: no such file or directory\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabaseWithStat:390\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:367\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:362\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:337\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:302\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272", "errorVerbose": "Error opening database on storagenode: database: info opening file \"config/storage/info.db\" failed: unable to open database file: no such file or directory\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabaseWithStat:390\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:367\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:362\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:337\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:302\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:272"}
Error: Error opening database on storagenode: database: info opening file "config/storage/info.db" failed: unable to open database file: no such file or directory
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabaseWithStat:390
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:367
        storj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:362
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:337
        storj.io/storj/storagenode/storagenodedb.OpenExisting:302
        main.cmdRun:67
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:392
        storj.io/common/process.cleanup.func1:410
        github.com/spf13/cobra.(*Command).execute:985
        github.com/spf13/cobra.(*Command).ExecuteC:1117
        github.com/spf13/cobra.(*Command).Execute:1041
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:272
2025-02-22 16:30:40,317 WARN exited: storagenode (exit status 1; not expected)
2025-02-22 16:30:41,318 INFO gave up: storagenode entered FATAL state, too many start retries too quickly

I docker exec’d into a shell in the container and confirmed that info.db is accessible and readable.

arrogantrabbit · February 22, 2025, 5:01pm

How is storage mounted? Looks like a race between application startup and storage being available. Or just flaky storage.

Skysor · February 22, 2025, 5:21pm

The storage is btrfs filesystem mounted via fstab (its a proxmox virtual disk)
and then bound to the container when started (/spool/storage/storj).
here is my podman quadlet file if it can help:

[Unit]
Description=Podman Storj container

[Container]
ContainerName=storj
Image=docker.io/storjlabs/storagenode:latest
Volume=/apps/storj/identity/storagenode:/app/identity:Z
Volume=/spool/storage/storj:/app/config:Z
PublishPort=28967:28967/tcp
PublishPort=28967:28967/udp
PublishPort=6008:14002
Environment=WALLET=XXX
Environment=EMAIL=XXX
Environment=ADDRESS=XXX.XX:28967
Environment=STORAGE=10TB

[Service]
Restart=always

[Install]
WantedBy=default.target

Alexey · February 23, 2025, 8:50am

Hello @Skysor,
Welcome to the forum!

The reason is

This is mean that your disk is too slow to write a small file to the disk.
See

Unfortunately btrfs is known as too slow for storagenode with a default piecestore.
You may try to improve it, there are several methods:

Skysor · February 23, 2025, 10:58am

Hello @Alexey
Thank you for your help,

After reading your links, I just enabled the badger cache and increased the writeable-dir timeout to 3m (and systemd TimeoutStartSec=300)
I will keep you informed to know if the issue if fixed

Skysor · April 12, 2025, 9:33am

Hello,
The previous issue seems to be resolved, but I have now another issue that happen randomly almost each day or 2: after starting the program, sometimes the logs only shows binary version, up to date, and downloading versions (see the attached full log) and no piece are downloaded, so the node comes offline soon after that.
Sometimes I have to restart manually multiples times until it works again (can be 5 tries, 1 hour …)
Logs are here: https://pastebin.com/raw/274Cx3k0
Please how can I fix this new issue ?
Thank you.

Alexey · April 13, 2025, 2:26am

I can see that the storagenode container is started, and then it should check-in on satellites to check clocks and provide the contact address. However, it hangs on connection to the satellites.
This is the last message from storagenode:

avril 12 07:21:29 fedora-server storj[8597]: 2025-04-12T05:21:29Z        INFO        preflight:localtime        start checking local system clock with trusted satellites' system clock.        {"Process": "storagenode"}

So, please try to figure out, why the connection to the satellites hangs for the container (or maybe for the whole VM), do you maybe have some blocking firewall? Or filter like PiHole or “smart” security on the router?

Please also try to delete the file trust-cache.json from the data location and restart the node.

Skysor · April 16, 2025, 2:22pm

I have an Suricata IPS active on the same VM that filters all network traffic and blocks suspicious packets, but I don’t see any blocked packet during the initialization of the storj container, so it isn’t the problem.
Also if the IPS was the problem, I don’ think storj would work again some time later after multiple restart attemps.

I tried to delete the trust-cache.json but it doesn’t seems to get any better, still stucked and I have to wait maybe 1h before it will start correctly again

arrogantrabbit · April 16, 2025, 2:42pm

Have you tried turning it off and checking, instead of speculating? Not everything is logged.

Suricata is 100% waste of time and energy unless you have a whole team on staff fine-tuning the rules for your environment and emerging threats all the time.

Skysor · April 16, 2025, 4:47pm

Hi, I just stopped suricata and removed nfqueue rules from my iptables, to ensure the IPS is not the cause, and it is not, the problem is still present. I manage this IPS since almost 1 year without major issues and it is configured well for my current apps and network flows, I don’t think its a waste of time, it blocks many blacklisted IP addresses, network scan patterns, etc.