Nodes restarting every few minutes

storaje · September 7, 2024, 5:13pm

All of my nodes were stuck in a simultaneous crash loop this morning. The only thing amiss in the logs is a failure of the version downloader. They’ve finally been up for more than 10 minutes now, but I’m not holding my breath.

2024-09-07T16:03:57Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2024-09-07T16:16:48Z    ERROR   Error retrieving version info.  {"Process": "storagenode-updater", "error": "version checker client: Get \"https://version.storj.io\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "errorVerbose": "version checker client: Get \"https://version.storj.io\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\tstorj.io/storj/private/version/checker.(*Client).All:68\n\tmain.loopFunc:20\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tmain.cmdRun:138\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tmain.main:22\n\truntime.main:271"}
2024-09-07T16:18:38Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2024-09-07T16:18:40Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.111.4"}
2024-09-07T16:18:40Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode"}
2024-09-07T16:18:40Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.111.4"}
2024-09-07T16:18:40Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode-updater"}
2024-09-07T16:34:08Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2024-09-07T16:39:03Z    ERROR   Error retrieving version info.  {"Process": "storagenode-updater", "error": "version checker client: Get \"https://version.storj.io\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "errorVerbose": "version checker client: Get \"https://version.storj.io\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\tstorj.io/storj/private/version/checker.(*Client).All:68\n\tmain.loopFunc:20\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tmain.cmdRun:138\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tmain.main:22\n\truntime.main:271"}
2024-09-07 16:39:05,406 INFO exited: storagenode (exit status 1; not expected)
2024-09-07 16:39:06,594 INFO spawned: 'storagenode' with pid 366
2024-09-07 16:39:06,594 WARN received SIGQUIT indicating exit request
2024-09-07 16:39:06,595 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2024-09-07T16:39:06Z    INFO    Got a signal from the OS: "terminated"  {"Process": "storagenode-updater"}
2024-09-07 16:39:06,712 INFO stopped: storagenode-updater (exit status 0)
2024-09-07 16:39:06,714 INFO stopped: storagenode (terminated by SIGTERM)
2024-09-07 16:39:06,714 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

Knowledge · September 7, 2024, 5:51pm

Are they stable after almost an hour now?

storaje · September 7, 2024, 6:16pm

They were, until 20 minutes ago, when I got the same error and a reset.

Alexey · September 8, 2024, 1:55am

It seems the network is dropping out from time to time, or something is blocking connections.
Please disable any “smart” advanced security features on your router or with your ISP. If you have a PiHole, try disabling it too.
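
One way to check whether the connection really drops intermittently is to poll the version endpoint from the same host and record each failure with a timestamp. The following is only a rough sketch: the `probe` and `watch` helpers, the 10-second timeout, and the 60-second interval are assumptions chosen for illustration, not anything the updater itself uses.

```python
import time
import urllib.request


def probe(url, timeout=10.0):
    """Make one GET request; return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(1)  # make sure at least part of the body arrives
            return True, time.monotonic() - start, f"HTTP {resp.status}"
    except OSError as exc:
        # URLError, HTTPError (4xx/5xx), timeouts and refused connections
        # are all OSError subclasses, so they are reported as failures here.
        return False, time.monotonic() - start, repr(exc)


def watch(url, interval=60.0, count=30, timeout=10.0):
    """Probe `url` `count` times, `interval` seconds apart; return failures."""
    failures = []
    for i in range(count):
        ok, elapsed, detail = probe(url, timeout)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        print(f"{stamp} {'OK  ' if ok else 'FAIL'} {elapsed:6.2f}s {detail}")
        if not ok:
            failures.append((stamp, detail))
        if i < count - 1:
            time.sleep(interval)
    return failures


# Example (runs for about half an hour with these assumed settings):
# watch("https://version.storj.io", interval=60, count=30)
```

If the failure timestamps line up with the restarts, that points toward the network or a filter in front of it; if the probe never fails while the node still restarts, the cause is probably elsewhere.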

storaje · September 8, 2024, 2:04am

Is it expected behavior for the node to restart itself if it cannot reach the version endpoint? Or is that just a coincidence, and another symptom of the real problem?

Alexey · September 8, 2024, 3:47am

I do not think that was the reason. I do not see any messages from the node itself; perhaps you redirected its logs to a file. In that case, I would suggest checking the node's log file to see whether there were other issues.
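
As a starting point for going through a redirected log file, here is a small sketch that counts ERROR and FATAL lines per process and message. It assumes the file uses the same layout as the excerpt in the first post (timestamp, level, message, then a JSON object); the `parse_line` and `count_problems` helpers and the `node.log` path are hypothetical names used only for illustration.

```python
import json
import re
from collections import Counter

# Matches lines such as: 2024-09-07T16:03:57Z  INFO  Downloading versions.  {...}
# Supervisor lines ("2024-09-07 16:39:05,406 INFO ...") do not match and are skipped.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\S+)\s+([A-Z]+)\s+(.*)$")


def parse_line(line):
    """Return (timestamp, level, message, fields), or None if the line does not match."""
    m = LINE_RE.match(line.rstrip("\n"))
    if not m:
        return None
    ts, level, rest = m.groups()
    msg, fields = rest, {}
    brace = rest.find("{")
    if brace != -1:
        try:
            fields = json.loads(rest[brace:])
            msg = rest[:brace]
        except json.JSONDecodeError:
            pass  # no valid JSON tail; keep the whole remainder as the message
    return ts, level, msg.strip(), fields


def count_problems(lines, levels=("ERROR", "FATAL")):
    """Count lines at the given levels, keyed by (process, message)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, level, msg, fields = parsed
        if level in levels:
            counts[(fields.get("Process", "?"), msg)] += 1
    return counts


# Example, with a hypothetical log path:
# with open("node.log", encoding="utf-8") as f:
#     for (process, msg), n in count_problems(f).most_common(10):
#         print(n, process, msg)
```

Errors that appear shortly before each restart and come from the storagenode process itself, rather than the updater, would be the ones worth posting here.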