Yet another "node offline after update" thread

Have you tried this? http://www.timesynctool.com/

Yet another reason to use Linux with NTP. But even Windows has its own NTP service: Configuring Systems for High Accuracy | Microsoft Learn

As I wrote, "out of sync" looks like a consequence of the storagenode init process being killed. NTP is up and running, and after a manual start of the container the node is happy with the date and time.
Please don’t recommend me Windows :joy:
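
For what it's worth, this is roughly how I double-check that the clock is really in sync; which command applies depends on whether the host runs systemd-timesyncd, chrony, or ntpd, so treat it as a sketch:

# systemd hosts: prints "System clock synchronized: yes/no"
timedatectl
# chrony: current offset from the selected time source
chronyc tracking
# classic ntpd: peer list; '*' marks the selected server
ntpq -p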


I need to note that the restart policy of the container is set to "unless-stopped", which is also confirmed by the "docker inspect" command. But the storagenode process was not restarted when it stopped, maybe because storagenode-updater was still up, so the container stayed up?

I will run the "docker inspect" command after this happens again, but I don't expect a different result.

Thx @BrightSilence for the idea.
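
For reference, the relevant fields can be pulled with a Go template instead of reading the whole "docker inspect" output (the container name sn1 below is just an example):

# configured restart policy
docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' sn1
# how many times the docker daemon itself has restarted the container
docker inspect -f '{{ .RestartCount }}' sn1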

Looks like some race condition.
The updater should:

  1. download the storagenode binary and restart it,
  2. check for new versions of storagenode and storagenode-updater,
  3. download the new version of storagenode and restart it,
  4. download the new version of storagenode-updater and restart itself.

But somewhere the updater did not wait for the update process to finish and restarted storagenode and itself at the wrong time, so no service remained running.
The weird thing is why docker did not restart the stopped container.

Yeah, the container was stopped when I started looking into the problem. Is it common that two archives are downloaded at the same time and the relevant processes are restarted: storagenode_linux_amd64.1183188037.zip and storagenode-updater_linux_amd64.4176807791.zip?
Edit: Sorry, the answer is already in your post, @Alexey.

Do you have any idea why the updater can stop with an error? I tried to look at the source on GitHub, but I am not a Go person :slightly_smiling_face:

2023-05-19 08:49:06,105 INFO exited: storagenode-updater (exit status 1; not expected)

I have checked the free space on /tmp: there are a few GB there, and it is not a RAM disk.
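
For anyone checking the same thing, something like this shows both the free space and whether /tmp is actually a tmpfs mount or just a directory on the root filesystem:

df -h /tmp
findmnt -T /tmp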

Yes, my docker node updated without any issues. I have another one which is still on the old version; I will check it too.

nope. Perhaps it was corrupted or improperly unpacked.

Docker uses its own namespace for temp files. But I suppose they should be under /tmp inside the container anyway.
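
So a check on the host alone does not tell the whole story; if the container is up, something like this looks at the /tmp the updater actually writes to (sn1 is a placeholder container name):

# free space as seen from inside the container
docker exec sn1 df -h /tmp
# leftover archives from a previous, possibly failed, update attempt
docker exec sn1 ls -l /tmp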

The best guess I have right now: the updater somehow ended up running the update asynchronously.

So it happened again, on a different node, and it looks like a similar situation. It is driving me nuts: manual intervention is needed, and it is keeping my online score below 100% :slight_smile:

exited: storagenode-updater (exit status 1; not expected)

@Alexey, do you think I should open an issue report on GitHub?

I forgot to "docker inspect" the stopped container.
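
Next time, something like this should capture the interesting part of the stopped container's state before I start it again (sn1 is a placeholder for my container name):

# exit code, whether the kernel OOM-killed it, and when it stopped
docker inspect -f 'status={{ .State.Status }} exit={{ .State.ExitCode }} oom={{ .State.OOMKilled }} finished={{ .State.FinishedAt }}' sn1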

docker log

2023-05-30T12:23:59.159Z INFO Current binary version {“Process”: “storagenode-updater”, “Service”: “storagenode-updater”, “Version”: “v1.78.3”}

2023-05-30T12:23:59.159Z INFO New version is being rolled out but hasn’t made it to this node yet {“Process”: “storagenode-updater”, “Service”: “storagenode-updater”}

2023-05-30T12:38:53.725Z INFO Downloading versions. {“Process”: “storagenode-updater”, “Server Address”: “https://version.storj.io”}

2023-05-30T12:38:59.158Z INFO Current binary version {“Process”: “storagenode-updater”, “Service”: “storagenode”, “Version”: “v1.78.3”}

2023-05-30T12:38:59.158Z INFO Download started. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.79.4/storagenode_linux_amd64.zip”, “To”: “/tmp/storagenode_linux_amd64.3880153684.zip”}

2023-05-30T12:39:10.244Z INFO Download finished. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.79.4/storagenode_linux_amd64.zip”, “To”: “/tmp/storagenode_linux_amd64.3880153684.zip”}

2023-05-30T12:39:10.271Z INFO Restarting service. {“Process”: “storagenode-updater”, “Service”: “storagenode”}

2023-05-30T12:39:10.523Z INFO Service restarted successfully. {“Process”: “storagenode-updater”, “Service”: “storagenode”}

2023-05-30T12:39:10.529Z INFO Current binary version {“Process”: “storagenode-updater”, “Service”: “storagenode-updater”, “Version”: “v1.78.3”}

2023-05-30T12:39:10.530Z INFO Download started. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.79.4/storagenode-updater_linux_amd64.zip”, “To”: “/tmp/storagenode-updater_linux_amd64.2856344600.zip”}

2023-05-30T12:39:10.966Z INFO Download finished. {“Process”: “storagenode-updater”, “From”: “https://github.com/storj/storj/releases/download/v1.79.4/storagenode-updater_linux_amd64.zip”, “To”: “/tmp/storagenode-updater_linux_amd64.2856344600.zip”}

2023-05-30T12:39:10.978Z INFO Restarting service. {“Process”: “storagenode-updater”, “Service”: “storagenode-updater”}

2023-05-30 12:39:10,985 INFO exited: storagenode-updater (exit status 1; not expected)

2023-05-30 12:39:11,324 INFO spawned: ‘storagenode-updater’ with pid 14031

2023-05-30 12:39:11,325 INFO exited: storagenode (exit status 0; expected)

2023-05-30T12:39:11.345Z INFO Configuration loaded {“Process”: “storagenode-updater”, “Location”: “/app/config/config.yaml”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “healthcheck.details”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “server.private-address”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “server.address”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “healthcheck.enabled”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “operator.email”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “operator.wallet”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “contact.external-address”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “storage.allocated-disk-space”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “storage.allocated-bandwidth”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “operator.wallet-features”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file key {“Process”: “storagenode-updater”, “Key”: “console.address”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file value for key {“Process”: “storagenode-updater”, “Key”: “log.caller”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file value for key {“Process”: “storagenode-updater”, “Key”: “log.level”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file value for key {“Process”: “storagenode-updater”, “Key”: “log.output”}

2023-05-30T12:39:11.345Z INFO Invalid configuration file value for key {“Process”: “storagenode-updater”, “Key”: “log.stack”}

2023-05-30 12:39:11,347 INFO spawned: ‘storagenode’ with pid 14037

2023-05-30 12:39:11,347 WARN received SIGQUIT indicating exit request

2023-05-30 12:39:11,350 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die

2023-05-30 12:39:11,354 INFO stopped: storagenode-updater (terminated by SIGTERM)

2023-05-30 12:39:11,355 INFO stopped: storagenode (terminated by SIGTERM)

2023-05-30 12:39:11,357 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

node log

2023-05-30T12:39:11.180Z ERROR collector unable to update piece info {“Process”: “storagenode”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Piece ID”: “UL3JU5D6LVSZBQXYOFZCYAW6I5TO4DNRBONQ2WBNTXANT34H3QIQ”, “error”: “pieceexpirationdb: context canceled”, “errorVerbose”: “pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).DeleteFailed:99\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteFailed:567\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:109\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

2023-05-30T12:39:11.180Z ERROR collector unable to delete piece {“Process”: “storagenode”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Piece ID”: “UL3JU5D6LVSZBQXYOFZCYAW6I5TO4DNRBONQ2WBNTXANT34H3QIQ”, “error”: “pieces error: context canceled; v0pieceinfodb: context canceled”, “errorVerbose”: “pieces error: context canceled; v0pieceinfodb: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteExpired:356\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:335\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:97\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

2023-05-30T12:39:11.182Z ERROR collector unable to update piece info {“Process”: “storagenode”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Piece ID”: “HSSDPSBSI2UQBQ3QMMZZAC6NQKLRVT4XJUOOXU2MAQYZJPR5DQJA”, “error”: “pieceexpirationdb: context canceled”, “errorVerbose”: “pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).DeleteFailed:99\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteFailed:567\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:109\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

2023-05-30T12:39:11.182Z ERROR collector unable to delete piece {“Process”: “storagenode”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Piece ID”: “HSSDPSBSI2UQBQ3QMMZZAC6NQKLRVT4XJUOOXU2MAQYZJPR5DQJA”, “error”: “pieces error: context canceled; v0pieceinfodb: context canceled”, “errorVerbose”: “pieces error: context canceled; v0pieceinfodb: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteExpired:356\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:335\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:97\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

2023-05-30T12:39:11.223Z ERROR collector unable to update piece info {“Process”: “storagenode”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Piece ID”: “W6M2WVBWCU6ZAFIEEXUT6HXR3D4AT3JIZHCCNEEDCEP73VBLYB4Q”, “error”: “pieceexpirationdb: context canceled”, “errorVerbose”: “pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).DeleteFailed:99\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteFailed:567\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:109\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

2023-05-30T12:39:11.223Z ERROR collector unable to delete piece {“Process”: “storagenode”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Piece ID”: “W6M2WVBWCU6ZAFIEEXUT6HXR3D4AT3JIZHCCNEEDCEP73VBLYB4Q”, “error”: “pieces error: context canceled; v0pieceinfodb: context canceled”, “errorVerbose”: “pieces error: context canceled; v0pieceinfodb: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteExpired:356\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:335\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:97\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

2023-05-30T12:39:11.244Z ERROR collector unable to update piece info {“Process”: “storagenode”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Piece ID”: “WLNTC5PPL2IBCGP6ZUXBQWVSFTF26MZTERZ4SLCKU6SX556HTQ4A”, “error”: “pieceexpirationdb: context canceled”, “errorVerbose”: “pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).DeleteFailed:99\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteFailed:567\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:109\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

2023-05-30T12:39:11.244Z ERROR collector unable to delete piece {“Process”: “storagenode”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Piece ID”: “WLNTC5PPL2IBCGP6ZUXBQWVSFTF26MZTERZ4SLCKU6SX556HTQ4A”, “error”: “pieces error: context canceled; v0pieceinfodb: context canceled”, “errorVerbose”: “pieces error: context canceled; v0pieceinfodb: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteExpired:356\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:335\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:97\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

2023-05-30T12:39:11.244Z INFO collector collect {“Process”: “storagenode”, “count”: 3}

2023-05-30T12:39:11.244Z ERROR collector error during collecting pieces: {“Process”: “storagenode”, “error”: “pieceexpirationdb: context canceled”, “errorVerbose”: “pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:39\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpired:541\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:88\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

I have had such an issue in the past too.
I don’t have a log, so I can’t say if I had exactly the same message.
But it happened more than a couple of times and it is risky because updates are frequent.
Every time it happened I simply had to start the node manually, and there was never an issue starting it. It just started and ran without problems.
So yes, maybe there is an issue with the updater that makes the restart after an update fail sometimes.

Yes, I think so. An issue on GitHub will allow the needed information to be collected, and it could then be taken into work.

I do experience the same from time to time, but I could not find out why it's happening or why the docker containers do not restart when it happens, despite being configured to restart "unless-stopped". Actually they're even configured to restart "always" now, but it does not help.

I understand that this won't help you directly, but I wanted to say that in the three years I have been operating a node, I've never had this kind of problem. Likely a different setup, hardware, software, etc.; it has just never happened to me that an upgrade was broken, except for one time where it was clear from the logs what happened. For me the Storj-provided software has been very reliable.

I hope you will figure it out!


Yeah, thank you for the follow-up. These timing-hazard problems (which is what I believe this is) can be triggered by very unrelated parameters, and it looks like I now have the "right environment" for the problem :slight_smile:

I have opened the issue on GitHub, hopefully under the correct project.


Aaand I have the same problem suddenly.
I "reproduced it" on the Windows GUI node.
It's because I changed the location and time zone of the node.
I got a new location (which is +1 hour different from the previous one), new internet, and a new IP here, and I cannot return to the old one.
I'm trying to run the node in the new location.
I'm pretty sure all the files are okay (it just got cut off from the internet while Storj was running, and it can't go back on the old connection).
Is there some way to flush the cache on the satellites? I'm desperately trying to run it, to stop losing online time, but it keeps erroring in the log with "system clock is out of sync with all trusted satellites".
I manually updated to some 1.80.something version, but still…
I'm scanning the disks for errors now… C got none, now the Storj disk…
Windows GUI

If the only error is "clock out of sync", then you may sync the clock (disable sync with internet, apply, enable sync with internet, apply) or even run NetTime (https://www.timesynctool.com/).
But if you have some other errors before that, then the root cause is likely not related to the system clock.
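
On Windows the same can be forced from an elevated command prompt with the built-in w32tm tool instead of the Settings toggle (just a sketch of the standard commands, nothing Storj-specific):

w32tm /resync
:: show the time source, stratum and last successful sync
w32tm /query /status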


Just for reference on the OP: since I did not get any reply on the bug report, I had to add a very nasty command to my crontab. It starts all stopped containers with a name matching a regex; my storagenode containers are named sn%d. I hope this will get diagnosed sometime. It has already started a stopped storagenode container twice.

docker start $(docker ps --format "{{.Names}}" --filter "status=exited" --filter "name=^sn\d+$") > /dev/null
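
In the crontab it looks roughly like this; the 5-minute schedule is just an example, and the job needs to run as a user with access to the docker socket:

# example schedule, adjust to taste
*/5 * * * * docker start $(docker ps --format "{{.Names}}" --filter "status=exited" --filter "name=^sn\d+$") > /dev/null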

Did you try to use --restart always policy in your docker run command?
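
For reference, the policy of an already-created container can also be changed in place, without recreating it (sn1 is a placeholder name):

docker update --restart=always sn1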

Not on the current system, but I did on older servers, with no success. I also can't believe that :slight_smile:
I think my docker setup around storagenode triggers the problem. I am disabling the extra containers to see when the problem goes away, but it will take months to get results.
What I mean by that: I have these containers for each node:

  • storjlabs/storagenode (logs to file node.log)
  • anclrii/storj-exporter (linked to storagenode container)
  • storj-log-exporter (grok exporter) (bound to storagenode path and node.log)
  • linkyard/logrotate (bound to storagenode path and node.log)

Maybe some scenario where storj-exporter pulls the data and an update happens to stop storagenode at the same time?

Unlikely. But it's definitely some race condition, rare and hard to reproduce.
However, I have not figured out why the docker daemon does not restart the stopped container in this case.
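
One thing that might shed some light on it next time: the daemon's own event stream records every stop/die/start decision, including the exit code it saw (sn1 is a placeholder container name, and the 24h window is just an example):

docker events --since 24h --filter container=sn1 --filter event=die --filter event=stop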