Storagenode failed to start after auto update

mgonzalezm · June 15, 2022, 3:33pm

So I wake up this morning and found that one of my nodes have been offline for hours.
This is the relevant part of the logs:

2022-06-15T07:37:22.416Z	INFO	Downloading versions.	{"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2022-06-15T07:37:22.765Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.55.1"}
2022-06-15T07:37:22.773Z	INFO	Download started.	{"Process": "storagenode-updater", "From": "https://github.com/storj/storj/releases/download/v1.56.4/storagenode_linux_arm64.zip", "To": "/tmp/storagenode_linux_arm64.913664633.zip"}
2022-06-15T07:37:27.481Z	INFO	Download finished.	{"Process": "storagenode-updater", "From": "https://github.com/storj/storj/releases/download/v1.56.4/storagenode_linux_arm64.zip", "To": "/tmp/storagenode_linux_arm64.913664633.zip"}
2022-06-15T07:37:28.795Z	INFO	Restarting service.	{"Process": "storagenode-updater", "Service": "storagenode"}
2022-06-15T07:37:30.221Z	INFO	Service restarted successfully.	{"Process": "storagenode-updater", "Service": "storagenode"}
2022-06-15T07:37:30.272Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.55.1"}
2022-06-15T07:37:30.273Z	INFO	Download started.	{"Process": "storagenode-updater", "From": "https://github.com/storj/storj/releases/download/v1.56.4/storagenode-updater_linux_arm64.zip", "To": "/tmp/storagenode-updater_linux_arm64.2550076120.zip"}
2022-06-15 07:37:30,324 INFO exited: storagenode (exit status 0; expected)
2022-06-15 07:37:31,335 INFO spawned: 'storagenode' with pid 6943
2022-06-15 07:37:31,377 WARN received SIGQUIT indicating exit request
2022-06-15 07:37:31,379 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2022-06-15T07:37:31.380Z	INFO	Got a signal from the OS: "terminated"	{"Process": "storagenode-updater"}
2022-06-15T07:37:31.406Z	ERROR	Error updating service.	{"Process": "storagenode-updater", "Service": "storagenode-updater", "error": "context canceled; context canceled", "errorVerbose": "context canceled; context canceled\n\tmain.downloadBinary:59\n\tmain.update:39\n\tmain.loopFunc:32\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tmain.cmdRun:136\n\tstorj.io/private/process.cleanup.func1.4:372\n\tstorj.io/private/process.cleanup.func1:390\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomConfigAndLogger:93\n\tmain.main:20\n\truntime.main:255"}
2022-06-15 07:37:31,419 INFO stopped: storagenode-updater (exit status 0)
2022-06-15 07:37:31,429 INFO stopped: storagenode (terminated by SIGTERM)
2022-06-15 07:37:31,431 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

2022-06-15T07:37:30.121Z	INFO	piecedeleter	delete piece sent to trash	{"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "P647JPBYQKSR3O5NI2GIE5B2FEDCZK7EWG4DQ4T3VN3ZGGEPEFIA"}
2022-06-15T07:37:30.221Z	INFO	Got a signal from the OS: "interrupt"	{"Process": "storagenode"}
2022-06-15T07:37:30.232Z	INFO	piecestore	upload canceled	{"Process": "storagenode", "Piece ID": "R2BXYVISYFKXMLQCXCNGSDX5GEZCQXCJHGFNARFJEPJBPBVEJSNA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT_REPAIR", "Size": 1318912}

it seem that it tried to upgrade from v1.55.1 to v1.56.4 and failed somehow.
what should I do to avoid this?.

littleskunk · June 15, 2022, 3:44pm

Add the restart unless stopped policy to your docker run command

mgonzalezm · June 15, 2022, 4:47pm

right, thank you

20 characters

Pac · October 24, 2022, 10:03am

I’ve been facing a very similar issue on my nodes recently when updating from 1.64.1 to 1.65.1, most of my nodes failed at restarting automatically after upgrading to this version.

Logs are almost identical to @mgonzalezm’s except for “storagenode-updater (exit status 1)” (instead of 0).

Docker logs:

pac@raspberrypi:~ $ sudo docker logs -f storj_node_3 -n 30
[...]
2022-10-24T09:05:02.904Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.64.1"}
2022-10-24T09:05:02.904Z	INFO	New version is being rolled out but hasn't made it to this node yet	{"Process": "storagenode-updater", "Service": "storagenode"}
2022-10-24T09:05:02.943Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.64.1"}
2022-10-24T09:05:02.943Z	INFO	New version is being rolled out but hasn't made it to this node yet	{"Process": "storagenode-updater", "Service": "storagenode-updater"}
2022-10-24T09:20:02.345Z	INFO	Downloading versions.	{"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2022-10-24T09:20:02.905Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.64.1"}
2022-10-24T09:20:02.909Z	INFO	Download started.	{"Process": "storagenode-updater", "From": "https://github.com/storj/storj/releases/download/v1.65.1/storagenode_linux_arm64.zip", "To": "/tmp/storagenode_linux_arm64.607082644.zip"}
2022-10-24T09:20:05.177Z	INFO	Download finished.	{"Process": "storagenode-updater", "From": "https://github.com/storj/storj/releases/download/v1.65.1/storagenode_linux_arm64.zip", "To": "/tmp/storagenode_linux_arm64.607082644.zip"}
2022-10-24T09:20:05.268Z	INFO	Restarting service.	{"Process": "storagenode-updater", "Service": "storagenode"}
2022-10-24T09:20:06.103Z	INFO	Service restarted successfully.	{"Process": "storagenode-updater", "Service": "storagenode"}
2022-10-24T09:20:06.147Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.64.1"}
2022-10-24T09:20:06.147Z	INFO	Download started.	{"Process": "storagenode-updater", "From": "https://github.com/storj/storj/releases/download/v1.65.1/storagenode-updater_linux_arm64.zip", "To": "/tmp/storagenode-updater_linux_arm64.1694335755.zip"}
2022-10-24 09:20:06,622 INFO exited: storagenode (exit status 0; expected)
2022-10-24T09:20:07.045Z	INFO	Download finished.	{"Process": "storagenode-updater", "From": "https://github.com/storj/storj/releases/download/v1.65.1/storagenode-updater_linux_arm64.zip", "To": "/tmp/storagenode-updater_linux_arm64.1694335755.zip"}
2022-10-24 09:20:07,050 INFO spawned: 'storagenode' with pid 5235
2022-10-24 09:20:07,075 WARN received SIGQUIT indicating exit request
2022-10-24 09:20:07,076 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2022-10-24T09:20:07.081Z	INFO	Got a signal from the OS: "terminated"	{"Process": "storagenode-updater"}
2022-10-24T09:20:07.090Z	INFO	Restarting service.	{"Process": "storagenode-updater", "Service": "storagenode-updater"}
2022-10-24 09:20:07,101 INFO stopped: storagenode-updater (exit status 1)
2022-10-24 09:20:07,107 INFO stopped: storagenode (terminated by SIGTERM)
2022-10-24 09:20:07,108 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

Node logs:

[...]
2022-10-24T09:20:06.066Z	INFO	piecestore	downloaded	{"Process": "storagenode", "Piece ID": "KV4PJPU5RKE6HCCCGOXAVORVOF6ENKPZ374FQSEZXXV2HQSGTUHA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2022-10-24T09:20:06.105Z	INFO	Got a signal from the OS: "interrupt"	{"Process": "storagenode"}
2022-10-24T09:20:06.112Z	INFO	piecestore	download canceled	{"Process": "storagenode", "Piece ID": "H2I7V74DAD47L7AMNZWE57KP4ZAOD6FFYHLFHWRW3CAZXKFPJROQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2022-10-24T09:20:06.156Z	INFO	piecestore	upload canceled	{"Process": "storagenode", "Piece ID": "RIE4Q4DNKSYSFK72LI4YUWTYPYFDEE7KA24XJCXCM76WO3UUQECQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 163840}

Docker events:

pac@raspberrypi:~ $ sudo docker events --since 2022-10-24T09:20:02.909Z | head -n 30
2022-10-24T11:20:07.508063314+02:00 container die 3ac0fcb82fc0271e7d35711029d88b0692c7fb7b7ab5bdbd75e480273c23bc7d (exitCode=0, image=storjlabs/storagenode:latest, name=storj_node_3)
2022-10-24T11:20:07.934061579+02:00 network disconnect e17977bbae8e6c56e76403e46379f999ed2a424b37e15d63d9885498bb1b3a2e (container=3ac0fcb82fc0271e7d35711029d88b0692c7fb7b7ab5bdbd75e480273c23bc7d, name=bridge, type=bridge)
2022-10-24T11:20:09.180567073+02:00 container die dea7c603b51e19c2a5a49bc510de0f3a4263058012a9125dca180061825cc9f2 (exitCode=0, image=storjlabs/storagenode:latest, name=storj_node_4)
2022-10-24T11:20:09.429352119+02:00 network disconnect e17977bbae8e6c56e76403e46379f999ed2a424b37e15d63d9885498bb1b3a2e (container=dea7c603b51e19c2a5a49bc510de0f3a4263058012a9125dca180061825cc9f2, name=bridge, type=bridge)
2022-10-24T11:20:09.761899922+02:00 network connect e17977bbae8e6c56e76403e46379f999ed2a424b37e15d63d9885498bb1b3a2e (container=dea7c603b51e19c2a5a49bc510de0f3a4263058012a9125dca180061825cc9f2, name=bridge, type=bridge)
2022-10-24T11:20:10.741520152+02:00 container start dea7c603b51e19c2a5a49bc510de0f3a4263058012a9125dca180061825cc9f2 (image=storjlabs/storagenode:latest, name=storj_node_4)
2022-10-24T11:20:15.175526520+02:00 container die 77224b7d863338193e885c5a5a8af203e2aecfa8290dcbc160ff05a755602fc8 (exitCode=0, image=storjlabs/storagenode:latest, name=storj_node_6)
2022-10-24T11:20:15.420014770+02:00 network disconnect e17977bbae8e6c56e76403e46379f999ed2a424b37e15d63d9885498bb1b3a2e (container=77224b7d863338193e885c5a5a8af203e2aecfa8290dcbc160ff05a755602fc8, name=bridge, type=bridge)

Docker daemon:

pac@raspberrypi:~/storj $ sudo systemctl status docker
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-10-20 21:47:14 CEST; 3 days ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 488 (dockerd)
      Tasks: 208
        CPU: 3min 36.510s
     CGroup: /system.slice/docker.service
[...]

Restarting the node manually does work.
I double checked, and all my nodes do have the docker option --restart unless-stopped, so I’m confused

What’s even more confusing is that some of them did restart automatically, so… I really don’t know what’s up.

All this being said, because all my nodes did update now, and because I did face severe issues with my RPi recently (see Nodes did not restart automatically after probable power outage - #10 by Pac) so I’m not opening a new thread… Maybe it was just a fluke, time will tell. I’ll come back with a dedicated post if future updates keeps failing.

mgonzalezm · October 24, 2022, 4:50pm

What I learnt recently is that the parameter --restart is only valid until the point that you stop the container manually, after that you need to reapply it.

for example if for any reason you did a docker stop node followed by a docker start node you need to run docker update --restart=unless-stopped node for it to automatically restart when updated.

maybe it is you case?

Pac · October 24, 2022, 7:53pm

That’s good to know! oO I had no idea.

However I don’t think I did that. I use a custom shell script which always removes a node’s container and recreate it with all parameters and options, so I’m pretty sure I did not run any docker stopcommand.

I searched my history to be sure, and I could not find any match to docker stop.

Could another process do that? The node updater (I don’t believe that’s how it works)? The watchtower?

Anyway, thx for your insights @mgonzalezm

mihalko · November 18, 2022, 9:57pm

I have the same problem. But in my case, I probably stopped and started the container manually so that could trigger the problem.

Based on the Docker docs:

If you manually stop a container, its restart policy is ignored until the Docker daemon restarts or the container is manually restarted. This is another attempt to prevent a restart loop.

After manual stop, maybe it is better to run it by restarting, not starting?

docker restart (container)

I was not able to find more info. If that happens again, I will try to go with restart allways policy. That means it will be started also on docker daemon start (after manual stop), but that’s OK for me.

docker run --restart allways

Pac · November 19, 2022, 7:46am

That’s a good idea.

I keep having the issue on my end, but I still don’t get why as I never stop my containers.

Maybe I should try the “always” policy too.