Storagenode continuously restarts after auto-update

Three days ago, I noticed that my docker container keeps restarting. I’ve run this node for 11 months and had no issues with it until now.
Is there any way I can bring this node back online?

2024-09-24 09:30:10,561 INFO Set uid to user 0 succeeded
2024-09-24 09:30:10,564 INFO RPC interface 'supervisor' initialized
2024-09-24 09:30:10,564 INFO supervisord started with pid 1
2024-09-24 09:30:11,566 INFO spawned: 'processes-exit-eventlistener' with pid 9
2024-09-24 09:30:11,568 INFO spawned: 'storagenode' with pid 10
2024-09-24 09:30:11,568 INFO spawnerr: command at '/app/config/bin/storagenode-updater' is not executable
2024-09-24T09:30:11Z    INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2024-09-24T09:30:11Z    INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2024-09-24T09:30:11Z    INFO    Operator email  {"Process": "storagenode", "Address": "***"}
2024-09-24T09:30:11Z    INFO    Operator wallet {"Process": "storagenode", "Address": "***"}
2024-09-24 09:30:12,681 INFO success: processes-exit-eventlistener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-09-24 09:30:12,681 INFO success: storagenode entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-09-24 09:30:12,681 INFO spawnerr: command at '/app/config/bin/storagenode-updater' is not executable
2024-09-24T09:30:14Z    INFO    server  kernel support for tcp fast open unknown        {"Process": "storagenode"}
2024-09-24T09:30:14Z    INFO    Telemetry enabled       {"Process": "storagenode", "instance ID": "1XT***"}
2024-09-24T09:30:14Z    INFO    Event collection enabled        {"Process": "storagenode", "instance ID": "1XT***"}
2024-09-24 09:30:14,724 INFO spawnerr: command at '/app/config/bin/storagenode-updater' is not executable
2024-09-24T09:30:16Z    INFO    db.migration    Database Version        {"Process": "storagenode", "version": 61}
2024-09-24T09:30:17Z    INFO    preflight:localtime     start checking local system clock with trusted satellites' system clock.        {"Process": "storagenode"}
2024-09-24T09:30:17Z    INFO    preflight:localtime     local system clock is in sync with trusted satellites' system clock.    {"Process": "storagenode"}
2024-09-24 09:30:17,909 INFO spawnerr: command at '/app/config/bin/storagenode-updater' is not executable
2024-09-24 09:30:17,909 INFO gave up: storagenode-updater entered FATAL state, too many start retries too quickly
2024-09-24T09:30:17Z    INFO    Node 1XT*** started {"Process": "storagenode"}
2024-09-24T09:30:17Z    INFO    Public server started on [::]:28967     {"Process": "storagenode"}
2024-09-24T09:30:17Z    INFO    Private server started on 127.0.0.1:7778        {"Process": "storagenode"}
2024-09-24T09:30:17Z    INFO    trust   Scheduling next refresh {"Process": "storagenode", "after": "5h52m48.801279588s"}
2024-09-24T09:30:17Z    INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "C4EOZECC3VP7VAOIE2QEUXFMNYAH2RDXSEMBYXWFEQQQOD6NMDJQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 362240, "Size": 181504, "Remote Address": "172.17.0.1:46124"}
2024-09-24 09:30:17,910 WARN received SIGQUIT indicating exit request
2024-09-24 09:30:17,910 INFO waiting for processes-exit-eventlistener, storagenode to die
2024-09-24T09:30:17Z    INFO    collector       expired pieces collection started       {"Process": "storagenode"}
2024-09-24T09:30:17Z    INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-09-24T09:30:17Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-09-13T17:59:59Z", "Filter Size": 16267347, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-09-24T09:30:17Z    INFO    bandwidth       Persisting bandwidth usage cache to db  {"Process": "storagenode"}
2024-09-24T09:30:17Z    INFO    lazyfilewalker.trash-cleanup-filewalker starting subprocess     {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-09-24T09:30:17Z    ERROR   lazyfilewalker.trash-cleanup-filewalker failed to start subprocess      {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-09-24T09:30:17Z    ERROR   servers unexpected shutdown of a runner {"Process": "storagenode", "name": "debug", "error": "debug: listener closed", "errorVerbose": "debug: listener closed\n\tstorj.io/drpc/drpcmigrate.init:17\n\truntime.doInit1:7176\n\truntime.doInit:7143\n\truntime.main:253"}
2024-09-24T09:30:17Z    ERROR   pieces:trash    emptying trash failed   {"Process": "storagenode", "error": "pieces error: lazyfilewalker: context canceled", "errorVerbose": "pieces error: lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkCleanupTrash:195\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:436\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:84\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2024-09-24T09:30:17Z    ERROR   gracefulexit:chore      error retrieving satellites.    {"Process": "storagenode", "error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:197\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-24T09:30:17Z    ERROR   piecestore:cache        error during init space usage db:       {"Process": "storagenode", "error": "piece space used: context canceled", "errorVerbose": "piece space used: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceSpaceUsedDB).Init:55\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:60\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-24T09:30:17Z    ERROR   collector       error during expired pieces collection  {"Process": "storagenode", "count": 0, "error": "pieces error: pieceexpirationdb: context canceled", "errorVerbose": "pieces error: pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).getExpiredPaginated:103\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:66\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpiredBatchSkipV0:612\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-24T09:30:17Z    ERROR   nodestats:cache Get pricing-model/join date failed      {"Process": "storagenode", "error": "context canceled"}
2024-09-24T09:30:17Z    ERROR   version failed to get process version info      {"Process": "storagenode", "error": "version checker client: Get \"https://version.storj.io\": context canceled", "errorVerbose": "version checker client: Get \"https://version.storj.io\": context canceled\n\tstorj.io/storj/private/version/checker.(*Client).All:68\n\tstorj.io/storj/private/version/checker.(*Client).Process:89\n\tstorj.io/storj/private/version/checker.(*Service).checkVersion:104\n\tstorj.io/storj/private/version/checker.(*Service).CheckVersion:78\n\tstorj.io/storj/storagenode/version.(*Chore).checkVersion:115\n\tstorj.io/storj/storagenode/version.(*Chore).RunOnce:71\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/version.(*Chore).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-24T09:30:17Z    ERROR   collector       error during collecting pieces:         {"Process": "storagenode", "error": "pieces error: pieceexpirationdb: context canceled", "errorVerbose": "pieces error: pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).getExpiredPaginated:103\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:66\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpiredBatchSkipV0:612\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-24T09:30:17Z    ERROR   gracefulexit:blobscleaner       couldn't receive satellite's GE status  {"Process": "storagenode", "error": "context canceled"}
2024-09-24T09:30:17Z    ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "C4EOZECC3VP7VAOIE2QEUXFMNYAH2RDXSEMBYXWFEQQQOD6NMDJQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 362240, "Size": 181504, "Remote Address": "172.17.0.1:46124", "error": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled", "errorVerbose": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:146\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:64\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:666\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:302\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:62\n\tstorj.io/common/experiment.(*Handler).HandleRPC:43\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:166\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:108\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:156\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-09-24T09:30:17Z    INFO    lazyfilewalker.gc-filewalker    starting subprocess     {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-09-24T09:30:17Z    ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "C4EOZECC3VP7VAOIE2QEUXFMNYAH2RDXSEMBYXWFEQQQOD6NMDJQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 724480, "Size": 181504, "Remote Address": "172.17.0.1:46154", "error": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled", "errorVerbose": "untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:146\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:64\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:666\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:302\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:62\n\tstorj.io/common/experiment.(*Handler).HandleRPC:43\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:166\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:108\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:156\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2024-09-24T09:30:18Z    ERROR   lazyfilewalker.gc-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
2024-09-24T09:30:18Z    ERROR   pieces  lazyfilewalker failed   {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkSatellitePiecesToTrash:163\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:575\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:380\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-24T09:30:18Z    ERROR   filewalker      failed to get progress from database    {"Process": "storagenode"}
2024-09-24T09:30:18Z    ERROR   retain  retain pieces failed    {"Process": "storagenode", "cachePath": "config/retain", "error": "retain: filewalker: context canceled", "errorVerbose": "retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:181\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:582\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:380\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:265\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-09-24T09:30:19Z    ERROR   failure during run      {"Process": "storagenode", "error": "debug: listener closed", "errorVerbose": "debug: listener closed\n\tstorj.io/drpc/drpcmigrate.init:17\n\truntime.doInit1:7176\n\truntime.doInit:7143\n\truntime.main:253"}
Error: debug: listener closed
2024-09-24 09:30:19,536 WARN stopped: storagenode (exit status 1)
2024-09-24 09:30:19,537 WARN stopped: processes-exit-eventlistener (terminated by SIGTERM)

Hi @Mowd, it looks like this is related to the recent Docker changes, see:

Or

1 Like

There is a message “INFO spawnerr: command at ‘/app/config/bin/storagenode-updater’ is not executable” in the log.
So I checked the files and their permissions: storagenode is 100 and storagenode-updater is 000, which is incorrect.
After a chmod to 755, everything is back to normal, thanks!
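
For anyone else hitting this, the check and fix look roughly like this on the host (a sketch; replace <storage-dir> with the directory you mount into the container as /app/config, and "storagenode" with your container name):

ls -l <storage-dir>/bin/
# storagenode-updater showed mode 000 in my case; restore read/execute on both binaries
chmod 755 <storage-dir>/bin/storagenode <storage-dir>/bin/storagenode-updater
docker restart storagenode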

Such a breaking change, shipped without any precaution before the auto-update, is a little bit unacceptable; it almost made my suspension score drop below 70%.

1 Like

It’s a bug. Bugs happen. I recommend not automatically updating to any new software, and specifically to any new container image, unless you have time to ensure it works as expected. With auto-updates enabled you have to live on someone else’s unpredictable schedule. That is not sustainable.

Storj uses a safe approach when rolling out storage node updates: they roll them out to the QA satellite first, and then slowly and gradually deploy to a widening audience, in batches, so that if something goes sideways, the damage can be contained.

But if everyone auto-updates to a bad container, this mechanism goes out the window and the whole network can die. It’s a horrible idea. Because there is no gradual deployment mechanism for the container image, auto-update must be disabled.

And there is no point in updating the docker image in the first place. Why would you do it? The node works, right? Why did you feel the need to update, let alone auto-update?

Well, it still seems to be recommended by the official Storj documentation:

We recommend that you install Watchtower in the event that the base image needs updating.
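
For reference, the Watchtower deployment from the docs looks roughly like the following; treat it as a sketch rather than a copy of the documentation, and note that the image name and the "storagenode" / "watchtower" container names are assumptions from my setup:

docker run -d --restart=always --name watchtower \
  -v /var/run/docker.sock:/var/run/docker.sock \
  storjlabs/watchtower storagenode watchtower --stop-timeout 300s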

2 Likes

Oops. That should be fixed, for all the reasons above: a bad container update can nuke the whole network.

@Alexey ?

1 Like

I followed the official Storj doc to install Watchtower because, a very long time ago, there was a connection issue with my container, and I then realized I had left my container way behind the latest version.
Returning to a previous version of the docker image would be simple, but the problem is that Storj didn’t keep the old images on Docker Hub. The previous image on Docker Hub is from 7 months ago, which is not usable. If Storj preserved one or two previous images on Docker Hub, things would be much easier and some of us wouldn’t lose score.
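
As a stopgap, if the previously working image is still cached on the host, the container can be re-created from its local image ID instead of the :latest tag (a sketch; keep whatever flags and mounts you normally pass to docker run):

docker images --digests storjlabs/storagenode   # find the ID/digest of the older, working image
docker stop storagenode && docker rm storagenode
docker run -d --name storagenode <your usual flags and mounts> <IMAGE_ID>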

Honestly, there are plenty of disadvantages to this “containerize everything” approach. There are some good uses, but I have yet to see the benefits in the context of running nodes.

It adds overhead, it adds points of failure, and it adds this “update this thing through a process Storj can’t control that can kill an unknown number of nodes, and pray” angle.

  • A lot of people run Docker on Windows. Docker on Windows is pure cancer. Why do that? Storj provides native Windows executables.
  • Storagenode is written in Go. Go produces self-contained, monolithic executables with no external dependencies, so the dependency isolation that containerization provides is not beneficial.
  • Another benefit of containers, the deployment and update pipeline, is not only not useful here but dangerous. Controlling updates is so important that storagenode implements its own update deployment mechanism, as opposed to Docker’s all-now-or-never approach.
  • Then you have to deal with mounts and databases. Nobody guarantees SQLite integrity if it is used over a mount to an unknown filesystem. Nobody. This is literally asking for corruption (a quick integrity spot-check is sketched right after this list). I recommend this read: How To Corrupt An SQLite Database File
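
If you suspect a database already took a hit, the spot-check looks roughly like this (a sketch; the path is an assumption based on the default layout, where the .db files sit in the storage directory):

sqlite3 /path/to/storage/bandwidth.db "PRAGMA integrity_check;"
# a healthy database prints: ok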

So, why use it in the first place? You are not creating a cluster of microservices that can be moved around the cloud. You are running an app that is chained to a massive pool of storage unique to it; it’s not going anywhere, it will always be here. You are not isolating dependencies, and you are not managing updates. Why is that monster in the way?

I don’t, and never will, use a container when it’s not needed, and that goes for hosting storagenode specifically. I had no issues running it as-is, with rc.d. I did have issues with a broken storagenode-updater on FreeBSD, but I guess Storj does not care about the four FreeBSD users; that’s fine.

I recommend everyone do the same. Storj providing a container image is a cop-out: “here, we don’t trust you to read the --help pages of our software, so just use a container.” Instead, they could provide installer/updater packages for a handful of OSes. There are not many of those:

  • macOS
  • FreeBSD
  • RedHat/Oracle/Fedora linux
  • Debian (doh! the pi crowd)
  • And ok, ok, Windows.

This eliminates all the issues above. Completely. Storagenode becomes a first-class citizen on the OS, benefits from the scheduler, logging, filesystem, and stability of the host OS, and does not carry the overhead of another VM (on Windows and macOS) or an unpredictable bug-deployment schedule Storj can’t control.

I seriously, strongly dislike it when people shove containers where they don’t belong.

This is weird, because the entrypoint has an explicit chmod u+x command:

So I would like to know: how did the permissions there end up wrong?
Does it mean that the default file permissions on your drive are 0000, specifically for the data location?

It’s spread over a random interval between 12 and 72 hours, so we should collect enough information if we were to see a large number of offline nodes. And most SNOs have monitoring for online status and uptime.

So far only a small number of nodes are affected: some QNAP, some Synology, and some custom setups.

I would prefer to find a workaround for these nodes and then either update the documentation or file a bug report on GitHub to try another solution, rather than disable automatic updates.

I run my node on Synology DSM 7.2.1; the base path is /volume2/storj.
The permissions are:

d---------+  1 root root  186 Jun 14 13:56 storj
d---------+ 1 root root   186 Jun 14 13:56 .
drwxr-xr-x  1 root root   108 Aug 23 09:32 ..
d---------+ 1 root root   126 Sep 25 13:58 config
drwxrwxrwx+ 1 root root    66 Aug 23 09:32 @eaDir
drwx------  1 root root    50 Nov 17  2023 identity
d---------+ 1 root root   126 Sep 25 13:58 .
d---------+ 1 root root   186 Jun 14 13:56 ..
d---------+ 1 root root    60 Sep 21 10:09 bin
----------+ 1 root root 10775 Nov 17  2023 config.yaml
d---------+ 1 root root    26 Nov 17  2023 orders
d---------+ 1 root root   300 Sep 25 11:07 retain
----------+ 1 root root 32768 Sep 24 19:28 revocations.db
d---------+ 1 root root   886 Sep 25 15:08 storage
----------+ 1 root root   933 Sep 25 13:58 trust-cache.json
d---------+ 1 root root       60 Sep 21 10:09 .
d---------+ 1 root root      126 Sep 25 13:58 ..
-rwxr-xr-x  1 root root 78710968 Sep 21 10:09 storagenode
-rwxr-xr-x  1 root root 32505984 Sep 21 10:09 storagenode-updater

Maybe it’s the Synology default ACL and some permission conflict I do not understand.
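
If it helps with debugging, the Synology ACLs behind those “+” markers can be inspected on the host; a sketch, assuming synoacltool is available on DSM as it is on mine:

sudo synoacltool -get /volume2/storj/config
sudo synoacltool -get /volume2/storj/config/bin/storagenode-updater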

Bingo: it has no permissions by default.
How did this happen?
I would suggest changing them to at least chmod u+rw or chmod +rw so you don’t have issues in the future.
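
A minimal sketch of that fix, assuming the /volume2/storj layout shown above (the capital X adds execute/search only to directories and already-executable files, so data files stay non-executable):

# this can take a while on a large storage folder
sudo chmod -R u+rwX /volume2/storj/config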

1 Like