Update halts node

For some time now I have been having problems with the stability of my nodes, since they stop frequently.
At first I didn’t even look at the log, but since the situation keeps repeating itself over and over, I checked and I can see the same pattern each time.

The log I get when the equipment stops responding is as follows:

2022-08-02T08:33:39.211Z	INFO	Downloading versions.	{"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2022-08-02T08:33:39.677Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.59.1"}
2022-08-02T08:33:39.677Z	INFO	New version is being rolled out but hasn't made it to this node yet	{"Process": "storagenode-updater", "Service": "storagenode"}
2022-08-02T08:33:39.694Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.59.1"}
2022-08-02T08:33:39.694Z	INFO	New version is being rolled out but hasn't made it to this node yet	{"Process": "storagenode-updater", "Service": "storagenode-updater"}
2022-08-02T08:44:32.516Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 12, "error": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout", "errorVerbose": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:139\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:98\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2022-08-02T08:44:42.477Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "attempts": 12, "error": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout", "errorVerbose": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:139\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:98\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2022-08-02T08:45:55.509Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "attempts": 12, "error": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout", "errorVerbose": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:139\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:98\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

I see that version v1.60.3 is being rolled out right now, and it seems that Docker is not applying it correctly and stays in that state forever.

This is my Docker configuration:

docker run -d --restart unless-stopped --stop-timeout 300 \
-p 28967:28967/tcp \
-p 28967:28967/udp \
-p 14002:14002 \
-e WALLET="0x000000" \
-e EMAIL="my@email.com" \
-e ADDRESS="my.ip:28967" \
-e STORAGE="4TB" \
--mount type=bind,source="/mnt/disk1/storj/identity",destination=/app/ident> \
--mount type=bind,source="/mnt/disk1/storj/config",destination=/app/config> \
--name storagenode storjlabs/storagenode:latest
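One generic Docker check (not Storj-specific) to confirm that the ports the container actually publishes match the ADDRESS you advertise is:

```shell
# Show the host ports Docker actually publishes for the container
docker port storagenode

# Print the environment the container was started with and pick out ADDRESS;
# the port here must match one of the published ports above
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' storagenode | grep '^ADDRESS'
```

If the ADDRESS port and the published port disagree, satellites will dial a port nothing is listening on.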

The config that you posted there is either made up to hide your actual settings or simply wrong :slight_smile:

  1. The port listed in the address is incorrect.
  2. In the error message above your actual address is listed as XXX:29045, which shows vastly different settings (including port differences).

Due to at least these two mistakes, the satellites (or anyone else) cannot reach your node successfully, and reachability is crucial for your node to function correctly.

As you are running 100+ nodes, you should know better: hiding this information does not help troubleshooting :wink:


I am using servers that were decommissioned at my company, configuring one node per 4 TB disk, which is why I have several nodes.

The Docker configuration was edited; I didn’t think to check the information coming out of the log, otherwise I would have edited that too.
I will edit it for security, please do the same too :slight_smile:

The problem is not due to having several nodes running at the same time on the same machine; I have been running several nodes for more than two years and this has never happened to me.

It seems that when storagenode-updater launches, it checks version.storj.io for a new version and the node stops; from then on all communication starts to fail.
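The rollout state the updater sees can be checked by hand; version.storj.io is a public endpoint that returns JSON (piping through jq is only a convenience for reading it):

```shell
# Fetch the version server's response, the same endpoint the updater polls
curl -s https://version.storj.io | jq .
```

If the node hangs right after this check, comparing the rollout information there with the node's current version may show whether it is stuck waiting on the rollout.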

It does look like your configuration is still incorrect, given the information in the logs above. Maybe you are pulling a wrong configuration?

This doesn’t look right

Ok, yes, I might have missed that when copying the configuration from my repository.

Anyway, the nodes work correctly as usual, but when it is time to run the updater they stop, start failing the pings, and enter a loop.

The node doesn’t start correctly again until I restart the Docker container manually.
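As a stopgap while this is being investigated, a minimal watchdog sketch could automate that manual restart. This is hypothetical, assuming the web dashboard (port 14002 from the run command above) answering is a reasonable liveness proxy:

```shell
#!/bin/sh
# Hypothetical watchdog: restart the container if the local dashboard
# (published on port 14002 in the docker run command) stops answering.
if ! curl -sf --max-time 10 http://localhost:14002/ >/dev/null; then
    docker restart storagenode
fi
```

Run it from cron every few minutes. It is only a workaround, not a fix for whatever the updater is doing.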

We would need a full log of these nodes (not sanitized) in order to analyze this further.
However, I cannot see this behavior on my 2 remaining docker nodes.

Can I send it to you privately and not expose sensitive data?

You can DM them to me or even better open a support ticket and mention this thread.

Ticket submitted

Here is an image of the log so that you can see when the problem starts.

Once storagenode-updater runs and checks the version, I lose connectivity.

It looks like you are missing some logs somehow. The entire startup logging is missing…
Is this just a docker logs output?

Those are the last 2000 lines from Portainer.
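If only the tail is available there, the full history since the container was created can be pulled straight from Docker instead of Portainer (standard `docker logs` flags):

```shell
# Capture the entire log since the container was (re)created;
# the storagenode writes to both stdout and stderr
docker logs storagenode > node.log 2>&1

# Or grab a larger tail with timestamps for easier correlation
docker logs --tail 10000 --timestamps storagenode 2>&1 | less
```

That should include the startup lines that are missing from the screenshot.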