Update halts node

For some time now I have been having problems with the stability of my nodes, since they stop frequently.
At first I didn’t even look at the log, but since the situation keeps repeating itself over and over, I checked and I can see the same pattern each time.

The log I get when the equipment stops responding is as follows:

2022-08-02T08:33:39.211Z	INFO	Downloading versions.	{"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2022-08-02T08:33:39.677Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.59.1"}
2022-08-02T08:33:39.677Z	INFO	New version is being rolled out but hasn't made it to this node yet	{"Process": "storagenode-updater", "Service": "storagenode"}
2022-08-02T08:33:39.694Z	INFO	Current binary version	{"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.59.1"}
2022-08-02T08:33:39.694Z	INFO	New version is being rolled out but hasn't made it to this node yet	{"Process": "storagenode-updater", "Service": "storagenode-updater"}
2022-08-02T08:44:32.516Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 12, "error": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout", "errorVerbose": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:139\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:98\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2022-08-02T08:44:42.477Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "attempts": 12, "error": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout", "errorVerbose": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:139\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:98\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2022-08-02T08:45:55.509Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "attempts": 12, "error": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout", "errorVerbose": "ping satellite: failed to dial storage node (ID: 12EAz4dcw7BJPmEWMvDA5gTEbSJFUu55MAbpTkrEkN5kmxH6TNP) at address my.ip:28967: rpc: tcp connector failed: rpc: dial tcp 144.24.192.165:29045: i/o timeout\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:139\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:98\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

I see that version v1.60.3 is being rolled out right now, and it seems that Docker is not applying it correctly and stays in that state forever.

This is my Docker configuration:

docker run -d --restart unless-stopped --stop-timeout 300 \
-p 28967:28967/tcp \
-p 28967:28967/udp \
-p 14002:14002 \
-e WALLET="0x000000" \
-e EMAIL="my@email.com" \
-e ADDRESS="my.ip:28967" \
-e STORAGE="4TB" \
--mount type=bind,source="/mnt/disk1/storj/identity",destination=/app/ident> \
--mount type=bind,source="/mnt/disk1/storj/config",destination=/app/config> \
--name storagenode storjlabs/storagenode:latest
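One generic Docker check (not Storj-specific) to confirm that the ports the container actually publishes match the ADDRESS you advertise is:

```shell
# Show the host ports Docker actually publishes for the container
docker port storagenode

# Print the environment the container was started with and pick out ADDRESS;
# the port here must match one of the published ports above
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' storagenode | grep '^ADDRESS'
```

If the ADDRESS port and the published port disagree, satellites will dial a port nothing is listening on.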

The config that you posted there is either made up to hide your actual settings or simply wrong :slight_smile:

  1. The port listed in the address is incorrect.
  2. In the error message above your actual address is listed as XXX:29045, which shows vastly different settings (including port differences).

Due to at least these two mistakes, the satellites (or anyone else) cannot reach your node successfully, and reachability is crucial for your node to function correctly.

As you are running 100+ nodes, you should know better: hiding this information does not help troubleshooting :wink:


I am using servers that were decommissioned at my company, configuring one node per 4 TB disk, which is why I have several nodes.

The Docker configuration was edited; I didn’t think to check the information coming out of the log, otherwise I would have edited that too.
I will edit it for security, please do the same too :slight_smile:

The problem is not due to having several nodes running at the same time on the same machine; I have been running several nodes for more than two years and this has never happened to me.

It seems that when storagenode-updater launches, it checks version.storj.io for a new version and the node stops; from then on all communication starts to fail.
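The rollout state the updater sees can be checked by hand; version.storj.io is a public endpoint that returns JSON (piping through jq is only a convenience for reading it):

```shell
# Fetch the version server's response, the same endpoint the updater polls
curl -s https://version.storj.io | jq .
```

If the node hangs right after this check, comparing the rollout information there with the node's current version may show whether it is stuck waiting on the rollout.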

It does look like your configuration is still incorrect, given the information in the logs above. Maybe you are pulling a wrong configuration?

This doesn’t look right

Ok, yes, I might have missed that when copying the configuration from my repository.

Anyway, the nodes work correctly as usual, but when it is time to run the updater they stop, start failing the pings, and enter a loop.

The node doesn’t start correctly again until I restart the Docker container manually.
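As a stopgap while this is being investigated, a minimal watchdog sketch could automate that manual restart. This is hypothetical, assuming the web dashboard (port 14002 from the run command above) answering is a reasonable liveness proxy:

```shell
#!/bin/sh
# Hypothetical watchdog: restart the container if the local dashboard
# (published on port 14002 in the docker run command) stops answering.
if ! curl -sf --max-time 10 http://localhost:14002/ >/dev/null; then
    docker restart storagenode
fi
```

Run it from cron every few minutes. It is only a workaround, not a fix for whatever the updater is doing.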

We would need a full log of these nodes (not sanitized) in order to analyze this further.
However, I cannot see this behavior on my 2 remaining docker nodes.

Can I send it to you privately and not expose sensitive data?

You can DM them to me or even better open a support ticket and mention this thread.

Ticket submitted

Here is an image of the log so that you can see when the problem starts.

Once storagenode-updater runs and checks the version, I lose connectivity.

It looks like you are missing some logs somehow. The entire startup logging is missing…
Is this just a docker logs output?

Those are the last 2000 lines from Portainer.
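If only the tail is available there, the full history since the container was created can be pulled straight from Docker instead of Portainer (standard `docker logs` flags):

```shell
# Capture the entire log since the container was (re)created;
# the storagenode writes to both stdout and stderr
docker logs storagenode > node.log 2>&1

# Or grab a larger tail with timestamps for easier correlation
docker logs --tail 10000 --timestamps storagenode 2>&1 | less
```

That should include the startup lines that are missing from the screenshot.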