julez
November 11, 2024, 9:40am
1
Node updated automatically yesterday and has not been working since.
Logs show pieces downloading and uploading prior to the update. The version then updated from 1.114.6 to 1.115.5 at 18:35 on 10th Nov.
After this, the node restarted and re-read the configuration as usual. It then failed to resolve the server address; the logs show a “server misbehaving” error. This was possibly an issue with the DDNS name, which occurs occasionally.
I switched to a different DDNS name, which resolves OK. The node is now unable to accept connections on the service port, resulting in “connection reset by peer” errors. Even local connections to the service port result in “connection reset by peer”. The dashboard port has no issues and connects fine. Error messages are below.
2024-11-11T08:52:51Z ERROR contact:service ping satellite failed {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "attempts": 1, "error": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: read tcp 10.55.10.14:53316->X.X.X.X:28967: read: connection reset by peer", "errorVerbose": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: read tcp 10.55.10.14:53316->X.X.X.X:28967: read: connection reset by peer\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:209\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:157\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:102\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
curl localip:28967
curl: (56) Recv failure: Connection reset by peer
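For reference, the two curl failures that appear in this thread point at different situations. A small sketch, with meanings per the EXIT CODES section of curl's man page:

```shell
# Classify the curl exit codes seen in this thread (see "EXIT CODES" in
# man curl). Exit 7 and exit 56 indicate different failure modes.
explain_curl_exit() {
  case "$1" in
    0)  echo "success: something answered on the port" ;;
    7)  echo "connection refused: nothing is listening on that port" ;;
    56) echo "connection reset by peer: the TCP connection was accepted, then dropped" ;;
    *)  echo "other failure: check man curl for exit code $1" ;;
  esac
}

explain_curl_exit 56
# prints: connection reset by peer: the TCP connection was accepted, then dropped
```

Exit 56 (what the node shows here) means something accepted the TCP connection and then dropped it, while exit 7 means nothing was listening at all; the distinction matters later in the thread.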
After switching logging to DEBUG level, the log shows lots of expired pieces being deleted, which seemed odd too. For now I’ve stopped everything.
Any ideas?
Alexey
November 11, 2024, 3:38pm
2
Hello @julez ,
Welcome back!
What OS is it, and how do you run the node: as a service (Windows/Linux GUI) or in docker?
To me it sounds like a network issue. Could you please try rebooting your host?
julez
November 11, 2024, 5:00pm
3
Thanks @Alexey . It’s running in Docker on Synology. It has been running for 5 years with very few issues over the last few years (mainly DDNS record related).
I also think it’s network related, as I made some network changes two or three weeks ago. However, the logs clearly show everything working until the update to 1.115.5 last night. Then it restarts and starts having issues.
It complained about the DDNS host not resolving. Fine, this happens every now and then with my DDNS provider, so I changed to a backup DDNS service.
The storagenode then started and resolved the node address OK, but I began getting “Connection reset by peer” errors. I searched previous posts and found that I should also be able to connect via HTTP to the service port locally and get some kind of JSON response. So, to remove port forwarding from the equation, I connected locally, but got the same “Connection reset by peer” error from curl to the service port.
Further searching suggested the host could be starved of ports, so I restarted Synology and docker. Same result. The dashboard port on the same container works fine.
I’m really stuck!
julez
November 11, 2024, 5:06pm
4
This is the point at which it all started:
2024-11-10T18:35:02Z INFO piecestore downloaded {"Process": "storagenode", "Piece ID": "XVMONYFRKZNU2I7HAF5YMZM2PENOKFFWDCRSBN2GVN7YR6KWFSHA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 16384, "Remote Address": "172.17.0.1:43726"}
2024-11-10T18:35:02Z INFO Got a signal from the OS: "interrupt" {"Process": "storagenode"}
2024-11-10T18:35:02Z INFO piecestore upload canceled {"Process": "storagenode", "Piece ID": "NYSW672XMGCDAWDAWSNODBGCXM5VWA4YJFZWQZ5LMVRULBVAD3BA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Remote Address": "172.17.0.1:41966", "Size": 458752}
2024-11-10T18:35:02Z INFO piecestore upload canceled {"Process": "storagenode", "Piece ID": "6XW4TFEOF74Q3HFKECQQZFPTCVZQ3DFHZSGOAZEIEB67AHGNARRA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "172.17.0.1:42126", "Size": 262144}
2024-11-10T18:35:02Z INFO piecestore upload canceled {"Process": "storagenode", "Piece ID": "PXLMCKJPAKOT2SQJGNTIAA256AHIKG44TE3TELZBKHZRFQQ6YYMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "172.17.0.1:41256", "Size": 196608}
2024-11-10T18:35:04Z INFO Configuration loaded {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2024-11-10T18:35:04Z INFO Anonymized tracing enabled {"Process": "storagenode"}
2024-11-10T18:35:04Z INFO Operator email {"Process": "storagenode", "Address": "XXXXXX"}
2024-11-10T18:35:04Z INFO Operator wallet {"Process": "storagenode", "Address": "XXXXXX"}
2024-11-10T18:35:05Z INFO Got a signal from the OS: "terminated" {"Process": "storagenode"}
2024-11-10T18:35:05Z ERROR failure during run {"Process": "storagenode", "error": "Invalid configuration: invalid contact.external-address: lookup \"XXXXXX\" failed: lookup XXXXXX on 192.168.1.5:53: server misbehaving\n\tstorj.io/storj/storagenode.(*Config).Verify:172\n\tmain.cmdRun:61\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Invalid configuration: invalid contact.external-address: lookup \"XXXXX\" failed: lookup XXXXX on 192.168.1.5:53: server misbehaving\n\tstorj.io/storj/storagenode.(*Config).Verify:172\n\tmain.cmdRun:61\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:62\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
2024-11-10T18:35:05Z FATAL Unrecoverable error {"Process": "storagenode", "error": "Invalid configuration: invalid contact.external-address: lookup \"XXXXXX\" failed: lookup XXXXX on 192.168.1.5:53: server misbehaving\n\tstorj.io/storj/storagenode.(*Config).Verify:172\n\tmain.cmdRun:61\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Invalid configuration: invalid contact.external-address: lookup \"XXXXXXX\" failed: lookup XXXXX on 192.168.1.5:53: server misbehaving\n\tstorj.io/storj/storagenode.(*Config).Verify:172\n\tmain.cmdRun:61\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:62\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
Alexey
November 11, 2024, 5:08pm
5
This means that the DNS server in your LAN is not working. Please try adding --dns 8.8.8.8 to your docker run command before the image name.
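For illustration only, a minimal sketch of where the flag goes; every value other than --dns below is a placeholder, not julez's actual command:

```shell
# Hypothetical example: --dns is a docker option, so it must appear
# anywhere before the image name (storjlabs/storagenode:latest), not after.
docker run -d --restart unless-stopped \
  --dns 8.8.8.8 \
  -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
  --name storagenode \
  storjlabs/storagenode:latest
```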
julez
November 11, 2024, 5:10pm
6
Thanks, yes that was the problem. I’ve fixed that and now get:
2024-11-11T08:52:51Z ERROR contact:service ping satellite failed {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "attempts": 1, "error": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: read tcp 10.55.10.14:53316->X.X.X.X:28967: read: connection reset by peer", "errorVerbose": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: read tcp 10.55.10.14:53316->X.X.X.X:28967: read: connection reset by peer\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:209\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:157\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:102\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
Alexey
November 11, 2024, 5:14pm
7
DDNS is not the same as DNS. The former dynamically updates your DDNS domain with a new IP; DNS (Domain Name System) resolves domains to IPs.
Could you please show your docker run command?
Do you also have any FATAL or Unrecoverable errors?
julez
November 11, 2024, 5:22pm
8
Yes, sorry for the confusion over DDNS and DNS. The problem with DDNS is that sometimes the DNS records it propagates end up with issues, which then show up in DNS requests that return the wrong info. DNS (locally and externally) is now resolving the server address correctly, so I’m pretty sure it’s not this, as the error messages show the ping satellite attempts using the correct IP.
Docker run command is:
docker run -d --restart unless-stopped --stop-timeout 3000 -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 -e WALLET="XXXX" -e EMAIL="XXX" -e ADDRESS="XXXXX:28967" -e STORAGE="6TB" --mount type=bind,source="/volume1/STORJ/storagenode/data",destination=/app/config --mount type=bind,source="/volume1/STORJ/storagenode/Identity",destination=/app/identity --name storagenode storjlabs/storagenode:latest
No more FATAL errors, just ERROR entries and WARNs about node scores worsening.
julez
November 11, 2024, 6:05pm
9
Shall I just leave it running? Is it some kind of tidy-up after the upgrade that must complete before it starts accepting connections? Should I get a response when connecting to the port via HTTP locally?
Alexey
November 11, 2024, 6:18pm
10
No, it’s unrelated to the update, but likely related to the restart. Did you also update the image?
Do you have a firewall on Synology?
Could you please try to execute in the shell on Synology:
curl localhost:28967
Please also try to add --privileged to your docker run command before the image name.
julez
November 11, 2024, 6:36pm
11
Thanks for your help
From Synology host
curl localhost:28967
curl: (56) Recv failure: Connection reset by peer
With container stopped
curl localhost:28967
curl: (7) Failed to connect to localhost port 28967 after 0 ms: Error
With --privileged flag:
curl: (56) Recv failure: Connection reset by peer
There are lots of other weird ERROR messages:
2024-11-11T18:32:54Z INFO pieces used-space-filewalker started {"Process": "storagenode", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2024-11-11T18:32:54Z INFO lazyfilewalker.used-space-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2024-11-11T18:32:54Z ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "error": "context canceled"}
2024-11-11T18:32:54Z ERROR pieces used-space-filewalker failed {"Process": "storagenode", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Lazy File Walker": true, "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:134\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:743\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:83\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-11-11T18:32:54Z ERROR pieces used-space-filewalker failed {"Process": "storagenode", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Lazy File Walker": false, "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:752\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:83\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-11-11T18:32:54Z ERROR piecestore:cache encountered error while computing space used by satellite {"Process": "storagenode", "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:752\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:83\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "SatelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2024-11-11T18:32:54Z ERROR piecestore:cache error getting current used space for trash: {"Process": "storagenode", "error": "filestore error: failed to walk trash namespace a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000: context canceled", "errorVerbose": "filestore error: failed to walk trash namespace a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).SpaceUsedForTrash:296\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:102\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
Not sure if these are because of the startup issues or something that helps solve the problem.
Back in a few hours. Thanks again for your help.
julez
November 12, 2024, 1:03am
12
I’ve just tried to see whether the docker container running storagenode can itself connect to port 28967. I did apt-get install curl within the storagenode container. Also no luck, as follows:
# curl localhost:28967
curl: (7) Failed to connect to localhost port 28967 after 0 ms: Couldn't connect to server
# curl 127.0.0.1:28967
curl: (7) Failed to connect to 127.0.0.1 port 28967 after 0 ms: Couldn't connect to server
Connecting to the Dashboard port 14002 works fine within the container and returns the HTML for the page (I assume this confirms that curl is working OK within the container?)
Whenever I start the container it seems to do a lot of used-space-filewalker and trash-cleanup-filewalker activity. Could it be that it doesn’t start listening until these have completed?
So what port is your node configured to listen on, and on what interface?
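One way to answer that from inside the container, which has no ss or netstat, is to decode /proc/net/tcp: state 0A is LISTEN, and the port in the local_address column is hex (0x7127 is 28967, 0x36B2 is 14002). The sketch below runs against a fabricated sample table; on the real node you would read the live file, e.g. docker exec storagenode cat /proc/net/tcp.

```shell
# Fabricated /proc/net/tcp-style sample for illustration only.
cat > /tmp/tcp.sample <<'EOF'
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt uid
   0: 00000000:36B2 00000000:0000 0A 00000000:00000000 00:00000000 00000000 1000
   1: 00000000:7127 00000000:0000 0A 00000000:00000000 00:00000000 00000000 1000
EOF

# Field 4 is the socket state (0A = LISTEN); field 2 is local addr:port in hex.
awk 'NR > 1 && $4 == "0A" { split($2, a, ":"); print a[2] }' /tmp/tcp.sample |
while read -r hexport; do
  printf '%d\n' "0x$hexport"   # convert the hex port to decimal
done
# prints: 14002 then 28967
```

If 28967 is missing from that list while the dashboard port is present, the node really is not listening on the service port, which matches the exit-7 curl result inside the container.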
Alexey
November 12, 2024, 5:01am
14
This means that storagenode is listening on another port.
Could you please show, from the container:
grep "server.address" /app/config/config.yaml
Please also search for FATAL or Unrecoverable errors:
docker logs storagenode 2>&1 | grep -E "FATAL|Unrecoverable"
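To illustrate what that grep matches, here it is run against a few sample lines (fabricated for this sketch; on the real node you would pipe docker logs storagenode through it as above):

```shell
# Fabricated log lines for illustration only.
cat > /tmp/node.log <<'EOF'
2024-11-10T18:35:05Z FATAL Unrecoverable error {"Process": "storagenode"}
2024-11-11T08:52:51Z ERROR contact:service ping satellite failed
2024-11-11T18:32:54Z INFO pieces used-space-filewalker started
EOF

# -E enables the alternation; lines containing either FATAL or
# Unrecoverable match, while plain ERROR and INFO lines are filtered out.
grep -E "FATAL|Unrecoverable" /tmp/node.log
# prints the single FATAL line
```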
julez
November 12, 2024, 8:12am
15
Thanks. We’re back up and running. It’s a little embarrassing!
Firstly, no FATAL or Unrecoverable errors. But grepping for the server address in config.yaml showed that I had commented it out as part of my initial troubleshooting, while incorrectly removing some old config entries that were being flagged as invalid in the docker logs output:
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "storage2.piece-scan-on-startup"}
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "retain.concurrency"}
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "storage2.trust.exclusions"}
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "server.address"}
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "storage2.max-concurrent-requests"}
I thought I had checked these against the current defaults at Config.yaml - Linux/Docker all parameters, but it looks like I missed that server.address was still valid. I’m still not sure why my docker logs showed the key as invalid previously but not now.
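For future readers hitting the same symptom, a quick sketch of the check that would have caught this; the sample config below is made up, and on a real node you would grep /app/config/config.yaml inside the container instead:

```shell
# Fabricated config.yaml fragment reproducing the mistake in this thread:
# server.address was commented out during troubleshooting.
cat > /tmp/config.yaml <<'EOF'
# server.address: :28967
storage.allocated-disk-space: 6.00 TB
EOF

# An uncommented "server.address:" line is what the node reads; a leading #
# means the value is ignored (in this thread, that left the node not
# listening on 28967 as expected).
if grep -qE '^[[:space:]]*server\.address:' /tmp/config.yaml; then
  echo "server.address is set"
else
  echo "server.address is commented out or missing"
fi
# prints: server.address is commented out or missing
```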
Hopefully I can recover the node sufficiently, as my online percentage has dropped after being off for 36 hours.
Thanks again for your help.
Alexey
November 12, 2024, 8:24am
16
This one is an INFO from the storagenode-updater. It shares the same config.yaml with storagenode and complains that it doesn’t know some config options. Safe to ignore.
Alexey
November 12, 2024, 8:27am
17
You shouldn’t modify config.yaml if you use docker. It usually handles all changes via the command line options. The command line options take precedence over the options in config.yaml, but if you didn’t specify any, it would use the defaults (you may see them by running the command docker exec -it storagenode /app/bin/storagenode setup --help).