julez
November 11, 2024, 9:40am
1
Node updated automatically yesterday and has not been working since.
Logs show pieces downloading and uploading prior to the update. The version then updated from 1.114.6 to 1.115.5 at 18:35 on 10th Nov.
After this, the node restarted and re-read the configuration as usual. It then failed to resolve the server address; the logs show a “server misbehaving” error. This was possibly an issue with the DDNS name, which occurs occasionally.
I switched to a different DDNS name, which resolves OK. The node is now unable to accept connections on the service port, resulting in “connection reset by peer” errors. Even local connections to the service port result in “connection reset by peer”. The dashboard port has no issues and connects fine. Error messages are below.
2024-11-11T08:52:51Z ERROR contact:service ping satellite failed {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "attempts": 1, "error": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: read tcp 10.55.10.14:53316->X.X.X.X:28967: read: connection reset by peer", "errorVerbose": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: read tcp 10.55.10.14:53316->X.X.X.X:28967: read: connection reset by peer\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:209\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:157\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:102\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
curl localip:28967
curl: (56) Recv failure: Connection reset by peer
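For reference, the two curl failures that appear in this thread point at different situations. A small sketch, with meanings per the EXIT CODES section of curl's man page:

```shell
# Classify the curl exit codes seen in this thread (see "EXIT CODES" in
# man curl). Exit 7 and exit 56 indicate different failure modes.
explain_curl_exit() {
  case "$1" in
    0)  echo "success: something answered on the port" ;;
    7)  echo "connection refused: nothing is listening on that port" ;;
    56) echo "connection reset by peer: the TCP connection was accepted, then dropped" ;;
    *)  echo "other failure: check man curl for exit code $1" ;;
  esac
}

explain_curl_exit 56
# prints: connection reset by peer: the TCP connection was accepted, then dropped
```

Exit 56 (what the node shows here) means something accepted the TCP connection and then dropped it, while exit 7 means nothing was listening at all; the distinction matters later in the thread.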
After switching logging to DEBUG level, the log shows lots of expired pieces being deleted, which seemed odd too. For now I’ve stopped everything.
Any ideas?
Alexey
November 11, 2024, 3:38pm
2
Hello @julez ,
Welcome back!
What OS is it, and how do you run the node: as a service (Windows/Linux GUI) or in docker?
To me it sounds like a network issue. Could you please try rebooting your host?
julez
November 11, 2024, 5:00pm
3
Thanks @Alexey . It’s running in Docker on Synology. It has been running for 5 years with very few issues over the last few years (mainly DDNS record related).
I also think it’s network related, as I made some network changes two or three weeks ago. However, the logs clearly show everything working until the update to 1.115.5 last night. Then it restarts and starts having issues.
It complained about the DDNS host not resolving. Fine, this happens every now and then with my DDNS provider, so I changed to a backup DDNS service.
The storagenode then started and resolved the node address OK, but I began getting “Connection reset by peer” errors. I searched previous posts and found that I should also be able to connect via HTTP to the service port locally and get some kind of JSON response. So, to remove port forwarding from the equation, I connected locally, but got the same “Connection reset by peer” error from curl to the service port.
Further searching suggested the host could be starved of ports, so I restarted Synology and docker. Same result. The dashboard port on the same container works fine.
I’m really stuck!
julez
November 11, 2024, 5:06pm
4
This is the point at which it all started:
2024-11-10T18:35:02Z INFO piecestore downloaded {"Process": "storagenode", "Piece ID": "XVMONYFRKZNU2I7HAF5YMZM2PENOKFFWDCRSBN2GVN7YR6KWFSHA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 16384, "Remote Address": "172.17.0.1:43726"}
2024-11-10T18:35:02Z INFO Got a signal from the OS: "interrupt" {"Process": "storagenode"}
2024-11-10T18:35:02Z INFO piecestore upload canceled {"Process": "storagenode", "Piece ID": "NYSW672XMGCDAWDAWSNODBGCXM5VWA4YJFZWQZ5LMVRULBVAD3BA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Remote Address": "172.17.0.1:41966", "Size": 458752}
2024-11-10T18:35:02Z INFO piecestore upload canceled {"Process": "storagenode", "Piece ID": "6XW4TFEOF74Q3HFKECQQZFPTCVZQ3DFHZSGOAZEIEB67AHGNARRA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "172.17.0.1:42126", "Size": 262144}
2024-11-10T18:35:02Z INFO piecestore upload canceled {"Process": "storagenode", "Piece ID": "PXLMCKJPAKOT2SQJGNTIAA256AHIKG44TE3TELZBKHZRFQQ6YYMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "172.17.0.1:41256", "Size": 196608}
2024-11-10T18:35:04Z INFO Configuration loaded {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2024-11-10T18:35:04Z INFO Anonymized tracing enabled {"Process": "storagenode"}
2024-11-10T18:35:04Z INFO Operator email {"Process": "storagenode", "Address": "XXXXXX"}
2024-11-10T18:35:04Z INFO Operator wallet {"Process": "storagenode", "Address": "XXXXXX"}
2024-11-10T18:35:05Z INFO Got a signal from the OS: "terminated" {"Process": "storagenode"}
2024-11-10T18:35:05Z ERROR failure during run {"Process": "storagenode", "error": "Invalid configuration: invalid contact.external-address: lookup \"XXXXXX\" failed: lookup XXXXXX on 192.168.1.5:53: server misbehaving\n\tstorj.io/storj/storagenode.(*Config).Verify:172\n\tmain.cmdRun:61\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Invalid configuration: invalid contact.external-address: lookup \"XXXXX\" failed: lookup XXXXX on 192.168.1.5:53: server misbehaving\n\tstorj.io/storj/storagenode.(*Config).Verify:172\n\tmain.cmdRun:61\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:62\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
2024-11-10T18:35:05Z FATAL Unrecoverable error {"Process": "storagenode", "error": "Invalid configuration: invalid contact.external-address: lookup \"XXXXXX\" failed: lookup XXXXX on 192.168.1.5:53: server misbehaving\n\tstorj.io/storj/storagenode.(*Config).Verify:172\n\tmain.cmdRun:61\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Invalid configuration: invalid contact.external-address: lookup \"XXXXXXX\" failed: lookup XXXXX on 192.168.1.5:53: server misbehaving\n\tstorj.io/storj/storagenode.(*Config).Verify:172\n\tmain.cmdRun:61\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:62\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
Alexey
November 11, 2024, 5:08pm
5
This means that the DNS server in your LAN is not working. Please try adding --dns 8.8.8.8 to your docker run command before the image name.
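For illustration only, a minimal sketch of where the flag goes; every value other than --dns below is a placeholder, not julez's actual command:

```shell
# Hypothetical example: --dns is a docker option, so it must appear
# anywhere before the image name (storjlabs/storagenode:latest), not after.
docker run -d --restart unless-stopped \
  --dns 8.8.8.8 \
  -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
  --name storagenode \
  storjlabs/storagenode:latest
```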
julez
November 11, 2024, 5:10pm
6
Thanks, yes that was the problem. I’ve fixed that and now get:
2024-11-11T08:52:51Z ERROR contact:service ping satellite failed {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "attempts": 1, "error": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: read tcp 10.55.10.14:53316->X.X.X.X:28967: read: connection reset by peer", "errorVerbose": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: read tcp 10.55.10.14:53316->X.X.X.X:28967: read: connection reset by peer\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:209\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:157\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:102\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
Alexey
November 11, 2024, 5:14pm
7
DDNS is not the same as DNS. The former dynamically updates your DDNS domain with a new IP; DNS (Domain Name System) resolves domains to IPs.
Could you please show your docker run command?
Do you also have any FATAL or Unrecoverable errors?
julez
November 11, 2024, 5:22pm
8
Yes, sorry for the confusion over DDNS and DNS. The problem with DDNS is that sometimes the DNS records it propagates end up with issues, which then show up in DNS requests that return the wrong info. DNS (locally and externally) is now resolving the server address correctly, so I’m pretty sure it’s not this, as the error messages show the ping satellite attempts using the correct IP.
Docker run command is:
docker run -d --restart unless-stopped --stop-timeout 3000 -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 -e WALLET="XXXX" -e EMAIL="XXX" -e ADDRESS="XXXXX:28967" -e STORAGE="6TB" --mount type=bind,source="/volume1/STORJ/storagenode/data",destination=/app/config --mount type=bind,source="/volume1/STORJ/storagenode/Identity",destination=/app/identity --name storagenode storjlabs/storagenode:latest
No more FATAL errors, just ERROR entries and WARNs about node scores worsening.
julez
November 11, 2024, 6:05pm
9
Shall I just leave it running? Is it some kind of tidy-up after the upgrade that must complete before it starts accepting connections? Should I get a response when connecting to the port via HTTP locally?
Alexey
November 11, 2024, 6:18pm
10
No, it’s unrelated to the update, but likely related to the restart. Did you also update the image?
Do you have a firewall on Synology?
Could you please try to execute in the shell on Synology:
curl localhost:28967
Please also try to add --privileged to your docker run command before the image name.
julez
November 11, 2024, 6:36pm
11
Thanks for your help
From Synology host
curl localhost:28967
curl: (56) Recv failure: Connection reset by peer
With container stopped
curl localhost:28967
curl: (7) Failed to connect to localhost port 28967 after 0 ms: Error
With --privileged flag:
curl: (56) Recv failure: Connection reset by peer
There are lots of other weird ERROR messages:
2024-11-11T18:32:54Z INFO pieces used-space-filewalker started {"Process": "storagenode", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2024-11-11T18:32:54Z INFO lazyfilewalker.used-space-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2024-11-11T18:32:54Z ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "error": "context canceled"}
2024-11-11T18:32:54Z ERROR pieces used-space-filewalker failed {"Process": "storagenode", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Lazy File Walker": true, "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:134\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:743\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:83\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-11-11T18:32:54Z ERROR pieces used-space-filewalker failed {"Process": "storagenode", "Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Lazy File Walker": false, "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:752\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:83\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-11-11T18:32:54Z ERROR piecestore:cache encountered error while computing space used by satellite {"Process": "storagenode", "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:752\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:83\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "SatelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2024-11-11T18:32:54Z ERROR piecestore:cache error getting current used space for trash: {"Process": "storagenode", "error": "filestore error: failed to walk trash namespace a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000: context canceled", "errorVerbose": "filestore error: failed to walk trash namespace a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).SpaceUsedForTrash:296\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:102\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
Not sure if these are because of the startup issues or something that helps solve the problem.
Back in a few hours. Thanks again for your help.
julez
November 12, 2024, 1:03am
12
I’ve just tried to see whether the docker container running storagenode can itself connect to port 28967. I did apt-get install curl within the storagenode container. Also no luck, as follows:
# curl localhost:28967
curl: (7) Failed to connect to localhost port 28967 after 0 ms: Couldn't connect to server
# curl 127.0.0.1:28967
curl: (7) Failed to connect to 127.0.0.1 port 28967 after 0 ms: Couldn't connect to server
Connecting to the Dashboard port 14002 works fine within the container and returns the HTML for the page (I assume this confirms that curl is working OK within the container?)
Whenever I start the container it seems to do a lot of used-space-filewalker and trash-cleanup-filewalker activity. Could it be that it doesn’t start listening until these have completed?
So what port is your node configured to listen on, and on what interface?
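One way to answer that from inside the container, which has no ss or netstat, is to decode /proc/net/tcp: state 0A is LISTEN, and the port in the local_address column is hex (0x7127 is 28967, 0x36B2 is 14002). The sketch below runs against a fabricated sample table; on the real node you would read the live file, e.g. docker exec storagenode cat /proc/net/tcp.

```shell
# Fabricated /proc/net/tcp-style sample for illustration only.
cat > /tmp/tcp.sample <<'EOF'
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt uid
   0: 00000000:36B2 00000000:0000 0A 00000000:00000000 00:00000000 00000000 1000
   1: 00000000:7127 00000000:0000 0A 00000000:00000000 00:00000000 00000000 1000
EOF

# Field 4 is the socket state (0A = LISTEN); field 2 is local addr:port in hex.
awk 'NR > 1 && $4 == "0A" { split($2, a, ":"); print a[2] }' /tmp/tcp.sample |
while read -r hexport; do
  printf '%d\n' "0x$hexport"   # convert the hex port to decimal
done
# prints: 14002 then 28967
```

If 28967 is missing from that list while the dashboard port is present, the node really is not listening on the service port, which matches the exit-7 curl result inside the container.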
Alexey
November 12, 2024, 5:01am
14
This means that storagenode is listening on another port.
Could you please show, from the container:
grep "server.address" /app/config/config.yaml
Please also search for FATAL or Unrecoverable errors:
docker logs storagenode 2>&1 | grep -E "FATAL|Unrecoverable"
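To illustrate what that grep matches, here it is run against a few sample lines (fabricated for this sketch; on the real node you would pipe docker logs storagenode through it as above):

```shell
# Fabricated log lines for illustration only.
cat > /tmp/node.log <<'EOF'
2024-11-10T18:35:05Z FATAL Unrecoverable error {"Process": "storagenode"}
2024-11-11T08:52:51Z ERROR contact:service ping satellite failed
2024-11-11T18:32:54Z INFO pieces used-space-filewalker started
EOF

# -E enables the alternation; lines containing either FATAL or
# Unrecoverable match, while plain ERROR and INFO lines are filtered out.
grep -E "FATAL|Unrecoverable" /tmp/node.log
# prints the single FATAL line
```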
julez
November 12, 2024, 8:12am
15
Thanks. We’re back up and running. It’s a little embarrassing!
Firstly, no FATAL or Unrecoverable errors. But grepping for the server address in config.yaml showed that I had commented it out as part of my initial troubleshooting, while incorrectly removing some old config entries that were being flagged as invalid in the docker logs output:
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "storage2.piece-scan-on-startup"}
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "retain.concurrency"}
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "storage2.trust.exclusions"}
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "server.address"}
2024-11-10T22:05:48Z INFO Invalid configuration file key {"Process": "storagenode-updater", "Key": "storage2.max-concurrent-requests"}
I thought I had checked these against the current defaults at Config.yaml - Linux/Docker all parameters, but it looks like I missed that server.address was still valid. I’m still not sure why my docker logs showed the key as invalid previously but not now.
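For future readers hitting the same symptom, a quick sketch of the check that would have caught this; the sample config below is made up, and on a real node you would grep /app/config/config.yaml inside the container instead:

```shell
# Fabricated config.yaml fragment reproducing the mistake in this thread:
# server.address was commented out during troubleshooting.
cat > /tmp/config.yaml <<'EOF'
# server.address: :28967
storage.allocated-disk-space: 6.00 TB
EOF

# An uncommented "server.address:" line is what the node reads; a leading #
# means the value is ignored (in this thread, that left the node not
# listening on 28967 as expected).
if grep -qE '^[[:space:]]*server\.address:' /tmp/config.yaml; then
  echo "server.address is set"
else
  echo "server.address is commented out or missing"
fi
# prints: server.address is commented out or missing
```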
Hopefully I can recover the node sufficiently, as my online percentage has dropped after being off for 36 hours.
Thanks again for your help.
Alexey
November 12, 2024, 8:24am
16
This one is an INFO from the storagenode-updater. It shares the same config.yaml with storagenode and complains that it doesn’t know some config options. Safe to ignore.
Alexey
November 12, 2024, 8:27am
17
You shouldn’t modify config.yaml if you use docker. It usually handles all changes via the command line options. The command line options take precedence over the options in config.yaml, but if you didn’t specify any, it would use the defaults (you may see them by running the command docker exec -it storagenode /app/bin/storagenode setup --help).