Node temporarily offline after each bandwidth usage rollup

Hello,

In the last month, the online % of all my nodes has started decreasing significantly. They are definitely online and have been running 100% smoothly for the past 3-4 months. In the logs I noticed that after each rollup, which happens every hour, satellite pings fail for anywhere from 30 seconds up to 5 minutes. I have three nodes in Docker on a Synology NAS, all accessing one hard drive (via USB3). I have tried restarting them at different times, so the rollup does not occur at the same time for each of them, but there was no change in behavior.
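For reference, this is roughly how I staggered the restarts from the Synology shell (storagenode1/2/3 are just placeholders for my actual container names):

docker restart storagenode1
sleep 1200    # wait ~20 minutes so the hourly rollups are offset
docker restart storagenode2
sleep 1200
docker restart storagenode3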

The order of events is as follows:

INFO bandwidth Performing bandwidth usage rollups
ERROR contact:service ping satellite failed (0-10sec)
INFO orders.xxx sending
INFO orders.xxx finished
ERROR contact:service ping satellite failed, "attempts": 7 (offline 1-5mins)
INFO piecestore downloaded/put/get (standard operation until the next rollup)

The interesting part is that the "offline period" doesn't seem to depend on the amount of data transferred to/from the node. Even a small node with a daily average of 200MB ingress / 10MB egress is easily offline for 6 minutes every hour, while a stronger node with a daily average of 4GB in / 0.5GB out may be offline for "only" about a minute.

I don't have logs going far enough back to identify the exact point when it started. Of course, I've been tinkering with the Docker network and additional containers, so there is a chance I have somehow screwed it up. The v1.22.2 upgrade also happened around the same time.

Thanks

Hi MarP, welcome!

What is the ID of the satellite that fails to contact your node? Is it always the same satellite?

Do you see any other errors related to orders?

No other errors. Everything is fine except the online %. I just rebooted the whole NAS. After the nodes boot up, the first rollup results in the following failed satellite pings:

node1 (1min offline)
1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE
12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs

node2 (4mins offline) - my smallest node
12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs
121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6
12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S

node3 (3min offline) - my biggest node
12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo
121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6
After that, everything runs smoothly.

The full error is as follows:
2021-03-03T00:17:55.843Z ERROR contact:service ping satellite failed {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 5, "error": "ping satellite error: failed to dial storage node (ID: 12JzetysKqiQGGnV4WpHJYegeqZxRRfq9ZzT3YvbWuARg2AzMBf) at address DDNSNAME.synology.me:28967: rpc: dial tcp WANIP:28967: i/o timeout", "errorVerbose": "ping satellite error: failed to dial storage node (ID: 12JzetysKqiQGGnV4WpHJYegeqZxRRfq9ZzT3YvbWuARg2AzMBf) at address DDNSNAME.synology.me:28967: rpc: dial tcp WANIP:28967: i/o timeout\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:141\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Your node stops responding to pings from the satellites. I suspect some overload of your NAS. It is also possible that your external IP changes too often and your DDNS provider does not update it fast enough.
Maybe you should try a different DDNS provider.
I checked your real DDNS hostname (I took it from the satellite) and your port 28967 is closed right now.
You can compare the result of the command

nslookup DDNSNAME.synology.me 8.8.8.8

with the real WAN IP on your router and the IP shown on Open Port Check Tool - Test Port Forwarding on Your Router; they should match.
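For example, a quick way to compare them from a shell (ifconfig.me is just one of many public "what is my IP" services; any similar one works):

nslookup DDNSNAME.synology.me 8.8.8.8   # the IP your DDNS hostname currently resolves to
curl -s https://ifconfig.me             # your WAN IP as seen from the internet
# the two addresses should be identical; if they differ, the DDNS record is stale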

The port checker says the port is closed, yet the node is happily doing put/get/repair/delete. https://storjnet.info/ping_my_node pings and dials the nodes successfully.

NAS statistics show memory/CPU/disk usage in the lowest quarter. The external IP has been the same for a couple of years, and the built-in Synology DDNS is very reliable; they have a solid business built on it.

It is configured properly. The node config is simple and straightforward. I just don't get the intermittent loss of connection. The only finicky device on my network I don't trust is the super cheap coax router from my ISP.

I have put my NAS into the DMZ zone on the router, and that makes all the nodes work smoothly. It is an increased security risk, but at least it will let the nodes recover for a couple of days before I try to "fix it".

And within 5 minutes of exposing my NAS there was an SSH attack attempt via port 22. I'm closing everything except the node ports until I figure out the way forward.

Please do not use a DMZ without a firewall and fail2ban; it's dangerous. You should also use key-only authentication for the SSH server.
It's better to set up port forwarding instead.
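For example, on a generic Linux host the key-only SSH setup looks roughly like this (Synology DSM manages SSH through its own control panel, so treat these lines as a sketch rather than exact DSM steps; admin and NAS_IP are placeholders):

ssh-keygen -t ed25519 -f ~/.ssh/nas_key          # generate a key pair on your workstation
ssh-copy-id -i ~/.ssh/nas_key.pub admin@NAS_IP   # install the public key on the NAS
# then, in /etc/ssh/sshd_config on the NAS, set:
#   PasswordAuthentication no
#   PermitRootLogin no
# and reload the SSH service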

Thanks, all the security measures were in place. SSH was disabled. This was just a temporary workaround. I'm glad that I found what was causing it. A factory reset of the router and reconfiguring it made everything work again with port forwarding.
