I noticed that some of my nodes get this error after the recent update and I find them offline, so I have to restart the service, and then they work again. I run the nodes as a service on Ubuntu. There were no such issues on the previous version.
Any idea how to diagnose what is causing it? There are several nodes on the same physical server, they randomly go offline, and the journal is full of such errors. There were no known network/connectivity interruptions on the infrastructure. This happens only on this server, and only on STORJ nodes, at random. Sometimes it is node1, sometimes node3 from the same server, etc.
The impact on the nodes' reputation is terrible. Example:
Please help me diagnose and fix this. I do not think it is something on my end. All nodes were working fine before the last update.
As a temporary solution I could restart the services regularly with cron, but I do not want a hack. I need a real solution.
Also, what about monitoring? I am using the Telegram bot, but it does not notify me when a node goes offline this way. Why? It used to work and notify me. Weird… Something is wrong with this new update 1.55.1, I think.
This is a network issue, and unrelated to storagenode version. The satellite cannot reach your node or the node doesn’t respond.
You can see when your node did not answer on audit requests from the satellites using these scripts:
Perhaps your domain was not updated to a new public IP, or there is an issue with your firewall, router, or ISP.
If the satellite cannot contact your node (your node doesn’t answer), the version doesn’t matter.
I would suggest rebooting your router first and monitoring your logs for a while.
Do you have an uptimerobot configured?
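For anyone who wants to see what such an external check does, here is a minimal sketch of a TCP port probe in Python. It is only an illustration, not the UptimeRobot implementation: the function name `is_port_open` is made up for this example, and the host and port in the usage comment are placeholders (28967 is the usual storagenode port, so substitute your own if you changed it). It only tells you something about outside reachability if you run it from a machine outside your own network.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    `is_port_open` is a hypothetical helper written for this post.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (placeholder address, run from OUTSIDE your network):
#   print(is_port_open("node.example.com", 28967))
```

A successful TCP handshake only shows that the port is reachable. It does not prove that the node answers audits correctly, so it complements the logs rather than replacing them.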
Router uptime: 23w6d12h20m2s without issues till the last update of STORJ node software. I do not think it is the router. No other problems on the entire infrastructure. This is STORJ node related only. Today it happened on 2 nodes again. I’d rather set up a cron to restart them every hour. The restart takes like less than a second, so… Till this is solved, that’s what I would do. As you can also see, I am not the only one complaining about it, so please check the update for bugs, @Alexey. Thanks!
I cannot confirm or reproduce the issue with 1.55.1. None of my nodes have such issues, so it must be a local problem.
Please start your investigation with your DDNS hostname. You can try using the public IP instead of the DDNS hostname to rule out a problem with the DDNS updater. Of course, if your IP changes, your node will go offline, but at least you will not see the “ping satellite failed” errors for as long as the IP stays the same.
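To check the DDNS part, you can compare what the hostname resolves to with the WAN IP shown on your router's status page. Below is a rough Python sketch of that comparison. The helper names `resolve_ipv4` and `ddns_is_current` are invented for this example, and the hostname and IP in the usage comment are placeholders. You supply the expected IP yourself; the script does not try to discover it.

```python
import socket
from typing import Callable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the set of IPv4 addresses that `hostname` currently resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def ddns_is_current(
    hostname: str,
    expected_ip: str,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """Return True if `expected_ip` is among the addresses `hostname` resolves to."""
    try:
        return expected_ip in resolver(hostname)
    except socket.gaierror:
        # The hostname did not resolve at all, so it certainly is not current.
        return False


# Example usage (placeholders: use your DDNS name and the WAN IP from your router):
#   print(ddns_is_current("mynode.example.com", "203.0.113.7"))
```

If this returns False, the DDNS record is stale or missing, and the satellites are trying to reach the node at the wrong address.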
If you still see them even when you use a public IP, then it’s something between your node and the satellite - firewall, router or your ISP.
When your node tries to check in on the satellite but the satellite cannot reach your node, the node keeps trying to check in over and over again while the underlying issue remains unfixed, so the satellite starts to throttle it.
When you fix the connectivity issue, those errors should go away after a while.
I think I found the problem. Without any notice, my Internet carrier activated CGNAT (public IP sharing between routers), so no ports can be accessed from outside. I have requested them to revert the setting. I will confirm resolution in the next 24 hours.
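In case it helps others spot the same thing: a quick hint of CGNAT is that the WAN address shown on the router falls inside the shared address space 100.64.0.0/10 reserved by RFC 6598, or inside a private RFC 1918 range. Here is a small Python sketch of that check. The function name `classify_wan_address` and its return labels are my own invention for this example.

```python
import ipaddress

# Shared address space reserved for carrier-grade NAT (RFC 6598).
CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")

# Private IPv4 ranges (RFC 1918); a WAN address here also means you are behind NAT.
RFC1918_RANGES = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def classify_wan_address(ip: str) -> str:
    """Classify the router's WAN address as 'cgnat', 'private', or 'other'.

    Hypothetical helper for illustration; 'other' means no NAT range matched.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        return "other"
    if addr in CGNAT_RANGE:
        return "cgnat"
    if any(addr in net for net in RFC1918_RANGES):
        return "private"
    return "other"


# Example usage (pass the WAN IP shown on your router's status page):
#   print(classify_wan_address("100.72.15.3"))
```

A result of "other" does not rule out CGNAT, since a carrier may use different addresses for it. The more reliable test is whether the router's WAN IP matches the public IP that outside services see for your connection.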
@svet0slav, check this with your Internet Service Provider.
Then please explain to me how it was working well before this update to 1.55.1. Today I saw 3 nodes offline. I restarted the node services and they work fine again. The server is in a data center with 60+ ISPs and a switch for all the servers we have there. Quite uninterruptible, if you ask me. There are no other network issues on the entire infrastructure. It must be something specific to that server. I set up cron to restart the nodes every hour. Most probably I will see them online every time I check now. Any better ideas?