1.55.1 Problems: ERROR contact:service ping satellite failed?

svet0slav · June 4, 2022, 1:50pm

Since the recent update, some of my nodes log this error and go offline; restarting the service brings them back. I run the nodes as a service on Ubuntu. I had no such issues on the previous version.

Jun 04 13:43:20 server storagenode[175963]: 2022-06-04T13:43:20.793Z ERROR contact:service ping satellite failed {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "attempts": 7, "error": "ping satellite: failed to dial storage node (ID: XXX) at address subdomain.domain.tld:28967: rpc: tcp connector failed: rpc: context deadline exceeded", "errorVerbose": "ping satellite: failed to dial storage node (ID: XXX) at address subdomain.domain.tld:28967: rpc: tcp connector failed: rpc: context deadline exceeded\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:139\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:98\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Any idea how to diagnose what is causing it? Several nodes run on the same physical server; they go offline at random and the journal is full of such errors. There were no known network or connectivity interruptions on the infrastructure. This happens only on this server, and it hits the STORJ nodes at random: sometimes node1, sometimes node3 on the same server, and so on.
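One way to start diagnosing a journal full of these errors is to tally them by hour and by underlying cause. The following is only a rough sketch, assuming the log lines look like the one quoted above (timestamp, `ERROR contact:service ping satellite failed`, then a JSON object); the function names `parse_ping_failure` and `summarize` are made up for illustration and are not part of storagenode.

```python
import json
import re
from collections import Counter

# Matches "<timestamp> ERROR contact:service ping satellite failed {json}".
# The pattern is an assumption based on the log line quoted in this thread.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)\s+ERROR\s+"
    r"contact:service\s+ping satellite failed\s+(?P<fields>\{.*\})"
)


def parse_ping_failure(line):
    """Return (hour, satellite_id, cause), or None if the line does not match."""
    # Logs pasted through a forum may carry curly quotes; normalize them first.
    line = line.replace("\u201c", '"').replace("\u201d", '"')
    match = LINE_RE.search(line)
    if match is None:
        return None
    try:
        fields = json.loads(match.group("fields"))
    except json.JSONDecodeError:
        return None
    hour = match.group("ts")[:13]  # e.g. "2022-06-04T13"
    # Keep only the last segment of the error chain as the cause.
    cause = fields.get("error", "").rsplit(": ", 1)[-1]
    return hour, fields.get("Satellite ID", "unknown"), cause


def summarize(lines):
    """Count ping failures per hour and per cause across an iterable of lines."""
    by_hour, by_cause = Counter(), Counter()
    for line in lines:
        parsed = parse_ping_failure(line)
        if parsed is not None:
            hour, _satellite, cause = parsed
            by_hour[hour] += 1
            by_cause[cause] += 1
    return by_hour, by_cause
```

Feeding it the lines of a saved journal export (for example via `open(path)`) shows whether the failures cluster at particular hours and whether the cause is a dial timeout or something else, which helps separate a connectivity problem from a software one.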

The impact on the nodes’ reputation is terrible. Example:

Please help me diagnose and fix this. I do not think it is something on my end. All nodes were working fine before the last update.

As a temporary workaround I could restart the services regularly with cron, but I do not want a hack. I need a real solution.

Also, what about monitoring? I am using the Telegram bot, but it does not notify me when a node goes offline this way. Why? It used to work and notify me. Weird… Something is wrong with this new update 1.55.1, I think.

deathlessdd · June 4, 2022, 2:42pm

You seem to be a few updates behind. We are going to 1.56.3; the current version is 1.55.1.

svet0slav · June 4, 2022, 2:56pm

Typo. 1.55.1 is what I meant, indeed. When is 1.56.3 coming? I hope it solves my problem…

JMCD · June 5, 2022, 8:34am

Exactly the same problem on my side, also with v1.55.1. It stopped working a few days ago, with no changes to my configuration.

This is quite disappointing because the problem is killing my nodes’ reputation.

Alexey · June 5, 2022, 11:35am

This is a network issue and is unrelated to the storagenode version. Either the satellite cannot reach your node or the node doesn’t respond.
You can see when your node did not answer audit requests from the satellites using these scripts:

Perhaps your domain was not updated to a new public IP, or there is an issue with your firewall, router, or ISP.

svet0slav · June 6, 2022, 3:36am

Then why was it working without issues on previous versions? Weird.

No. No issues there, nor was anything changed.

Alexey · June 6, 2022, 5:14am

If the satellite cannot contact your node (your node doesn’t answer), the version doesn’t matter.
I would suggest rebooting your router first and monitoring your logs for a while.
Do you have UptimeRobot configured?

svet0slav · June 6, 2022, 5:38am

Router uptime: 23w6d12h20m2s, without issues until the last update of the STORJ node software. I do not think it is the router. There are no other problems on the entire infrastructure; this is related to the STORJ nodes only. Today it happened on 2 nodes again. For now I would rather set up a cron job to restart them every hour, since a restart takes less than a second. Until this is solved, that is what I will do. As you can see, I am not the only one complaining about it, so please check the update for bugs, @Alexey. Thanks!

Alexey · June 6, 2022, 6:42am

I cannot confirm or reproduce the issue with 1.55.1. None of my nodes has such issues, so it must be a local problem.
Please start your investigation with the DDNS hostname: try using the public IP instead of the hostname to rule out a problem with the DDNS updater. Of course, if your IP changes, your node will go offline, but at least this would rule out the DDNS updater as the source of the “ping satellite failed” errors.
If you still see them even when you use a public IP, then the problem is somewhere between your node and the satellite: the firewall, the router, or your ISP.
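The hostname-versus-IP comparison described above can be roughed out in a few lines. This is a sketch under stated assumptions, not an official Storj tool: the helper names (`resolve_ipv4`, `port_is_open`, `check_address`) are hypothetical, port 28967 is taken from the log line earlier in the thread, and the check should be run from a machine outside the node's own network, since a connection from inside the LAN may succeed even when outside access is blocked.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_address(hostname, expected_ip, port=28967):
    """Compare the DNS answer with the expected public IP and probe the port."""
    resolved = resolve_ipv4(hostname)
    return {
        "resolved": resolved,
        "dns_matches": expected_ip in resolved,
        "port_open": port_is_open(expected_ip, port),
    }
```

Calling `check_address` with the node's hostname and its known public IP separates the two failure modes: `dns_matches` being false points at the DNS or DDNS record, while `port_open` being false points at the firewall, router, or ISP. A successful TCP connect only shows that the port is reachable; it does not prove the node answers the satellite correctly.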

svet0slav · June 6, 2022, 8:02am

The IPs won’t change. They are static. The DNS records for the static IPs are hosted at Cloudflare with the proxy turned off, so Cloudflare is used only as a free DNS server.

Then why is this happening after the update and did not happen before?

svet0slav · June 6, 2022, 8:07am

Sure. Always a local problem… You see up there? I am not the only one complaining about this! And I did not change anything in the configuration, just like @JMCD said.

Please take the matter seriously instead of pointing at our end as the origin of the issue. It does not look like that is the cause.

Alexey · June 6, 2022, 8:54am

I took it seriously. “ping satellite failed” is a network issue. There is no other way, sorry.

JMCD · June 6, 2022, 5:47pm

What does this mean?

ping satellite: check-in ratelimit: node rate limited by id

2022-06-06T17:45:09.737Z ERROR contact:service ping satellite failed {"Process": "storagenode", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "attempts": 10, "error": "ping satellite: check-in ratelimit: node rate limited by id", "errorVerbose": "ping satellite: check-in ratelimit: node rate limited by id\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:136\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:98\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Alexey · June 6, 2022, 5:56pm

When your node tries to check in with the satellite but the satellite cannot reach your node, the node keeps retrying the check-in while the underlying issue remains unfixed, so the satellite starts to throttle it.
Once you fix the connectivity issue, those errors should go away after a while.

JMCD · June 6, 2022, 7:08pm

I think I found the problem. Without any notice, my Internet carrier activated CGNAT (public IP sharing between routers), so no ports can be accessed from outside. I have requested them to revert the setting. I will confirm resolution in the next 24 hours.

@svet0slav, check this with your Internet Service Provider.
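A quick way to spot CGNAT is to compare the WAN address shown in the router's status page with the public IP the outside world sees. The sketch below assumes you have looked up both values yourself; `classify_wan_address` is a hypothetical helper, and the only hard fact it relies on is that 100.64.0.0/10 is the shared address space reserved for carrier-grade NAT (RFC 6598). Some carriers use ordinary private ranges instead, which the second branch covers.

```python
import ipaddress

# Shared address space reserved for carrier-grade NAT (RFC 6598).
CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")


def classify_wan_address(wan_ip, public_ip):
    """Classify the router's WAN address relative to the observed public IP.

    wan_ip:    address shown on the router's WAN interface
    public_ip: address an external service reports for your connection
    """
    wan = ipaddress.ip_address(wan_ip)
    if wan in CGNAT_RANGE:
        return "cgnat"  # carrier NAT: inbound ports cannot be forwarded
    if wan.is_private:
        return "private"  # another NAT layer sits in front of the router
    if wan != ipaddress.ip_address(public_ip):
        return "mismatch"  # WAN looks public but differs from what outsiders see
    return "direct"  # router holds the public IP; port forwarding can work
```

Anything other than `"direct"` suggests that inbound connections to the node's port are being blocked upstream of the router, which only the ISP can change.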

svet0slav · June 7, 2022, 6:22am

My error is different. Pointed out above.

svet0slav · June 7, 2022, 6:26am

Then please explain to me why it was working well before this update to 1.55.1. Today I saw 3 nodes offline. I restarted the node services and they work fine again. The server is in a data center with 60+ ISPs and a switch for all the servers we have there. Quite uninterruptible, if you ask me. There are no other network issues on the entire infrastructure, so it must be something specific to that server. I set up cron to restart the nodes every hour, so most probably I will see them online every time I check now. Any better ideas?

Alexey · June 7, 2022, 6:31am

Could you please try using an IP address instead of the DNS hostname in your node’s configuration? Maybe Cloudflare started to throttle DNS resolution requests?

As you can see, there is almost no change in the storagenode code: Release v1.55.1 · storj/storj · GitHub
There is also no change in the auditor or repair worker code, only in reporting and the web UI.

So the update is irrelevant. Something changed in your setup: either a provider changed something or some hardware has started to have problems.
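To test the DNS-throttling guess before switching the node to a raw IP, one could time a handful of lookups of the node's hostname. This is a minimal sketch, assuming the hostname is the one configured for the node; `time_lookups` is an illustrative name, and local resolver caching can hide slow or failing upstream lookups, so treat clean results with some caution.

```python
import socket
import time


def time_lookups(hostname, attempts=5, pause=1.0):
    """Resolve hostname repeatedly; return a list of (seconds, ok) tuples."""
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            socket.getaddrinfo(hostname, None)
            ok = True
        except socket.gaierror:
            ok = False  # resolution failed (for example timeout or NXDOMAIN)
        results.append((time.monotonic() - start, ok))
        if i < attempts - 1:
            time.sleep(pause)  # space the queries out a little
    return results
```

Repeated failures or lookups that take several seconds would support the DNS theory; consistently fast, successful lookups point back at the network path to the node.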

svet0slav · June 7, 2022, 6:48am

I will try this and see… OK. Thanks!

Nothing changed on the server, just regular OS updates. The server is in perfect health and in very good condition.

JMCD · June 7, 2022, 7:44pm

Confirmed resolution when ISP removed me from CGNAT. Node back online.