Only one (!) satellite (ap1.storj.io) - ping satellite failed except ratelimit - why? - diagnosis needed

peem · January 25, 2023, 8:19am

How can I determine the reason for this?

2023-01-20T18:21:22.410Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 2, "error": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: dial tcp xxx.121.130.122:20005: connect: connection timed out", "errorVerbose": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: dial tcp xxx.121.130.122:20005: connect: connection timed out\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:145\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:100\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-01-20T18:21:25.223Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 3, "error": "ping satellite: check-in ratelimit: node rate limited by id", "errorVerbose": "ping satellite: check-in ratelimit: node rate limited by id\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:139\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:100\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

[image: ap1]

elek · January 25, 2023, 5:25pm

Your storagenode tries to connect to the satellite, which double checks if the address (reported by the storagenode) can be accessed.

This is the address configured on the storagenode side (in the case of Docker, with ADDRESS="xxx:20005").

First, I would try to access the same IP and port from any other machine. With a new enough storagenode, it should respond with a generic JSON message (you can try from a browser: http://xxx.121.130.122:20005).
You may have used a DNS name in the storagenode configuration. In that case it could be a DNS caching issue, so also double-check that it works with the DNS name.
If it works well from a browser, please let me know the NodeID and/or email. I tried to guess the missing xxx in the IP address, but couldn’t find any similar nodes in the satellite log. You can send me a private message, and I will check what the satellite thinks about your node…
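The "try it from another machine" check can also be scripted. Below is a minimal sketch in Python that only tests whether a TCP connection to the node's public address can be opened. It does not speak dRPC, so it is not the same check the satellite performs. The function name `can_connect` and the default timeout are illustrative assumptions.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host (IP or DNS name) and connects.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False
```

Run it from a machine outside your own network, for example `can_connect("<your external IP or DNS name>", 20005)`. A `False` result from outside while the same call succeeds inside the LAN points at port forwarding, the router, or the ISP.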

Alexey · January 26, 2023, 3:30am

Your node doesn’t respond to requests from this satellite. I would also check the firewall, the router (perhaps you need to disable throttling or “smart” protection), and the ISP.

peem · January 26, 2023, 6:48am

It doesn’t respond, but knows about it and writes it to the log? How does it know? From the satellite?
Is the router, firewall or ISP so smart that it only blocks from that one satellite? And only at times?

Given this, how am I still receiving data to store if, according to this satellite, I am offline?

The online score continues to deteriorate:

[image: ap4]

Alexey · January 26, 2023, 8:46am

The node checks in on the satellite, and while the connection to the satellite is open, it can receive score updates and feedback from the satellite about whether your node is reachable at the public address it provided.

You are getting data not from the satellite but from the customers (who obviously have a different IP, and likely a different location, than the satellite). You also transfer data to the customers, not to the satellite.
The only data exchanged between your node and the satellite is for bookkeeping, online checks, audits and repairs. Since your online score is not zero, sometimes your node was able to respond to audits.
You may check with these scripts when this happened and how many audits were missed:

So you need to check why packets from IPs of that satellite are blocked on your side (firewall, router, ISP).

peem · January 26, 2023, 9:13am

And how do the customers know about my node? From the satellite? But the satellite claims I’m offline, so why does it give customers my address and tell them they can store data on my node?

Audit 100% (see illustrations above)

There is no firewall on the OS or on the router, and the ISP is not under my control.

elek · January 26, 2023, 3:58pm

The most likely cause is a DNS issue, at least if you use DNS.

I couldn’t find any similar IP address in the satellite database. If you share the IP or NodeID with me, I can check.

peem · January 26, 2023, 4:50pm

I’ve sent a PM

I switched from:

-e FQDN:PORT

to

-e IP:PORT

Meanwhile:

[image: ap5]

Alexey · January 27, 2023, 4:52am

From the cache, or the name is resolved from time to time, but unreliably. This can happen if the DNS provider has issues.

Please check the number of audit requests and the number of responses in the output of the script above.
The audit score drops when your node responds to an audit request but does not provide the piece, or provides a broken one. It grows with every successful audit.
The online score drops when your node does not respond to an audit request at all. It grows when your node responds to an audit request with any result.

Your node needs to stay available for days before you see the score grow. The online score is calculated over a rolling 30-day window, so the period when the node was offline has to move out of that window.
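To make the requested/responded comparison concrete, here is a minimal Python sketch that summarizes a list of audit windows. It assumes each window is a dict with the `totalCount` and `onlineCount` fields that appear in the script output posted in this thread. The function name `summarize_windows` is hypothetical, and the mean of per-window ratios is only a rough indicator, not necessarily the exact formula the satellite uses.

```python
def summarize_windows(windows):
    """Summarize audit windows: requested, responded, missed, mean per-window ratio."""
    requested = sum(w["totalCount"] for w in windows)
    responded = sum(w["onlineCount"] for w in windows)
    # Per-window share of audit requests the node answered (skip empty windows).
    ratios = [w["onlineCount"] / w["totalCount"] for w in windows if w["totalCount"] > 0]
    mean_ratio = sum(ratios) / len(ratios) if ratios else None
    return {
        "requested": requested,
        "responded": responded,
        "missed": requested - responded,
        "mean_window_ratio": mean_ratio,
    }
```

A window where `onlineCount` equals `totalCount` contributes a ratio of 1.0, and a window with `onlineCount` of 0 contributes 0.0. As old windows leave the 30-day period, they stop contributing to the average.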

hatred · January 27, 2023, 9:20am

I have the same problem, and I also want to switch from FQDN to IP to fix DNS issues.
I am using docker compose.
What line should I add to the config?

    environment:
      - STORAGE=500GB
      - IP=PORT ????

peem · January 27, 2023, 10:27am

Sorry, I don’t know, I don’t use docker compose, just the usual “docker run…”

but switching to IP instead of FQDN didn’t change anything in this case…

peem · January 27, 2023, 10:38am

@elek - did you read my PM?
@Alexey - I executed the script, but it did not fix the problem, it just showed that there was a problem with one satellite (which I’ve known for a week now)

{
  "id": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo",
  "auditHistory": []
}
{
  "id": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
  "auditHistory": []
}
{
  "id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6",
  "auditHistory": [
    {
      "windowStart": "2023-01-21T00:00:00Z",
      "totalCount": 2,
      "onlineCount": 0
    },
    {
      "windowStart": "2023-01-22T00:00:00Z",
      "totalCount": 5,
      "onlineCount": 0
    },
    {
      "windowStart": "2023-01-22T12:00:00Z",
      "totalCount": 4,
      "onlineCount": 0
    },
    {
      "windowStart": "2023-01-23T12:00:00Z",
      "totalCount": 1,
      "onlineCount": 0
    },
    {
      "windowStart": "2023-01-24T00:00:00Z",
      "totalCount": 3,
      "onlineCount": 0
    },
    {
      "windowStart": "2023-01-25T00:00:00Z",
      "totalCount": 4,
      "onlineCount": 0
    },
    {
      "windowStart": "2023-01-25T12:00:00Z",
      "totalCount": 11,
      "onlineCount": 0
    },
    {
      "windowStart": "2023-01-26T00:00:00Z",
      "totalCount": 7,
      "onlineCount": 0
    },
    {
      "windowStart": "2023-01-26T12:00:00Z",
      "totalCount": 6,
      "onlineCount": 0
    },
    {
      "windowStart": "2023-01-27T00:00:00Z",
      "totalCount": 2,
      "onlineCount": 0
    }
  ]
}
{
  "id": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S",
  "auditHistory": []
}
{
  "id": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs",
  "auditHistory": []
}
{
  "id": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB",
  "auditHistory": []
}

peem · January 27, 2023, 2:19pm

@elek

"error": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed

In this message:

is the node pinging the satellite?
or
is the satellite pinging the node?

peem · January 27, 2023, 2:46pm

Specify the range of those IP addresses

Alexey · January 29, 2023, 3:25am

Of course it will not fix the problem, since the problem is somewhere between your node and the satellite, but it will show how many audits were missed and when. If you used DDNS with unstable resolution (like changeip.org or their other domains), this could be the problem.
But you said

So do you still see errors like

?

The satellite is trying to connect to your node to check the connection. It’s not an ICMP ping; it’s a complete dRPC request. In this case your node is not responding to this request, as if the message was dropped somewhere before arriving at the node.

$ nslookup saltlake.tardigrade.io
Server:  UnKnown
Address:  192.168.1.1

Non-authoritative answer:
Name:    saltlake.tardigrade.io
Address:  34.94.153.46

$ nslookup saltlake.tardigrade.io 8.8.8.8
Server:  dns.google
Address:  8.8.8.8

Non-authoritative answer:
Name:    saltlake.tardigrade.io
Address:  34.94.153.46
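The same kind of lookup can be repeated from a script, for example to see whether a DDNS name for the node resolves consistently. This is a minimal Python sketch using only the system resolver. Unlike `nslookup <name> 8.8.8.8`, the standard library cannot query a specific DNS server. The names `resolve_ipv4` and `resolution_is_stable` are hypothetical, and the local resolver cache may hide short-lived failures.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the system resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # Resolution failed: report it as an empty answer.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def resolution_is_stable(hostname, attempts=3, resolver=resolve_ipv4):
    """True if every attempt returns the same non-empty set of addresses."""
    results = [frozenset(resolver(hostname)) for _ in range(attempts)]
    return bool(results[0]) and all(r == results[0] for r in results)
```

Calling `resolve_ipv4` on your node's DNS name should return exactly your current public IP. An empty or changing result would be consistent with the DNS caching or DDNS problems discussed above.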

Alexey · January 29, 2023, 3:29am

It should be

    environment:
      - STORAGE=500GB
      - ADDRESS=231.123.32.89:28967

Replace the fake IP in this example with yours.

Alexey · January 30, 2023, 5:24am

A post was merged into an existing topic: Node got suspended on saltlake

peem · January 30, 2023, 1:43pm

Yes, it continues to happen. The most recent such error:

2023-01-30T12:11:25.658Z	ERROR	contact:service	ping satellite failed 	{"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 2, "error": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: dial tcp xx.xx.xxx.122:20675: connect: connection timed out", "errorVerbose": "ping satellite: failed to ping storage node, your node indicated error code: 0, rpc: tcp connector failed: rpc: dial tcp tcp xx.xx.xxx.122:20675: connect: connection timed out\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:145\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:100\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

I have tried a different port, changing the system DNS, restarting Docker and rebooting the OS, etc., but I still get errors like this…

Alexey · January 31, 2023, 3:30am

Then something is blocking connections from the satellite's IP (maybe even from the whole subnet).
Please also make sure that the IP in the log matches both your WAN IP and the public IP shown by the Open Port Check Tool.
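One way to do that comparison without reading the long log line by eye is to extract the address the satellite tried to dial. The Python sketch below assumes the error text contains `dial tcp <host>:<port>` as in the log lines quoted in this thread. It handles IPv4 addresses and host names only, and the function names are hypothetical.

```python
import re

# Matches e.g. "dial tcp 203.0.113.10:28967" (IPv4 or host name, not bracketed IPv6).
DIAL_RE = re.compile(r"dial tcp (?P<host>[^\s:]+):(?P<port>\d+)")


def dialed_address(log_line):
    """Return (host, port) from the first 'dial tcp host:port' in the line, or None."""
    match = DIAL_RE.search(log_line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def matches_public_address(log_line, public_ip, public_port):
    """True if the dialed address in the log equals the expected public IP and port."""
    return dialed_address(log_line) == (public_ip, public_port)
```

If the address extracted from the log differs from what the port check tool reports, the node is announcing an outdated or wrong external address. If they match, the address itself is fine and the problem lies on the path between the satellite and the node.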

peem · January 31, 2023, 3:33pm

With all due respect, I don’t understand what you are writing to me about

I have only one address and one port: xx.xx.xxx.122:20675

As you can see in the picture, there is no problem with connectivity to five of the satellites, only to one.
[image: ap7]

Then what is wrong with my IP and port? After all, I don’t have different IPs and ports for different satellites!
It’s not hard to guess what the Open Port Check Tool shows for the 633rd time, but if you don’t believe it:
[image: port]

It is on the satellite side that there are different interfaces:
Non-authoritative answer:
Name: ap1.storj.io
Address: 35.189.132.42
Name: ap1.storj.io
Address: 34.80.215.116
Name: ap1.storj.io
Address: 34.92.204.130

Is it possible that the node sometimes responds to a dRPC ping request, but not to the address the satellite expects?