1.55.1 Problems: ERROR contact:service ping satellite failed?

I did that, and today I found 2 nodes offline, while it did not report anything about it; it thinks they were 100% online. I will have to investigate further on the server itself. UptimeRobot is set to monitor port 28967 on subdomain.domain.tld for each node. Now this is strange…

The configuration is several network adapter ports, each with its own /24 network. The nodes all run on the same port, 28967, each on a different /24 network through a different network adapter port. The nodes run as services.
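
Since all the nodes share port 28967, presumably each node service listens on its own address; a quick way to sanity-check that (just a sketch, nothing node-specific assumed):

# list listening TCP sockets on port 28967; with one node per /24,
# each line should show a different local address, not 0.0.0.0
# (-p needs root to show the owning process)
sudo ss -ltnp 'sport = :28967'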

Have you tried running them on the IP instead of the domain, since the IP is static anyway?

Not yet. I think it is a network configuration conflict in the routing. I am rewriting the netplan config to verify that, instead of using rc.local to do the routing on boot. It was working just fine until about last week.

rc.local example

ip route add default via GW1 dev eno1 table eno1-route
ip rule add from IPv4_address1 lookup eno1-route

ip route add default via GW2 dev eno2 table eno2-route
ip rule add from IPv4_address2 lookup eno2-route

ip route add default via GW3 dev eno3 table eno3-route
ip rule add from IPv4_address3 lookup eno3-route

ip route add default via GW4 dev eno4 table eno4-route
ip rule add from IPv4_address4 lookup eno4-route
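
Note that named tables like eno1-route only resolve if they are declared in /etc/iproute2/rt_tables; a minimal sketch, assuming the numeric IDs 101-104 are not already in use on the host (run once, not on every boot):

echo "101 eno1-route" >> /etc/iproute2/rt_tables
echo "102 eno2-route" >> /etc/iproute2/rt_tables
echo "103 eno3-route" >> /etc/iproute2/rt_tables
echo "104 eno4-route" >> /etc/iproute2/rt_tables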

I will try this directly in netplan with something like…

netplan example

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [ IPv4_address1, "IPv6_address1" ]   # addresses need a prefix length, e.g. x.x.x.x/24
      routes:
        - to: 0.0.0.0/0
          via: GW1
          #metric: 40
          table: 101          # netplan expects a numeric routing table ID; 101 is just an example
      routing-policy:
        - from: IPv4_address1
          table: 101
      match:
        macaddress: port_MAC
      set-name: eno1
      gateway6: "GWv6"
      nameservers:
        addresses:
          - 1.1.1.1
          - 1.0.0.1
          - 8.8.8.8
          - 8.8.4.4
          - 2606:4700:4700::1111
          - 2606:4700:4700::1001
          - 2001:4860:4860::8888
          - 2001:4860:4860::8844
    eno2:
...

and so on for the next port on each adapter.
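
Once the netplan file is in place, something like this should apply and verify it (a sketch; 101 matches the numeric table ID used for eno1 above):

sudo netplan try        # applies the config and rolls back unless confirmed
sudo netplan apply      # applies it permanently
ip rule show            # the per-source lookup rules should be listed here
ip route show table 101 # the default route via GW1 for eno1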

Hopefully that works. The strange part is that the current setup was working without issues for about a year. You have to understand, this leaves me baffled.

Could it be possible that you made some changes a while ago and did not reboot the server (or did not restart/reload some reconfigured service), then you rebooted it recently, and now your past changes have been applied to your system?
Or maybe the last OS update changed the behavior of some services?

Maybe this is it. I have to investigate further, but it is definitely something on the server itself. I will post updates here in this topic, so that others who may have this problem can find a solution when I do, or maybe someone helps… Let’s see… :neutral_face:

No changes were made, but I think binding the routing to the MAC address may help, instead of having it bound to the interface (network card port) as it is right now.

For now, what I do not understand is how this can be a network/routing problem when it gets resolved by either:

  • restarting the service of the node that went offline
  • restarting the server (but this brings all nodes on the server down while it boots, so not good)

It does not make much sense. Another temporary solution would be, as mentioned above, to restart each node with cron, since a restart happens in less than a second, and to do it regularly, for example every hour. It won’t interrupt things that much: maybe only several requests, and that is not a problem because there is a retry function, right?
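
For reference, that hourly variant would be just one crontab line per node; a sketch, reusing the NODESERVICENAME placeholder used later in this topic:

# restart one node service at the top of every hour (root’s crontab)
0 * * * * /usr/bin/systemctl restart NODESERVICENAME.service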

Maybe a better solution would be to inspect the journal and, when an offline node is found, restart only that node’s service. I will be testing this now.

:warning: Temporary solution… Not a real solution, rather just a small Linux hack.

* * * * * /bin/journalctl --since "1 min ago" --until "now" -eu NODESERVICENAME | awk '/ERROR/ {a=1}; END { if (a == 1){system("systemctl restart NODESERVICENAME.service")} }'

The cron entry above should be added to the crontab of root or of a user who can control the node service. It does the following:

  1. Runs every minute
  2. Checks the journal for “ERROR” messages from a specific node
  3. Restarts that node if an error is found
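
One way to install it, assuming root’s crontab is used:

sudo crontab -e    # opens root’s crontab in an editor; paste the cron line above
sudo crontab -l    # verify the entry was saved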

Why SHOULD this work?

awk uses a pattern-action paradigm: /ERROR/ is the pattern and {a=1} is the action. After awk has processed all the journalctl output, the END section is executed. A simple if (a == 1) test determines whether one or more matches occurred, and if so, the systemctl restart NODESERVICENAME.service command is executed, restarting the node because an error was found in its log for the last 1 minute.
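
A toy illustration of that pattern/END logic, with the restart swapped for a print so it can be tested safely:

printf 'INFO ok\nERROR contact:service ping satellite failed\nINFO ok\n' \
  | awk '/ERROR/ {a=1}; END { if (a == 1) print "would restart the service" }'
# prints: would restart the service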

Let’s see what happens…

It appears it also restarts the node on

ERROR piecestore download/upload failed

These errors do not take the node offline, so it is OK not to restart the node on them, I take it.
@Alexey, is this OK with you, or should I modify it to this…

* * * * * /bin/journalctl --since "1 min ago" --until "now" -eu NODESERVICENAME | awk '/ERROR/ && /ping satellite failed/ {a=1}; END { if (a == 1){system("systemctl restart NODESERVICENAME.service")} }'

or even

* * * * * /bin/journalctl --since "1 min ago" --until "now" -eu NODESERVICENAME | awk '/ping satellite failed/ {a=1}; END { if (a == 1){system("systemctl restart NODESERVICENAME.service")} }'

After all, the nodes DO NOT GO OFFLINE now with this hack implemented on the source server. Even if they do, it is for less than a second - during the service restart.

Again… Why this works:

awk uses a pattern-action paradigm: /ERROR/ and /ping satellite failed/ together form the pattern and {a=1} is the action. After awk has processed all the journalctl output of the node it is checking, the END section is executed. A simple if (a == 1) test determines whether one or more matches occurred, and if so, the systemctl restart NODESERVICENAME.service command is executed, restarting the node because such an error was found in its log for the last 1 minute.

You shouldn’t restart the node every time something isn’t pingable; this will trigger a filewalker every time. You should fix the underlying issue, it is not the node itself.

Eh, if the node has enough RAM to cache the metadata of all blob files, and the OS keeps it cached, the file walker costs almost nothing even if you restart the node every few seconds :wink:
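
One rough way to check whether that metadata actually stays cached in RAM between restarts (just a sketch; slabtop needs root):

free -h                                    # the buff/cache column shows page cache usage
sudo slabtop -o | grep -E 'dentry|inode'   # dentry/inode slabs hold cached filesystem metadata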

I am not really sure how true that is. I have a Pi 4 with 8 GB of RAM and it goes through the filewalker in the same amount of time as my Pi 4 with 4 GB of RAM, and if I restart them a bunch of times it still triggers a filewalker and it takes about the same amount of time.

Maybe if you had a dedicated cache drive that statement would be true. PS: both Pis never use more than 1 GB of RAM at the start of storj, if that.

Really? So my “network problem” just got resolved by this

Today I woke up and found my “network problem” resolved. All nodes are online, simply because my cron script checks every minute whether each node got the ping satellite failed error, which was shutting them down, and restarts that node accordingly so it is fully operational again. Wow! What a network problem I have with port forwarding, routers, switches, ISPs, and so on. And it is definitely not the node itself, I tell you… Right! Right! Thanks to me! I really appreciate it! :rofl: :joy:

The server itself has 7 nodes…

2 x Intel(R) Xeon(R) CPU E5-2650L v2 @ 1.70GHz, 40 cores
64 GB RAM (8 x 8 GB @ 1600 MT/s)
2 x 12 x 4 TB 3.5-inch Seagate Constellation SAS HDDs (2 RAID cards; dual-domain MSA expandable storage with 12 LFF HDDs each; up to 4 can be attached, but now with just one)
8 x 1 GbE network ports, so it can even take 1 more node with a separate /24 network.

Should be enough. Used RAM almost never gets above 12-16 GB. It’s more than enough, really.

No need. I just use tmpfs mounts and all the temp and cache stuff is in RAM! I almost never turn the server off… only for major kernel updates, if needed, since the server uses livepatch.
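
For reference, a tmpfs mount is just an fstab entry; a minimal sketch, assuming a 2 GB cap for /tmp:

# /etc/fstab
tmpfs  /tmp  tmpfs  defaults,noatime,size=2g  0  0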

Indeed, this kind of hardware should allow you to restart the node even every few seconds without the file walker process having an impact on performance.

Indeed. Anyway, my solution is only a temporary one, but it works. I have to find the root cause of this. It was all working fine without this small hack until recently, so…