V1.148.3: Port probe shenanigans

Since version v1.148.3 I get a lot of noise from my connectivity monitors (both Kuma, hosted on an Oracle VPS, and Uptime Robot): every node, both directly connected ones and those behind VPN bridges, appears to intermittently fail the port connectivity test as well as the http:// request on the DRPC port.

Nothing changed in the environment except the storagenode version and the FreeBSD jail version (since 13.2 is no longer supported, I had to upgrade).

My checks were configured to run every 30 seconds, with 3 retries 30 seconds apart. I increased the number of retries to 10, 10 seconds apart, and this seems to have quieted things down a bit.
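For reference, such a monitor check is roughly equivalent to this sketch (a hypothetical illustration, assuming nc is available; the host, port, and retry values are placeholders, not my actual node address):

```shell
# Probe a TCP port, retrying N times before declaring the node down.
# This approximates what Kuma / Uptime Robot do on each check.
probe_with_retries() {
    host=$1; port=$2; retries=$3; delay=$4
    i=0
    while [ "$i" -lt "$retries" ]; do
        if nc -z -w 5 "$host" "$port" 2>/dev/null; then
            return 0        # port answered
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1                # all attempts failed
}

# Example: 10 retries, 10 seconds apart, against the default node port:
# probe_with_retries my-node.example.com 28967 10 10 || echo "node down"
```

The point of the extra retries is only to mask short outages; a single failed attempt still means a real customer request could have been dropped at that moment.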

There are no warnings or errors in the logs (other than the expected failures from piecestore).

Note, the intermittent connectivity flakiness existed before, but it was maybe a once-a-month event. Now it's happening a few times a day.

Are you receiving “Your Node has Gone Offline” email from hello@storj.io?

No. That email only comes after a few hours of the node being offline. I'm not sure if the online score drops or not. I can monitor it for a while.

These, on the other hand, are intermittent failures where the node refuses to respond to a port probe or an HTTP request. Retrying the request 10 times seems to help – but this may also mean it's rejecting actual customers.

It's not related to the ISP either – these nodes are on at least 5 different ISPs, and the issue tracks to the node version change and the underlying Go/dependency changes.

I used to have similar-looking problems when my router's tables got overwhelmed by traffic. Are you sure the jail upgrade did not introduce any changes in routing/filtering? What is the failure mode (timeout/reset/refused)?


Mostly request timeouts. Occasionally – ECONNREFUSED.

I had 4 nodes in 13.2 jails and 5 in 13.4 jails.
Only those on 13.2 were upgraded to 13.5. The 13.4 ones are untouched.

But the issue is now occurring on all of them.

Which means it's something with the node software itself and not the environment – otherwise, the 13.4-based ones would have continued working normally.


I had similar problems with Oracle free instances. Adding some artificial CPU load solved the problem.

Interesting. Can you elaborate?

These are all separate physical servers, though, running the same things they've been running before. On two of them, nothing changed except the storagenode version. Three of them do connect via an Oracle instance, but the issue is also happening on directly connected ones.

About two months ago I noticed online scores slowly going down for nodes using Oracle instances for their connection. I found the Oracle instances randomly going offline for a few minutes at a time. I don't know the reason, but adding some load solved the problem. I use this crontab entry to create artificial CPU and memory load:

* * * * * stress --vm 1 --vm-bytes 256M --timeout 30s > /dev/null 2>&1

Ah. This is a different and known issue. Not really an “issue” but rather policy: Oracle occasionally suspends idle free instances. The VPN is a very light load, so the instance appears idle. The solution is either to have them do some work, as you did, or to give Oracle a credit card and convert your account to pay-as-you-go.

I prefer the latter solution, because adding artificial load to instances you don't pay for defies common sense.


This is getting ridiculous.

Here is how my monitor was unable to contact the node between 13:40 and 13:57.

And yet, uploads and downloads during that time kept going:

% grep "T13:48" /var/log/storj.log  | grep upload | wc -l
      73
% grep "T13:48" /var/log/storj.log  | grep download | wc -l
    1183
% grep "T13:50" /var/log/storj.log  | grep upload | wc -l
      99
% grep "T13:50" /var/log/storj.log  | grep download | wc -l
    1179
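The same per-minute tallies can be scripted; this is a sketch, assuming the same log path and "T13:48"-style timestamps as the greps above (the function name is mine, and LOG can be overridden to point elsewhere):

```shell
# Count uploads and downloads for a given minute in the node log.
LOG=${LOG:-/var/log/storj.log}
count_minute() {
    up=$(grep "$1" "$LOG" | grep -c upload)
    down=$(grep "$1" "$LOG" | grep -c download)
    echo "$1 uploads=$up downloads=$down"
}

# Example:
# count_minute T13:48
```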

Compared to earlier today:

storj# grep "T10:50" /var/log/storj.log  | grep upload | wc -l
      91
storj# grep "T10:50" /var/log/storj.log  | grep download | wc -l
    1188

Does storagenode implement some abuse protection?

I'm going to shut down my connectivity monitors and rely on the storj email to tell me when the node actually goes offline. I received over 200 emails from my monitors today (combined from Kuma and Uptime Robot – which rules out a monitor issue). This defeats the point.

This may not have anything to do with anything: but a lot of places just had a time change. Any chance a system time is off by an hour someplace?
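A quick way to check for that kind of skew is to compare the local clock against a trusted reference; the helper below is illustrative (the reference timestamp would come from an NTP query such as ntpdate -q, which reports the offset without setting the clock):

```shell
# Report how many minutes the local clock differs from a trusted
# reference time. A DST-style skew shows up as roughly 60 minutes.
offset_minutes() {
    ref=$1; local_now=$2
    echo $(( (local_now - ref) / 60 ))
}

now=$(date +%s)
# ref would come from an NTP source, e.g. parsing `ntpdate -q pool.ntp.org`:
# offset_minutes "$ref" "$now"
```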


Good one! Time on nodes is indeed one hour forward!

And yet, it still had uninterrupted, normal traffic during the adjusted time:

storj# grep "T14:48" /var/log/storj.log  | grep upload | wc -l
     105
storj# grep "T14:48" /var/log/storj.log  | grep download | wc -l
    1297

Thank you for this info. So this is not an issue for PAYG accounts – I was wondering why I had not noticed this problem. Thanks!

The issue persists. There is no interruption in node traffic when it appears to stop responding to probes. I've hidden the issue with a 10-minute grace period on the monitor. It's not something I'm interested in debugging.

Maybe you pissed someone off and they're getting back at you with a kindly DDoS attack?
Or your systems got some malware… maybe something you run on all of them picked up a supply-chain infection?
Something could show a lot of IOPS… CPU, RAM, drives, system drive.

Perhaps try dumping the traffic from Uptime Robot with tcpdump and see if there are any clues?


I guess I could run tcpdump for a day, matching my Oracle instance address, and then reconcile it with the reported failures and node activity. I'll try and report back.
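A capture along these lines would do it; this is a sketch where IFACE and MONITOR_IP are placeholders for the WAN interface and the monitor's address, and 28967 is the default storagenode external port:

```shell
# Build a capture command: only packets to/from the monitor on the node
# port, written to a pcap for later reconciliation with the alerts.
IFACE=igb0
MONITOR_IP=203.0.113.10
CAPTURE="tcpdump -i $IFACE -n -tttt -w /var/tmp/probes.pcap host $MONITOR_IP and port 28967"
echo "$CAPTURE"

# Run it for a day, then read the capture back with:
#   tcpdump -n -r /var/tmp/probes.pcap
```

With timestamps in the pcap (-tttt), each monitor-reported failure can be checked against whether a SYN from the monitor ever reached the node at that moment.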

I'm sure there is a very stupid and prosaic explanation for what's happening, not involving cross-country organized crime. I just haven't found it yet.

So, I looked at tcpdump. And what do you know: when Kuma cannot ping, tcpdump does not receive anything. And yet, the internet on the node works – it receives and sends traffic.

So, then I looked at /var/log/messages… And what do I see?

Cosmic disgrace, that's what
Mar 12 17:34:51 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 17:39:00 storagenode inadyn[23648]: Communication with checkip server 1.1.1.1 failed, run again with 'inadyn -l debug' if problem persists
Mar 12 17:39:00 storagenode inadyn[23648]: Retrying with built-in 'default', http://ifconfig.me/ip ...
Mar 12 17:39:00 storagenode inadyn[23648]: Please note, https://1.1.1.1/cdn-cgi/trace seems unstable, consider overriding it in your configuration with 'checkip-server = default'
Mar 12 18:09:51 storagenode inadyn[23648]: Communication with checkip server 1.1.1.1 failed, run again with 'inadyn -l debug' if problem persists
Mar 12 18:09:51 storagenode inadyn[23648]: Retrying with built-in 'default', http://ifconfig.me/ip ...
Mar 12 18:09:51 storagenode inadyn[23648]: Please note, https://1.1.1.1/cdn-cgi/trace seems unstable, consider overriding it in your configuration with 'checkip-server = default'
Mar 12 18:11:55 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.18.0.0
Mar 12 18:12:11 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 18:14:15 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.193.135.243
Mar 12 18:14:29 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 18:18:36 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.18.0.0
Mar 12 18:18:46 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 18:20:51 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.193.135.243
Mar 12 18:21:01 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 18:43:35 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.18.0.0
Mar 12 18:43:39 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 18:45:44 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.193.135.243
Mar 12 18:45:52 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 19:20:54 storagenode inadyn[23648]: Communication with checkip server 1.1.1.1 failed, run again with 'inadyn -l debug' if problem persists
Mar 12 19:20:54 storagenode inadyn[23648]: Retrying with built-in 'default', http://ifconfig.me/ip ...
Mar 12 19:20:55 storagenode inadyn[23648]: Please note, https://1.1.1.1/cdn-cgi/trace seems unstable, consider overriding it in your configuration with 'checkip-server = default'
Mar 12 19:43:35 storagenode inadyn[23648]: Communication with checkip server 1.1.1.1 failed, run again with 'inadyn -l debug' if problem persists
Mar 12 19:43:35 storagenode inadyn[23648]: Retrying with built-in 'default', http://ifconfig.me/ip ...
Mar 12 19:43:36 storagenode inadyn[23648]: Please note, https://1.1.1.1/cdn-cgi/trace seems unstable, consider overriding it in your configuration with 'checkip-server = default'
Mar 12 19:56:00 storagenode inadyn[23648]: Communication with checkip server 1.1.1.1 failed, run again with 'inadyn -l debug' if problem persists
Mar 12 19:56:00 storagenode inadyn[23648]: Retrying with built-in 'default', http://ifconfig.me/ip ...
Mar 12 19:56:00 storagenode inadyn[23648]: Please note, https://1.1.1.1/cdn-cgi/trace seems unstable, consider overriding it in your configuration with 'checkip-server = default'
Mar 12 20:41:15 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.18.0.0
Mar 12 20:43:25 storagenode inadyn[23648]: Update forced for alias node-1.arrogantrabbit.com, new IP# 104.193.135.243
Mar 12 20:43:33 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 21:47:37 storagenode inadyn[23648]: Communication with checkip server 1.1.1.1 failed, run again with 'inadyn -l debug' if problem persists
Mar 12 21:47:37 storagenode inadyn[23648]: Retrying with built-in 'default', http://ifconfig.me/ip ...
Mar 12 21:47:37 storagenode inadyn[23648]: Please note, https://1.1.1.1/cdn-cgi/trace seems unstable, consider overriding it in your configuration with 'checkip-server = default'
Mar 12 22:28:55 storagenode inadyn[23648]: Communication with checkip server 1.1.1.1 failed, run again with 'inadyn -l debug' if problem persists
Mar 12 22:28:55 storagenode inadyn[23648]: Retrying with built-in 'default', http://ifconfig.me/ip ...
Mar 12 22:28:55 storagenode inadyn[23648]: Please note, https://1.1.1.1/cdn-cgi/trace seems unstable, consider overriding it in your configuration with 'checkip-server = default'
Mar 12 22:43:13 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.18.0.0
Mar 12 22:43:25 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 22:45:29 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.193.135.243
Mar 12 22:45:37 storagenode inadyn[23648]: Updating IPv4 cache for node-1.arrogantrabbit.com
Mar 12 23:22:46 storagenode inadyn[23648]: Communication with checkip server 1.1.1.1 failed, run again with 'inadyn -l debug' if problem persists
Mar 12 23:22:46 storagenode inadyn[23648]: Retrying with built-in 'default', http://ifconfig.me/ip ...
Mar 12 23:22:46 storagenode inadyn[23648]: Please note, https://1.1.1.1/cdn-cgi/trace seems unstable, consider overriding it in your configuration with 'checkip-server = default'
Mar 13 00:45:08 storagenode inadyn[23648]: Update needed for alias node-1.arrogantrabbit.com, new IP# 104.18.0.0
Mar 13 00:47:12 storagenode inadyn[23648]: Received status 524, don't know what that means.
Mar 13 00:47:12 storagenode inadyn[23648]: Zone 'arrogantrabbit.com' not found.
Mar 13 00:47:12 storagenode inadyn[23648]: Error response from DDNS server, exiting!
Mar 13 00:47:12 storagenode inadyn[23648]: Error code 48: DDNS server response not OK

What the hell is 104.18.0.0?

Turns out, inadyn was using Cloudflare's https://1.1.1.1/cdn-cgi/trace, which intermittently returned 104.18.0.0 (a Cloudflare network address) instead of my public IP. Inadyn accepted that value and pushed it to the A record, then later corrected it when the fallback (ifconfig.me) returned the real IP, causing repeated flips between valid and bogus addresses.

Apparently, storj caches the IP on some other cadence that I guess was luckily coincident enough (or maybe it disregards bad updates that don't have a node on the other side – smart!) – this explains why the node kept working. And my probe monitors were using the FQDN, which was flipping like a turd in the wind.

Forcing a stable checkip provider (e.g. checkip-server = ifconfig.me, or default) and clearing the inadyn cache eliminates the issue.
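For anyone hitting the same thing, the relevant piece of inadyn's config looks roughly like this – a sketch only, where the path and the provider block details (zone name, hostname, token) are placeholders for whatever your setup already uses:

```
# /usr/local/etc/inadyn.conf (typical FreeBSD path)
period = 300
# Stop using https://1.1.1.1/cdn-cgi/trace; "default" falls back to
# inadyn's built-in checker (http://ifconfig.me/ip, per the logs above)
checkip-server = default

provider cloudflare.com {
    hostname = node-1.arrogantrabbit.com
    username = arrogantrabbit.com
    password = <api token>
}
```

Inadyn keeps the last pushed IP in a cache (typically under /var/cache/inadyn/); removing those cache files before restarting forces a fresh, correct update of the A record.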

Mystery solved.

:melting_face:

Big thanks to @mike


There's still the mystery of why this problem correlated so well with your jail upgrades.

What I learned from this… is to just wait for the “Your Node has Gone Offline” emails from hello@storj.io. No emails = no problems!
