Sudden spike in suspended Nodes

You can set it up in 30 min on a free cloud instance, e.g. Oracle.

From your logs:

"attempts": 1, 
"error": "ping satellite: check-in network: failed to resolve IP from address: Storj.xxxxxxx.org:28967, err: lookup Storj.xxxxx. .org on 10.111.0.10:53: server misbehaving", 

A few issues here. First, Storj.xxxxx.org: maybe don’t use mixed case. DNS names are case-insensitive, so it should not matter, but just in case.

Next, the satellite clearly could not resolve your (sub)domain. What DNS provider are you using?

Edit: looks like rzone.de. Never heard of it, and it looks like even its own root domain does not resolve! Magnificent service and attention to detail. I would switch to something else ASAP.

Edit 2: It appears the Storj subdomain is a CNAME to a56812xxxxxxxx.asuscomm.com

That explains it. Please switch to a real DDNS service.

(Lastly, I’m not sure what 10.111.0.10 is. Probably a red herring, maybe the satellite’s internal DNS server.)

I would suggest switching to a reliable DDNS provider (a large one like Cloudflare or Google Domains, not some rinky-dink small contraption) and running the inadyn updater on the same host as your storage node, to ensure they are both online at the same time.


Thanks so much. Sometimes I am really blind from all the logs and miss the obvious… even though it was the first entry. I checked several and they all seemed the same, but somehow I missed the first entry, which clearly says that the name can’t be resolved…

Yeah I totally think you found the problem.

About the CNAME: that is correct. I use the Storj subdomain as a CNAME that points to the asuscomm one. Do you by chance know whether the problem is at the level of the Storj domain’s provider or the asuscomm DDNS provider?

Would it help to use the asuscomm name directly and skip the extra CNAME step?

Edit:
rzone belongs to Strato, a paid provider that is actually pretty well known and big in Germany. I wonder if it’s really a problem on their end. So far I thought the problem would most likely be with asuscomm, as it’s a free provider that came with the Asus router.

Yeah, I added a few edits.

It’s likely Asus; rzone simply has to respond with a CNAME record, which is hard to mess up :slight_smile: I apologize for my unfounded accusations!

I’ve tried a few more things. Right now it is resolving for me (I’m in California) to an IPv4 address ending in ….xx2.109, and I see an instance of Nextcloud, but probing port 28967 fails. This would not cause the failure to resolve the domain, of course, but once that is fixed, this is the next obstacle to overcome, assuming the name even resolved to the correct address.

Does your router have some sort of smart “intrusion prevention”, or some sort of anti-DDoS thingy that may be easily spooked by the storagenode access pattern and randomly block access from the internet?

Btw I have an instance of Kuma running on Oracle watching my nodes. I can add your URL too; PM me the email address to send notifications to, at least until you resolve it or set up your own Kuma.


You are Amazing…

Your idea about intrusion detection… I originally disabled it on my router, but I did a firmware update not long ago, and that enabled it again…

I also went and set up noip DDNS as well, to monitor both and see if that makes it better.

Thanks for your offer with Kuma, I’ll send you a DM. With the new knowledge I now have and your external monitoring, I should quickly be able to see whether the noip DDNS works better and whether DNS is the root cause ^-^ I should be able to easily figure it out and fix it.


Hi,

I wanted to do a quick follow-up to explain the full extent of the problem and how it was fixed:

Short Version:
DNS did not respond correctly at least once, and the router’s DoS protection kicked in.
The Asus DDNS service was unstable, causing occasional additional problems, and had to be switched to noip.

Longer Version:
Two problems could be identified.

  1. The log showed that the DNS did not answer on at least one occasion.
  2. The router itself had DoS protection activated.

It seems that the second problem was the main cause of the missed online checks. The access pattern of Storj, in conjunction with the other services in my network, caused my router to hold back packets and only answer with extremely high delays. UptimeRobot sometimes showed ping times of 15 seconds. The delay could get so big that the online checks failed.

The problem was hard to identify because my router (Asus AX56U) had two settings related to DoS protection. One was an AI DoS protection that I had identified earlier and disabled. The second setting was separate and located directly in the firewall tab. At the same time the router was not overloaded: CPU usage was 1-5% and RAM also had spare capacity. As such I didn’t blame the router initially. After disabling the second DoS protection setting, the UptimeRobot delays dropped immediately and are now at 100-200 ms instead of 5000-15000 ms. Access to other services like Nextcloud also improved drastically.

After this change I haven’t missed any uptime checks, and no more log entries related to “Service Ping Satellite Failed” have appeared. As such I will not switch away from my DNS provider (Asus) immediately, but will rather run both noip and Asus in parallel and monitor the DNS queries in a constant loop, looking for problems (a PowerShell script that does an nslookup on both the noip and the Asus DDNS name every few seconds and logs failures). If noip seems more stable than Asus, I will update this post and switch to it. Otherwise I will stay with Asus.
Edit: For 17 days Asus ran stably and did not miss an online check, but then today Asus just died for around 2 hours while noip stayed stable. So I have now switched all my home server stuff, including Storj, to noip.
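
For anyone who wants to run the same kind of check on Linux or macOS, here is a rough shell sketch of such a loop. It is not the PowerShell script mentioned above; it assumes `dig` is installed, and the function names, the log file name and the hostnames in the example are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a DNS failure logger. All names here are made up.
# DIG can be overridden, e.g. to point at a different dig binary.
DIG=${DIG:-dig}

# check_name NAME
# Returns 0 if the system resolver returns at least one IPv4 address for NAME,
# otherwise prints a timestamped FAIL line and returns 1.
check_name() {
    if "$DIG" +short +time=2 +tries=1 "$1" 2>/dev/null |
        grep -Eq '^[0-9]+(\.[0-9]+){3}$'; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') FAIL $1"
    return 1
}

# monitor INTERVAL_SECONDS LOGFILE NAME...
# Checks every NAME in an endless loop and appends failures to LOGFILE.
monitor() {
    interval=$1
    logfile=$2
    shift 2
    while true; do
        for name in "$@"; do
            check_name "$name" >> "$logfile" || true
        done
        sleep "$interval"
    done
}

# Example (placeholder hostnames):
# monitor 5 dns-failures.log mynode.ddns.example mynode.asuscomm.example
```

The check only counts a lookup as successful when an actual IPv4 address comes back, so a CNAME that resolves to nothing, or a timeout message from dig, is logged as a failure.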


This is likely useless, since you have a DNS cache service running. Perhaps you need to check against other, preferably random, DNS servers.


@Alexey thanks for the heads-up. A very important thing to be aware of.
At least on Windows I can run “ipconfig /flushdns”.

But that still leaves the cache in the closest DNS server that I will likely be asking (or even the router, if I let my PC use it as a cache).
Difficult :thinking: I’ll look for a distributed monitoring service like DNS-Checker, just with better logging of results. And Storj’s logs at least also give me an error if the currently configured DNS fails.
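
One way to sidestep the local cache, at least partly, is to ask several public resolvers directly instead of the system one. Below is a minimal sketch of that idea, assuming `dig` is available; `query_resolvers` and the hostname in the example are hypothetical names, and the public resolvers keep their own caches too, so this only takes the PC and the router out of the picture.

```shell
#!/bin/sh
# Hypothetical sketch: ask several resolvers for the same name and compare.
DIG=${DIG:-dig}

# query_resolvers NAME RESOLVER...
# Prints one "resolver<TAB>answer" line per resolver. The answer is the first
# IPv4 address returned, or FAIL when no address comes back.
query_resolvers() {
    name=$1
    shift
    for r in "$@"; do
        ip=$("$DIG" +short +time=2 +tries=1 "$name" "@$r" 2>/dev/null |
             grep -E '^[0-9]+(\.[0-9]+){3}$' | head -n 1)
        printf '%s\t%s\n' "$r" "${ip:-FAIL}"
    done
}

# Example (placeholder hostname; 1.1.1.1, 8.8.8.8 and 9.9.9.9 are the public
# resolvers of Cloudflare, Google and Quad9):
# query_resolvers mynode.ddns.example 1.1.1.1 8.8.8.8 9.9.9.9
```

If one resolver prints FAIL or a different address while the others agree, that points at the DNS side rather than at the node.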

Current update: No missed online checks, and higher egress and ingress as well since the fix. So maybe it won’t even be necessary to pursue this further as long as everything stays stable.


DDNS-updated records usually have a very low TTL on purpose, so caching should not be a problem. (As long as every server in the chain honors that TTL, of course)
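
If you want to see which TTL your records are actually served with, something like the following works as a quick check. It is only a sketch: `answer_ttls` is a made-up helper that reformats the answer section printed by `dig +noall +answer`, and the hostname in the example is a placeholder. Keep in mind that a caching resolver reports the remaining TTL, so query the authoritative nameserver if you want the configured value.

```shell
#!/bin/sh
# Hypothetical helper: list the TTL of every record in a dig answer section.

# answer_ttls
# Reads `dig +noall +answer NAME` output on stdin (fields: name, TTL, class,
# type, data) and prints "TTL TYPE NAME" for each record.
answer_ttls() {
    awk '$3 == "IN" { print $2, $4, $1 }'
}

# Example (placeholder hostname):
# dig +noall +answer mynode.ddns.example | answer_ttls
```

With a CNAME in front of the DDNS name you get one line per hop, which makes it easy to spot a long TTL on either record.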

I would like to bring up another downside of using shoddy “free” DNS providers like asuscomm: response time.

When Storj clients download a file, they send more requests than needed, and once enough data is received, the remaining transfers are cancelled and go unpaid.

So, if one node is resolved in 10 ms and starts pumping data, and another takes 2 seconds just to get to the IP, guess which one will be winning races and which one will be losing money?

I took the liberty of running a very basic test (you can replicate it yourself) using an asuscomm-hosted DNS name (redacted) and one of my Cloudflare-hosted DNS names that I use for my nodes.

I ran the commands below side by side for an hour. Twice during that period I saw weird latency spikes on asuscomm, lasting 2-3 minutes, and none on Cloudflare. (My system resolver is NextDNS. I saw similar spikes with the Google resolver (add @8.8.8.8 to the dig command), but I did not bother running them side by side again.)

Just something to consider.

ASUSCOMM:

~ % while true; do dig xxxx_redacted_xxxxx.asuscomm.com | grep "Query time" ; done
;; Query time: 2030 msec
;; Query time: 2023 msec
;; Query time: 18 msec
;; Query time: 48 msec
;; Query time: 1018 msec
;; Query time: 53 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 26 msec
;; Query time: 48 msec
;; Query time: 19 msec
;; Query time: 49 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 837 msec
;; Query time: 52 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 20 msec
;; Query time: 50 msec
;; Query time: 41 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 33 msec
;; Query time: 52 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 917 msec
;; Query time: 52 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 41 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 44 msec
;; Query time: 44 msec
;; Query time: 47 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 45 msec
;; Query time: 40 msec
;; Query time: 43 msec
;; Query time: 49 msec
;; Query time: 44 msec
;; Query time: 46 msec
;; Query time: 44 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 46 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 49 msec
;; Query time: 52 msec
;; Query time: 49 msec
;; Query time: 24 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 18 msec
;; Query time: 1013 msec
;; Query time: 51 msec
;; Query time: 27 msec
;; Query time: 26 msec
^C

Cloudflare:

% while true; do dig <my_cloudflare_DDNS_name> | grep "Query time"; done
;; Query time: 42 msec
;; Query time: 31 msec
;; Query time: 48 msec
;; Query time: 44 msec
;; Query time: 48 msec
;; Query time: 51 msec
;; Query time: 45 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 45 msec
;; Query time: 44 msec
;; Query time: 45 msec
;; Query time: 44 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 48 msec
;; Query time: 43 msec
;; Query time: 48 msec
;; Query time: 26 msec
;; Query time: 47 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 49 msec
;; Query time: 46 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 47 msec
;; Query time: 47 msec
;; Query time: 45 msec
;; Query time: 42 msec
;; Query time: 45 msec
;; Query time: 44 msec
;; Query time: 46 msec
;; Query time: 47 msec
;; Query time: 43 msec
;; Query time: 45 msec
;; Query time: 44 msec
;; Query time: 22 msec
;; Query time: 49 msec
;; Query time: 43 msec
;; Query time: 47 msec
;; Query time: 46 msec
;; Query time: 43 msec
;; Query time: 44 msec
;; Query time: 44 msec
;; Query time: 43 msec
;; Query time: 26 msec
;; Query time: 46 msec
;; Query time: 45 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 45 msec
;; Query time: 45 msec
;; Query time: 42 msec
;; Query time: 46 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 49 msec
;; Query time: 43 msec
;; Query time: 56 msec
;; Query time: 50 msec
;; Query time: 43 msec
;; Query time: 47 msec
;; Query time: 42 msec
;; Query time: 42 msec
;; Query time: 65 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 41 msec
;; Query time: 46 msec
;; Query time: 42 msec
;; Query time: 44 msec
;; Query time: 44 msec
;; Query time: 43 msec
;; Query time: 44 msec
;; Query time: 43 msec
;; Query time: 47 msec
;; Query time: 45 msec
;; Query time: 45 msec
;; Query time: 44 msec
;; Query time: 44 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 44 msec
;; Query time: 43 msec
;; Query time: 44 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 43 msec
;; Query time: 42 msec
;; Query time: 43 msec
;; Query time: 45 msec
;; Query time: 45 msec
;; Query time: 43 msec
;; Query time: 43 msec
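
To compare runs like the two above without eyeballing hundreds of lines, the "Query time" lines can be piped into a small summary. This is just a sketch: `summarize_query_times` is a made-up name, and the 500 ms default for what counts as a slow lookup is an arbitrary assumption that you can override with the first argument.

```shell
#!/bin/sh
# Hypothetical helper: summarize the ";; Query time: N msec" lines from dig.

# summarize_query_times [THRESHOLD_MS]
# Reads dig output on stdin and prints the sample count, average, maximum and
# the number of lookups slower than THRESHOLD_MS (default 500).
summarize_query_times() {
    awk -v limit="${1:-500}" '
        /Query time:/ {
            t = $4 + 0
            n++
            sum += t
            if (t > max) max = t
            if (t > limit) slow++
        }
        END {
            if (n == 0) { print "no samples"; exit }
            printf "samples=%d avg=%.0fms max=%dms over_%dms=%d\n",
                   n, sum / n, max, limit, slow
        }'
}

# Example: save the loop output to a file, then summarize it.
# summarize_query_times 500 < asuscomm.txt
```

The count of slow lookups is the interesting number here, since the average hides short spikes.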