I’m checking for an “all healthy” flag returned in JSON when contacting each node over HTTP from a single monitoring cloud host, every 30 seconds.
Recently some nodes started returning 503. They work fine otherwise, including returning `allhealthy: true` when checked from another instance.
Has the storagenode implemented some sort of abuse blocking that could block my “abusive” monitoring host?
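For context, a minimal sketch of the kind of check I’m running. The dashboard port (14002), the endpoint path `/api/sno`, and the JSON flag name are from memory and may differ on your setup:

```python
import json
import urllib.error
import urllib.request

# Hypothetical health poll: fetch the node's dashboard JSON and inspect the
# health flag. Port, path, and flag name ("allHealthy") are assumptions.
def check_node(host: str, port: int = 14002, timeout: float = 10.0):
    url = f"http://{host}:{port}/api/sno"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = json.loads(resp.read())
            return bool(data.get("allHealthy")), "ok"
    except urllib.error.HTTPError as exc:  # a 503 lands here, not in resp
        return False, f"HTTP {exc.code}"
    except OSError as exc:  # connection refused, timeout, DNS failure, ...
        return False, str(exc)
```

Run from cron (or a loop with a 30-second sleep) and alert when the first element of the tuple is `False`.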
Haven’t tried. I’ll increase the interval. But it worked for years with these settings. I also have TCP port probes from the same host — and those succeed.
It does not seem to track to the tunnel: some nodes are directly connected, some via VPN (iptables masquerading to work around CGNAT).
Do you also have errors in the logs when this happens?
I suppose not. Usually a 503 happens when a reverse proxy is unable to contact the proxied service (so the client-facing part is working, but the needed backend service is not). I do not think the node has the same structure: it is a service itself and does not include a proxy; otherwise you would also see the same error in the node’s logs.
It could also be a web-server-to-backend chain, but for the node I doubt it. The only web server is the web dashboard.
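If it comes back, one way to tell whether a proxy or the node itself produced the 503 is to inspect the response headers. A sketch (`node.example`, the port, and the path are placeholders, and the exact headers vary by proxy):

```shell
# Dump only the response headers (-D -) and discard the body. A reverse
# proxy such as nginx usually stamps its own "Server:" header on an error
# page it generated itself; a response that made it through to the backend
# carries the backend's headers instead.
curl -s -o /dev/null -D - http://node.example:14002/api/sno
```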
You are right, there were no issues reported in the logs. `grep ERROR /var/log/storj.log | grep -v "download failed" | grep -v "manager closed" | grep -v "context deadline exceeded"` produced nothing.
Increasing the interval did not help.
The issue resolved itself over a few days without me doing anything: I did not restart either of the services. The only exception is any node update that could have happened in the meantime, but then all my nodes would have updated at the same time (I’m not using the official updater yet), and instead the HTTP checks started working again one by one over the course of a week or so.
I’m out of ideas, and since the issue is not present anymore, I won’t think about it until it comes back; then I’ll do more digging with tcpdump.
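For the record, the kind of capture I’d start with next time. The port (14002) and the `<monitoring-host>` placeholder are specific to my setup, and the command needs root:

```shell
# Print packet payloads as ASCII (-A) so an "HTTP/1.1 503" status line is
# visible directly in the capture; -n skips DNS lookups. Filter on the
# dashboard port and the monitoring host to keep the output readable.
sudo tcpdump -ni any -A 'tcp port 14002 and host <monitoring-host>'
```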
My guess is that a tunnel/proxy/VPN in front of the node caused it.
Because otherwise we would see a lot of complaints here if it were a storagenode issue. Furthermore, the satellite is unlikely to be pleased to receive a 503 error from the node. However, the latter statement may be incorrect, as the satellite doesn’t use HTTP but DRPC, and I don’t know whether it has an equivalent of the 503 error.