Node is slowly losing uptime

tylkomat · October 20, 2021, 3:34pm

I noticed my node is gradually losing uptime.
Yesterday evening it was at 92.72%. This morning it was at 91.31%, and now in the evening it is at 90.22%.

The affected satellites are ap1, eu1, us1 and us2.

I use the Grafana dashboard and it shows no outages in the graph.
What should I do to find out what's going on?

Stob · October 20, 2021, 3:39pm

Hi @tylkomat
Check the log files…

SGC · October 20, 2021, 3:43pm

losing 2.5% in a day seems high, the top 40% of the online score represents 12 days of downtime…

this would mean your node is almost totally disconnected without actually being disconnected… which makes my guess a firewall thing.

ofc this doesn’t mean that is the case, but getting down to 90% means you would have been recorded as offline for 3 days out of the last 30, which would be a lot… in any case.

Stob · October 20, 2021, 4:23pm

This is low if the node is fully disconnected.

For a total audit failure over 24 hours (2 x 12-hour windows) you would lose 100% / 30 days = (100% / 60 windows) x 2 windows = 3.333% of the score.
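
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming 12-hour windows over a 30-day period and that every fully missed window costs an equal share of the score; the function names are made up here and this is not the satellite's actual code.

```python
# Assumptions taken from the post: 12-hour audit windows, 30-day tracking period.
WINDOW_HOURS = 12
PERIOD_DAYS = 30


def windows_in_period(period_days=PERIOD_DAYS, window_hours=WINDOW_HOURS):
    """Number of audit windows in the tracking period (60 with the defaults)."""
    return period_days * 24 // window_hours


def score_drop_percent(offline_hours, period_days=PERIOD_DAYS, window_hours=WINDOW_HOURS):
    """Percentage points lost if the node fails every audit for offline_hours."""
    missed_windows = offline_hours / window_hours
    return 100.0 * missed_windows / windows_in_period(period_days, window_hours)


# One full day offline means 2 missed windows out of 60.
print(round(score_drop_percent(24), 3))
```

With these assumptions a 2.5% drop in a day would correspond to roughly 18 hours of missed windows, which is why it looks like a lot for a node that appears to be online.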

edited to clarify the calculation

SGC · October 20, 2021, 4:44pm

i’m just going to leave that there, unsure of the exact numbers.
pretty sure we can have 288 hours before getting suspended and that happens at 60% online score.

this would make 12 x 24 hours = 288 hours the top 40% of the online score.
40 divided by 12 is 3 and 1/3, so yeah, looks like you are right.

i don’t quite get how you derived that 100/30 number you start with tho, not that it matters…

what i was trying to say was that his 2.5% drop in a day is massive compared to what should be possible if his node was actually considered online.

ofc random fluctuation can throw that online score around on low data nodes, but that would usually happen in one chunk, rather than incrementally…

however long story short… i would still call it high… in his case.

littleskunk · October 20, 2021, 6:10pm

Your math only works for nodes that have been online for 30 days. If you start a new node and it is offline for 1 out of 2 days, that will be displayed as 50%. Please keep that in mind when you calculate how much downtime the percentage might reflect.
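
A rough sketch of that point in Python, under the simplifying assumption that the displayed score is just the online share of the days tracked so far, capped at 30 days. The function name and the day-level granularity are illustrative only; the real score is computed per window by the satellite.

```python
def displayed_online_score(node_age_days, offline_days, period_days=30):
    """Approximate online score in percent for a node of the given age.

    Simplified model: only the days the node has existed (at most
    period_days) count toward the denominator.
    """
    tracked_days = min(node_age_days, period_days)
    if tracked_days <= 0:
        raise ValueError("node_age_days must be positive")
    offline = min(offline_days, tracked_days)
    return 100.0 * (tracked_days - offline) / tracked_days


# New node, offline 1 of its 2 days, versus a 30-day-old node offline for 3 days.
print(displayed_online_score(2, 1))
print(displayed_online_score(30, 3))
```

So the same percentage can stand for very different amounts of downtime depending on how long the node has been tracked.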

tylkomat · October 20, 2021, 6:27pm

Could it be that the mentioned satellites (ap1, eu1, us1 and us2) have been updated to QUIC and the other 2 have not? I see QUIC timeouts in the log, but also some uploaded pieces on other satellites.

Alexey · October 20, 2021, 8:23pm

You can check when your node was offline from outside (did not respond to audit requests):

However, it will give you dates with some time intervals. You then need to check the firewall logs, the node’s logs and the router logs for those times to figure out why your node did not respond.