Hello!
If I remember and read correctly, a node gets disqualified if the audit score drops below 96%. But why are my audit scores only yellow and not red if my node is just a bit away from disqualification? Or am I wrong, and a node only gets suspended at 96%? If so, at which score is the node disqualified?
But the online score turns red way before you get disqualified. Why not the audit score? I know the system was changed, but maybe not the threshold? I know that because the storagenode website (the one on port 14002) changes “All Healthy” from true to false and triggers my monitoring for the online score, but not for the audit score.
And yes, my HDD is failing, hence the dropping audit score. My monitoring shows this too:
I have an Asus router, yes, but everything is disabled there. And the storage node runs over a VPN because of CGNAT. It is definitely my HDD failing, because the other two are fine. I am just curious why the dropping audit score does not raise any alarm.
But why not set “All Healthy” to false if Audit is below like 99 or 98%? Online Score sets it to false way before disqualification too. I think it would be beneficial because a node operator (that has active monitoring) will be able to repair the node before disqualification then.
Because “All Healthy” is not a long-term indicator, but a time-of-check one. If the node has recovered from issues that resulted in failed audits in the past, you would like to verify that fact, and the “All Healthy” status item is exactly this. If you want a long-term indicator, that’s the scores themselves.
Okay, but what is the point of “All Healthy” if it only triggers once my node gets disqualified? Then it’s useless. The online score works fine: it warns me before my node gets suspended, so I can fix the issue. The audit score does not.
No, a node does not have to be disqualified to not have an “All Healthy” status. If you lose connection to your ISP, you will be disqualified after 30 days, but you will get the “not healthy” signal likely within minutes. Again, time-of-check vs. long-term.
No, you get “All Healthy: False” if the online score drops below a set percentage (I think it was something like 94% or so). I don’t mean the connection timeout that would happen if the node goes offline. When the node gets into the red area, the site on :28967 changes its text from “all healthy: true” to “all healthy: false”, and at the same time the HTTP code changes from 200 “OK” to 500 “Internal Server Error” (or was it 503 “Service Unavailable”?).
EDIT: I mean the site on port :28967, not the dashboard on :14002.
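For anyone wiring this into monitoring, the behaviour described above can be checked with a small probe. This is only a sketch: the URL is passed in by the caller because the exact path is not given in this thread, the function names are made up for illustration, and any non-200 answer is treated as unhealthy since it is unclear above whether the failing code is 500 or 503.

```python
import urllib.error
import urllib.request


def interpret_health(status_code: int) -> bool:
    """Return True only for HTTP 200; 500, 503 and anything else count as unhealthy."""
    return status_code == 200


def probe(url: str, timeout: float = 10.0) -> bool:
    """Fetch the node's status page (URL supplied by the caller) and map it to a bool."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_health(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still available.
        return interpret_health(err.code)
    except (urllib.error.URLError, OSError):
        # No HTTP answer at all: the node or the network path is down.
        return False
```

Checking only the status code keeps the probe independent of the exact wording of the page, which may differ between versions.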
By the code, the AllHealthy condition is set to false in the following cases:
Node is disqualified at any satellite.
Node is suspended at any satellite.
Node has an online score of less than 0.9 at any satellite.
Node did not connect to any satellite.
So I stand corrected, the AllHealthy status is not just set to false because of short-term problems, but also some long-term. Which is weird, but the case I remembered (lack of connectivity) is there.
The satellite ping happens every hour by default, so you would see AllHealthy set to false after at most an hour.
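The four conditions listed above can be sketched roughly as follows. This is not the storagenode code, just a Python paraphrase of the list: the record type and field names are hypothetical, and "did not connect to any satellite" is read here as "no satellite was reached at all", which is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SatelliteStatus:
    # Hypothetical per-satellite record; field names are illustrative only.
    disqualified: bool
    suspended: bool
    online_score: float        # 0.0 .. 1.0
    connected: bool = True     # assumption: whether the last satellite ping succeeded
    audit_score: float = 1.0   # present only to show that it is not consulted


def all_healthy(satellites: Iterable[SatelliteStatus]) -> bool:
    sats = list(satellites)
    # Node did not connect to any satellite.
    if not any(s.connected for s in sats):
        return False
    for s in sats:
        # Disqualified, suspended, or online score below 0.9 at any satellite.
        if s.disqualified or s.suspended or s.online_score < 0.9:
            return False
    return True
```

Under this reading, a node with an audit score of 0.965 and an online score of 1.0 still reports healthy, which matches what was observed at the start of the thread.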
Thank you for providing the Code!
It would be nice to add the other two stats to the AllHealthy status as well. If, for example, pieces are failing and making a huge impact, the node is in fact not healthy. The same would make sense for suspension too. So why not factor those two in, instead of only the online score and the cases where it is already too late?
Connection problems are easy to monitor:
First you get emails about the offline node,
and monitoring (if you have any) would report the lost connection.
But why not suspension and audit? Those two would go unnoticed until it's too late.
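Until something like that exists in the node itself, the same idea can be approximated on the monitoring side. A minimal sketch, assuming the scores have already been collected per satellite as fractions between 0 and 1: the dictionary layout and satellite names are hypothetical, the 98% audit threshold comes from the suggestion earlier in this thread, and the suspension threshold is simply set to the same value as a placeholder. Neither is a value used by the storagenode software.

```python
from typing import Dict, List

# Assumed warning thresholds, not values taken from the storagenode software.
AUDIT_WARN = 0.98
SUSPENSION_WARN = 0.98


def score_warnings(scores: Dict[str, Dict[str, float]]) -> List[str]:
    """scores maps a satellite name to {"audit": ..., "suspension": ...} in 0.0..1.0."""
    warnings = []
    for satellite, s in sorted(scores.items()):
        if s["audit"] < AUDIT_WARN:
            warnings.append(
                f"{satellite}: audit score {s['audit']:.2%} below {AUDIT_WARN:.0%}"
            )
        if s["suspension"] < SUSPENSION_WARN:
            warnings.append(
                f"{satellite}: suspension score {s['suspension']:.2%} below {SUSPENSION_WARN:.0%}"
            )
    return warnings
```

An empty list means nothing to report; anything else can be forwarded to whatever alerting is already in place.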
TBH, I find the status of Storj node monitoring signals very messy. I wrote some code to collect signals from 5 endpoints to get a good-enough picture of node state, it would indeed be nice to have one good place to see everything.