OnlineScore went to zero in one day

dugwood · March 16, 2022, 6:32am

So the audit score is only related to successful audits? Hence it’s always 100%…

Again, there’s a big issue. I’m sorry but if I have to restart my node to:

receive emails stating that my node is offline, then suspended, then disqualified
know that online score is zero (but audit shows onlineScore=1)
I don’t have the «right» to restart node every single day (else I get bad reputation)
then it’s not a viable project…

dugwood · March 16, 2022, 6:36am

{
  "id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6",
  "auditHistory": [
    {
      "windowStart": "2022-02-12T00:00:00Z",
      "totalCount": 7,
      "onlineCount": 0
    },
    {
      "windowStart": "2022-02-12T12:00:00Z",
      "totalCount": 5,
      "onlineCount": 0
    },
    {
      "windowStart": "2022-02-13T00:00:00Z",
      "totalCount": 10,
      "onlineCount": 0
    },

But onlineScore was 100%!!

dugwood · March 16, 2022, 6:38am

My test is:

for sat in `docker exec -i storagenode wget -qO - localhost:14002/api/sno | jq .satellites[].id -r`; do docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/$sat | jq .id,.audits; done

I’ve used jq '{id: .id, auditHistory: [.auditHistory.windows[] | select(.totalCount != .onlineCount)]}' for the dump above.

Am I testing the wrong metric? Or the metric IS wrong? Of course I can’t test the previous status, as node was restarted…

Alexey · March 16, 2022, 6:42am

The audit score is related only to successful audits.
The suspension score is related to unknown errors during auditing.
The online score is related to the number of answered audits (successful or unsuccessful, but answered).

Emails are triggered only on check-in; this is a known issue: “[storagenode] The email about suspension is coming only after check-in to satellite” (Issue #4235, storj/storj on GitHub).
I would suggest configuring “[Tech Preview] Email alerts with Grafana and Prometheus” to get alerts in time.

The audit score isn’t affected by the online score; they are independent. You can have a perfect audit score and an online score of 0. There is a presumption of innocence: as long as your node doesn’t fail audits, it is considered perfect.

You do not need to restart your node. Just make sure that your firewall doesn’t block traffic to/from your node. Ideally it should have only incoming rules that allow connections to port 28967 TCP and UDP, and no outgoing rules at all (i.e. all outgoing traffic is allowed).

Again: the online score was 100%, then your node went offline. Scores are calculated on the satellites. While your node is offline, it can’t receive updates. You restarted your node, it was able to check in on the satellites, and it received an update saying that your online score is actually zero.
As long as the satellites are unable to contact your node, the score will not recover.
You need to fix the offline issue.

dugwood · March 16, 2022, 6:51am

I’ve got 7 nodes, all are monitored the same way: UptimeRobot (on port 28967), internal check (from one node to the other), so nodes were indeed online.

How can one monitor that a node is working, given that:

ports were opened and listening
audits/scores seem to be useless (as offline node doesn’t get new metrics)

So you’re saying that online score is related to successful audits, but, as you can see:

{
      "windowStart": "2022-02-27T12:00:00Z",
      "totalCount": 24,
      "onlineCount": 0
    }

=> February 27th is clearly bad!

But the onlineScore from my self audit (post #15 in this thread, “OnlineScore went to zero in one day”) says that onlineScore is almost perfect (the server was restarted 3 days ago, so February 27th should have pulled the score well below 0.9998307952622674).

When is the onlineScore computed? What data is used? Is it onlineCount / totalCount? If so, there’s a computing error! I should have seen the onlineScore going down slowly, but I didn’t.

Can I send you my audit logs or something so that you can tell me what’s wrong?

I should be able to monitor LOCALLY if the node is not working. onlineCount seems to be an option, but onlineScore isn’t!

Alexey · March 16, 2022, 6:59am

The online score is related to any answered audit. It doesn’t matter whether the audit was successful or failed; the main point is that it was answered, no matter what the result was.
The suspension score is related to unknown errors during an audit: the node has to be online to answer, but it answers audit requests with an unknown error. It’s affected immediately.
The audit score is related to known errors during an audit. It’s affected immediately.

There were 24 audit requests and zero answered audit requests, which means the node was offline. Perhaps traffic from/to this satellite is blocked.
Make sure that you do not have outbound firewall rules. Make sure that your inbound firewall rules allow connections to port 28967 TCP+UDP and the IP of your node.
The online score is calculated over a 30-day window; see “How is the online score calculated?” in the Storj Docs.

Sending the logs is not needed. The network configuration has issues. Please check everything: the rules on your router, the rules on your firewall, the rules on your server.
The onlineScore depends on onlineCount and totalCount.
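
To make that dependency concrete, here is a minimal sketch that estimates an online score from the auditHistory windows exposed by the node API. It assumes the score is the plain average of onlineCount / totalCount over the windows that received at least one audit; that formula is an assumption for illustration, and the satellite's actual calculation may differ. The function name estimate_online_score is hypothetical.

```shell
#!/usr/bin/env bash
# Estimate an online score from one satellite's JSON, read from stdin.
# ASSUMPTION: score = mean of (onlineCount / totalCount) over windows with
# at least one audit. This is an illustration, not the satellite's code.
estimate_online_score() {
  jq -r '
    [ .auditHistory.windows[]?
      | select(.totalCount > 0)
      | .onlineCount / .totalCount ]
    | if length == 0 then "n/a" else (add / length) end
  '
}
```

It can be fed from the same call used elsewhere in this thread, for example `docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/$sat | estimate_online_score`. A value far below the onlineScore reported by the API suggests the locally cached score is outdated.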

dugwood · March 16, 2022, 7:07am

Then, there IS a bug. That’s all I’m saying from the beginning of this very thread!

If onlineScore is dependent on onlineCount and totalCount, it should have been zero LONG before I restarted the node!

My monitoring includes the following:

for sat in `docker exec -i storagenode wget -qO - localhost:14002/api/sno | jq .satellites[].id -r`; do docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/$sat | jq .id,.audits; done

=> are we okay on the fact that those calls are external?

Or perhaps the big issue here is that the satellites don’t answer. So the audits are failing, but nothing is going BACK to my node (audit data is only on the satellites; if the satellite doesn’t answer, I don’t have any log, so «it seems fine»).

Either way:

if onlineScore is computed locally, there’s a bug. Going from 0.999 to 0 just because I’ve restarted clearly states that onlineScore wasn’t computed from onlineCount and totalCount (or was broken and not updated in audits)
if it’s dependent on external calls, and those calls are faulty, there should be another local metric to see this. For instance, the windows metrics have dates and the like, so if I don’t see any audit for the day before, it means that the node has a big issue.

As for the 30 days window: if it’s the case, it can’t go from 1 to 0 in one day. Plain and simple.

[EDIT] Can I create a simple test? For example, I cut the network for one node, then wait until the metrics change? Just to check if the issue is still there… But I only restart nodes when there are Linux kernel updates, so about 6 times a year.

Alexey · March 16, 2022, 7:25am

An audit is considered failed only if the node answers the audit request but returns an error such as “file not found”, or returns a corrupted piece.
If the audit request is not answered, the node is considered offline (not failed), and this affects the online score.
If the satellite doesn’t answer, this means that something in between is blocking the traffic.

All scores are computed on the satellites, and updated on the node during interactions. If the node is offline (is not able to answer on audit requests), the scores likely will not be updated until the next check-in.
There are two events:

Node check-in, regulated by the parameter --contact.interval, 1h by default.
Satellite checks (audits): they are unilateral. You can see when the node was checked by using the script to catch missed audits.

See also

storagenode setup --help

for all options and their default values.

You can test everything you need. The calculation covers a 30-day window; the score can be zero if you have literally zero answered audits.

dugwood · March 16, 2022, 7:36am

So what metric can I check to see if the node is correctly interacting with the satellites?

As audits from:

for sat in `docker exec -i storagenode wget -qO - localhost:14002/api/sno | jq .satellites[].id -r`; do docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/$sat | jq .id,.audits; done

said 0.9999 and, after the restart, just said 0, I now understand that the score is computed on the satellites and stored locally, so monitoring this value is totally useless (because it is not updated at all).

What data is always computed? Whether satellites are reached or not? Using:

for sat in `docker exec -i storagenode wget -qO - localhost:14002/api/sno | jq .satellites[].id -r`; do docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/$sat | jq '{id: .id, auditHistory: [.auditHistory.windows[] | select(.totalCount != .onlineCount)]}'; done

This seems to work a bit better: I can see "windowStart": "2022-03-16T00:00:00Z", which tells me that the last check was made today and that there were errors (3 errors for 3 tests). If the date is way too old, or there are too many errors, I can decide that the node is faulty.

Hence:

onlineScore is wrong (as stated from the beginning, it doesn’t reflect «online» if it needs an answer from a satellite)
onlineScore should be fixed to reflect the auditHistory
Or:
there should be a «requestScore» that will show either that audit calls are failing too much or no audit calls were made at all (based on auditHistory and last date available on auditHistory).

To end this conversation, can I say that my node is faulty if:

either auditHistory has an old date for its last entry (more than 18 hours old)
or the last 5 entries from auditHistory have onlineCount / totalCount lower than 90% (it should be 99% most of the time)
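
The two conditions above can be sketched as a small check. This is only an illustration of the proposed heuristic, not an official health test: the function name node_looks_faulty is hypothetical, the 18-hour and 90% thresholds are the ones suggested in this post, and the ratio is computed over the combined counts of the last 5 windows (one possible reading of the rule). It assumes windowStart values in the `2022-03-16T00:00:00Z` form shown in the dumps above.

```shell
#!/usr/bin/env bash
# Reads one satellite's JSON on stdin; exits 0 when the node looks faulty.
# $1 = current time as Unix epoch seconds (optional, defaults to now).
node_looks_faulty() {
  local now="${1:-$(date +%s)}"
  jq -e --argjson now "$now" '
    (.auditHistory.windows // []) | sort_by(.windowStart) as $w
    | if ($w | length) == 0 then true   # no audit history at all
      else
        # Condition 1: newest window started more than 18 hours ago.
        (($now - ($w[-1].windowStart | fromdateiso8601)) > 18 * 3600)
        or
        # Condition 2: under 90% of audits answered in the last 5 windows.
        ($w[-5:]
         | (map(.totalCount) | add) as $total
         | (map(.onlineCount) | add) as $online
         | if $total == 0 then false else ($online / $total) < 0.9 end)
      end
  ' >/dev/null
}
```

Used inside the per-satellite loop, it could look like `docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/$sat | node_looks_faulty && echo "check satellite $sat"`. Keep in mind that, as discussed in this thread, the locally available history may itself be outdated while the node cannot reach the satellites.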

Alexey · March 16, 2022, 7:53am

It is not updated only in case of misconfiguration. If you fix the network issue, it will be updated at least once an hour, on check-in.

You cannot use the API to check the node’s connectivity to the satellites until your network issue is fixed.
Network issues can be monitored in your logs: there should not be any error like "ping satellite failed".

onlineScore is correct if your node is configured properly. If the node has a misconfigured network, the score will be outdated.

So, first of all, please fix all issues with your network, then configure proper monitoring (if you want).

dugwood · March 16, 2022, 9:48am

Again, not a network issue… Or perhaps weeks ago, then node is disqualified and we don’t know until node is restarted.

There should be a valid metric to check EVEN if network is broken. That’s the whole idea: know that the node has a global issue, without testing it from outside. Because UptimeRobot is up, port is open, so it SEEMS fine from the outside, but it’s apparently not.

Say I want to check that my home has Internet. I’ll set up a «ping» to google.com or the like. From the outside, I can ping my home too, and that may well work while DNS is broken at home, so I can’t ping google.com from there. I want the same thing here: to know from the inside that the node is fine.

onlineScore is out of the picture, as you said «if network issues». Or it should be combined with auditHistory having a timestamp not too far in the past.

I won’t monitor "ping satellite failed", as the text may change, and then I’d think the node is okay when it’s not. As a matter of fact, my current node (disqualified) doesn’t produce it, but I have a lot of "satellite XXX is untrusted" messages.

So what should it be? As you can read around this forum, many people have seen a drop in online score, complain about disqualification emails sent «on node restart», and the like. It should be easier than setting up Grafana just to get an email. A simple bash script should do it. So far I was doing it this way, alerting on any score lower than 0.95; adding a test on the last auditHistory entry may be the only thing missing here.

Alexey · March 16, 2022, 7:49pm

A suspended node is not the same as a disqualified one. Disqualification is permanent and unrecoverable. A suspension can be recovered from by fixing the underlying issue.
So, please clarify: is your node disqualified or suspended?

If your node doesn’t answer audit requests while it’s running, it’s a network issue; there is no other way to be offline.

You can use lastPinged to determine whether your node is actually online:

curl -L localhost:14002/api/sno | jq '.lastPinged'

or for docker

docker exec -i storagenode wget -qO - localhost:14002/api/sno | jq '.lastPinged'

If lastPinged is older than an hour (the default check interval), you should be concerned.
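
As a minimal sketch of that rule, the check below reads the /api/sno JSON on stdin and succeeds when lastPinged is older than a threshold. The function name last_ping_is_stale is hypothetical, and the script assumes lastPinged is a UTC timestamp starting with `YYYY-MM-DDTHH:MM:SS` (anything after the seconds is ignored). A missing or unparsable value is treated as stale.

```shell
#!/usr/bin/env bash
# Reads the /api/sno JSON on stdin; exits 0 when lastPinged is too old.
# $1 = maximum age in seconds (default 3600, the default check-in interval)
# $2 = current time as Unix epoch seconds (optional, defaults to now)
last_ping_is_stale() {
  local max_age="${1:-3600}" now="${2:-$(date +%s)}"
  jq -e --argjson now "$now" --argjson max_age "$max_age" '
    # ASSUMPTION: UTC timestamp; keep "YYYY-MM-DDTHH:MM:SS" and re-append "Z".
    (try ((.lastPinged // "")[0:19] + "Z" | fromdateiso8601) catch 0) as $pinged
    | ($now - $pinged) > $max_age
  ' >/dev/null
}
```

For example: `docker exec -i storagenode wget -qO - localhost:14002/api/sno | last_ping_is_stale && echo "node has not been pinged for over an hour"`. Run from cron, a line like this gives a local alert that does not depend on the cached scores.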

dugwood · March 17, 2022, 3:16am

Thanks @Alexey I’ll use this along with onlineScore.

dugwood · March 17, 2022, 9:22am

@Alexey my node was suspended then disqualified. Both mails were sent when I restarted my node in March, but both decisions were timed in January in the received mails.

Pac · March 17, 2022, 12:43pm

It feels to me like there’s something wrong on Storj’s side if the satellite alerts SNOs only when their nodes are online…
I mean, the suspension e-mail is all about telling the SNO something wrong needs to be fixed ASAP (like being offline) so it makes no sense not to send this alert e-mail if the node is offline.

I don’t want to have to set up the curl -L localhost:14002/api/sno | jq '.lastPinged' thingie myself in order to check whether my node is online or not.
In my opinion, something needs to be fixed.

I’ll have a look at the GitHub issue @Alexey linked above (“[storagenode] The email about suspension is coming only after check-in to satellite”, Issue #4235, storj/storj).