OnlineScore went to zero in one day

Hello, I’ve been running nodes for a while now, and have been monitoring «audit» for months. I check that «onlineScore» stays above 0.95, as I had a faulty node once, and this has worked great so far.

But I restarted all my servers, and hence my nodes, on Sunday (26th). Of 11 nodes, 3 just went from onlineScore > 95% (per my monitoring) to 0 today. How is this possible?

The only thing I’ve spotted so far is that I was setting an empty email address (typo in my bash restart script).

I’ve tried the script from Alexey for audit (My uptime should be 100% on all satellites i have not gotten any uptime robot notifications of downtime in months - #2 by Alexey), but it doesn’t work (either on good or bad nodes, nothing is printed in my console).

Got an issue with DNS, but restarting Docker fixed it (same issue with good nodes).

I’m monitoring Storj ports with uptimerobot, and everything’s fine. I’ve even opened UDP ports to be sure, nothing’s working so far…

Funny thing: all my «good» nodes are showing «Status OFFLINE» in the docker status’ page, all my «bad» nodes are showing «Status ONLINE» :smile:

Thanks!

I just received an email (only now, because I hadn’t set the right address, as explained above):

Your Storage Node on the asia-east-1 Satellite was suspended because it was offline too often during audits.

You were suspended on 2021-10-24 at 01:18 UTC.

You won’t receive any new data on this Satellite until you resolve the issue causing your node to be offline. See [here] for common solutions.

That’s funny :smiley:

Hello @dugwood and welcome to the forum!

A node’s online score is not supposed to drop to 0 in a couple of days, as it is a 30-day moving average. So that’s definitely weird. Maybe your nodes were actually off-grid, preventing them from getting score updates from satellites?

It’d be a good idea to check your nodes’ logs out, and look for errors for the past couple of days.

  • How are your other scores looking?
  • Are your nodes sending and receiving data?
  • Are your disks keeping up? (iowait?)

How are you checking connectivity with uptimerobot?
You should check for open ports.

Double check your port redirections, but if uptimerobot is able to reach all your nodes’ ports, I’m not sure what’s wrong really :confused:


How to check your logs for errors:

And more precisely, to find out what is causing the suspension:


Hello @Pac,

Thanks for your input. Clearly it’s an old issue that only became visible once I fixed my email and restarted the nodes.

Tests:

  • uptimerobot looks for TCP open port 28967. Upon restart of servers, it goes down as expected, so the test seems to be good.
  • my local test is to extract audit log, and test if:
    • there are at least 10 metrics including «Score»
    • none of them are below 0.95
      => this test was fine until my reboot on Sunday
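The local test described above could be sketched roughly as follows. This is only an illustration: it assumes the scores have already been collected into a mapping from satellite name to its metrics (how they are fetched is not shown), and the function name `check_scores` is hypothetical.

```python
from typing import Dict, List

MIN_METRICS = 10   # at least this many "*Score" values must be present
MIN_SCORE = 0.95   # alert threshold used in this thread


def check_scores(satellites: Dict[str, dict]) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    # Collect every metric whose key ends in "Score" (auditScore, onlineScore, ...).
    scores = [
        (name, key, value)
        for name, metrics in satellites.items()
        for key, value in metrics.items()
        if key.endswith("Score")
    ]
    if len(scores) < MIN_METRICS:
        problems.append(f"only {len(scores)} score metrics found")
    for name, key, value in scores:
        if value < MIN_SCORE:
            problems.append(f"{name}: {key} = {value}")
    return problems
```

Counting the metrics first guards against the case where the audit output is empty or truncated, which would otherwise look like "nothing below 0.95".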

Nothing useful in logs:

2021-12-27T19:36:56.732Z INFO Configuration loaded {“Location”: “/app/config/config.yaml”}
2021-12-27T19:36:56.733Z INFO Operator email {“Address”: “xxx”}
2021-12-27T19:36:56.733Z INFO Operator wallet {“Address”: “xxx”}
2021-12-27T19:36:57.334Z INFO Telemetry enabled {“instance ID”: “xxx”}
2021-12-27T19:36:58.735Z INFO db.migration Database Version {“version”: 53}
2021-12-27T19:36:59.579Z INFO preflight:localtime start checking local system clock with trusted satellites’ system clock.
2021-12-27T19:37:00.448Z INFO preflight:localtime local system clock is in sync with trusted satellites’ system clock.
2021-12-27T19:37:00.448Z INFO Node xxx started
2021-12-27T19:37:00.448Z INFO Public server started on [::]:28967
2021-12-27T19:37:00.448Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T19:37:00.448Z INFO Private server started on 127.0.0.1:7778
2021-12-27T19:37:00.448Z INFO trust Scheduling next refresh {“after”: “6h7m24.495681573s”}
2021-12-27T19:52:55.790Z WARN console:service unable to get Satellite URL {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “error”: “console: trust: satellite “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW” is untrusted”, “errorVerbose”: “console: trust: satellite “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW” is untrusted\n\tstorj.io/storj/storagenode/trust.(*Pool).getInfo:238\n\tstorj.io/storj/storagenode/trust.(*Pool).GetNodeURL:177\n\tstorj.io/storj/storagenode/console.(*Service).GetDashboardData:174\n\tstorj.io/storj/storagenode/console/consoleapi.(*StorageNode).StorageNode:45\n\tnet/http.HandlerFunc.ServeHTTP:2047\n\tgithub.com/gorilla/mux.(*Router).ServeHTTP:210\n\tnet/http.serverHandler.ServeHTTP:2879\n\tnet/http.(*conn).serve:1930”}
2021-12-27T20:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T21:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T22:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T23:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T00:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T01:37:00.448Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T01:44:25.022Z INFO trust Scheduling next refresh {“after”: “6h18m21.048507951s”}
2021-12-28T02:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T03:00:06.242Z WARN console:service unable to get Satellite URL {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “error”: “console: trust: satellite “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW” is untrusted”, “errorVerbose”: “console: trust: satellite “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW” is untrusted\n\tstorj.io/storj/storagenode/trust.(*Pool).getInfo:238\n\tstorj.io/storj/storagenode/trust.(*Pool).GetNodeURL:177\n\tstorj.io/storj/storagenode/console.(*Service).GetDashboardData:174\n\tstorj.io/storj/storagenode/console/consoleapi.(*StorageNode).StorageNode:45\n\tnet/http.HandlerFunc.ServeHTTP:2047\n\tgithub.com/gorilla/mux.(*Router).ServeHTTP:210\n\tnet/http.serverHandler.ServeHTTP:2879\n\tnet/http.(*conn).serve:1930”}
2021-12-28T03:37:00.448Z INFO bandwidth Performing bandwidth usage rollups

The only warning is about a dead external node; I’ve found it on Google too.
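For anyone who wants to scan a saved log the same way, here is a minimal sketch that keeps only the entries at WARN level or above. It assumes the line layout seen in the excerpt above (timestamp, level, message); the level names other than INFO and WARN are an assumption, and `problem_lines` is a hypothetical helper name.

```python
import re
from typing import Iterable, List

# A timestamp such as 2021-12-27T19:36:56.732Z, then the level, then the message.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\S+)\s+(DEBUG|INFO|WARN|ERROR|FATAL)\s+(.*)$"
)


def problem_lines(lines: Iterable[str],
                  levels=("WARN", "ERROR", "FATAL")) -> List[str]:
    """Keep entries whose level is in `levels`.

    Lines that do not start with a timestamp (blank lines, wrapped
    continuations of a long entry) are skipped.
    """
    kept = []
    for line in lines:
        stripped = line.strip()
        match = LINE_RE.match(stripped)
        if match and match.group(2) in levels:
            kept.append(stripped)
    return kept
```

Applied to the excerpt above, this would keep only the two «unable to get Satellite URL» entries.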

My guess:

  • the misconfigured email prevented me from getting warnings (even though my tests were added in July and all the suspensions are dated October).
  • restarting may have triggered some kind of reset that prevents my nodes from reconnecting.

My suggestions:

  • forbid start of node if email is missing, so change WARN to ERROR and halt the script:

2021-12-27T09:32:02.003Z WARN Operator email address isn’t specified.

  • prevent onlineScore from staying at 95%+ when the node is disqualified

For now I’ll recreate the faulty nodes, as it seems I don’t have any other choice.

Best regards.
(IP removed, don’t worry about it, it’s scanned every second by bots anyway!)

Unfortunately, email warnings about suspension are sent by the satellites when your node comes online. This is a complicated issue, so we suggest not relying on our email warnings; use the [Tech Preview] Email alerts with Grafana and Prometheus instead.

Regarding the online score: it drops when the satellite cannot contact your node for any reason. Each satellite does this independently, so your node may be online on one satellite and offline on another (for example, if your firewall or your ISP’s firewall is blocking incoming traffic from that satellite).
So I would recommend checking your firewall and your ISP’s firewall settings and logs, and making sure that they do not block traffic to or from your node for specific IPs or something like that.

Thanks Alexey. My point about the email is just that the node should warn when no email is set. My code was using $EMAIL but I had set MAIL in the script, so I was sending a blank email address, which should be forbidden. Of course, I could have misspelled my email… But it’s a specific issue, as very few people will have created a startup script as I did.
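A start-up wrapper can guard against this kind of variable mix-up by refusing to continue when the address is empty. A minimal sketch, assuming the address is passed through an environment variable named `EMAIL` (the function name `require_email` and the very loose "contains @" check are illustrative choices, not anything the node itself does):

```python
from typing import Mapping


def require_email(env: Mapping[str, str], name: str = "EMAIL") -> str:
    """Return the operator email from `env`, or raise if it is missing or blank."""
    value = env.get(name, "").strip()
    # Deliberately loose sanity check: non-empty and contains an "@".
    if not value or "@" not in value:
        raise ValueError(f"{name} is not set to a usable email address")
    return value
```

Called as `require_email(os.environ)` (after `import os`) before launching the container, this would have stopped the restart script when only `MAIL` was defined. In a bash script, the `${EMAIL:?message}` expansion gives the same fail-fast behaviour.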

As for the online score, I can guarantee you that it was still over 95% weeks after the node had been banned. So it’s the combination of the ban and the node rebooting that led to a restart at 0%. That’s when I found out.


Hello @Alexey & @Pac, I think there’s an issue with the online score returned by satellites. It’s somewhat linked to my previous issue, so I’m using the same thread.

History of events:

  • no monitoring issue on audits (so all nodes report an onlineScore of at least 0.95); the checks run at 10:10 and 20:10 every day.
  • January 22 10:10: test of audits OK (at least 0.95)
  • January 22 20:51: server goes down (electrical issue at hosting company)
  • January 22 21:22: server is back online
  • January 22 21:40: external Storj test sends OK signal (Storj is up and running on the standard Storj port)

So, all in all, the downtime was less than one hour.

  • January 23 10:10: audit check is not OK: onlineScore 0.13269756959185705
  • January 24 10:11: onlineScore = 0.13269756959185705 (same?!)
  • January 25 10:10: onlineScore = 0.13269756959185705 (same!!)
  • January 26 10:10: onlineScore = 0.13269756959185705 (same!!)
  • January 27 10:10: onlineScore = 0.13627450980392156 (got 0.4% increase)
  • January 28 10:10: onlineScore = 0.16960784313725488 (got 3.3% increase)

For your information: these are the values from europe-north-1.tardigrade.io:7777 (I always check all servers, so the last one with less than 0.95 is the one whose onlineScore is reported).
One server was at 100% onlineScore for the whole time: us2.storj.io:7777, which seems wrong too…

So what is going on? Can I provide more information about my node? It clearly went from 100% to 13% for 1 hour of downtime…


The online score is calculated over a 30-day window, so your node needs to stay online until the offline period is 30 days in the past; then it will recover back to 100%.
Every offline event requires the following 30 days online to recover.

I think it would be nice for node operators to have feedback faster than after 30 days.

Would it be possible to add an additional metric, maybe something simple like «Time since last failed online check»? Or, let’s say, keeping the score convention, compute the score itself from slightly weighted daily scores?

You can add this feature request to the Storage Node feature requests - voting - Storj Community Forum (official) or on GitHub

FYI: This online score system was not made for feedback; it’s a metric, part of the reputation.

And what’s the purpose of metrics? To induce feedback when necessary. :man_shrugging:

To show details of the reputation.
What I want to say is that these metrics were not made to be reactive :slight_smile:
Even usage is updated only once every 12 hours. The online score is calculated over a 30-day window; it’s designed like that.
See

It would really help if you could submit your own version of the blueprint. The team will review it, and you can make the change.

Judging from this document’s level of detail, I’d have to know much more about the internals of the satellite code than I do now to suggest anything reasonable, sorry.

I suspect that what I have in mind would be to have:

By granting each individual window the same weight in the calculation of the overall average, the effect of any particularly unlucky period can be minimized while still allowing us to take the failures into account over a longer period.

modified to allow some small weighting by the window’s age. I don’t know how big the specific windows are, so I can’t suggest exact values. If they’re daily, weights like 1 - (age_in_days / 30 / 10) (making the oldest window be weighted at 0.9) would already make recovery visible early in the fractional part of the online score, while not changing the score values significantly enough to affect suspension logic.
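To make the proposal concrete, here is a small worked sketch comparing the equal-weight average with the age-weighted one. It is not the satellite’s implementation: it assumes 30 daily windows aged 1 to 30 days, each with a score between 0 and 1, and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def window_weight(age_in_days: float) -> float:
    """Weight proposed above: close to 1.0 for a fresh window, 0.9 at 30 days."""
    return 1 - (age_in_days / 30 / 10)


def online_score(windows: Sequence[Tuple[float, float]], weighted: bool) -> float:
    """Average per-window scores given as (age_in_days, score) pairs."""
    weights = [window_weight(age) if weighted else 1.0 for age, _ in windows]
    total = sum(w * score for w, (_, score) in zip(weights, windows))
    return total / sum(weights)


def one_outage(age_of_outage: int) -> List[Tuple[float, float]]:
    """30 daily windows, all perfect except one fully offline day."""
    return [(age, 0.0 if age == age_of_outage else 1.0) for age in range(1, 31)]
```

Under these assumptions the equal-weight score stays at about 0.967 wherever the outage sits in the window, while the weighted score moves from about 0.965 (outage one day old) to about 0.968 (outage 30 days old). The recovery shows up in the third decimal, and the value stays far from the suspension threshold.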

Sorry @Alexey, I didn’t receive updates from the forum… (OK, found the setting, but it should be enabled by default, as I’ve checked «Watching»).

I’ve just restarted another node today, and it went directly to zero. I’m sorry, but you DO have an issue here:

Before restart:

"12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"
{
  "auditScore": 1,
  "suspensionScore": 1,
  "onlineScore": 0.9998307952622674,
  "satelliteName": "europe-north-1.tardigrade.io:7777"
}

After restart:

"12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"
{
  "auditScore": 1,
  "suspensionScore": 1,
  "onlineScore": 0,
  "satelliteName": "europe-north-1.tardigrade.io:7777"
}

(same thing with all other satellites).

At the same time, I received an email with:

You were suspended on 2022-01-09 at 06:21 UTC.

So the email tells me that my node was suspended… in January!

Partial node ID if you want to check something: 1E4A9cAijMMaL1DfD7VP6oYkqEoiHMe

So the topic is still valid: went to zero in one day!

Can you please have a deeper look at this? Should I restart this node from scratch and lose all its history?

Thanks.

The online score is updated from the satellites on every check-in. When you restarted the node, it checked in with the satellites and got an update.

Just keep it online, the online score should recover during the next 30 days online.
Each downtime requires another 30 days online.

Hi @Alexey,

My node was disqualified too, so I think I can’t get it back online as you said. Can you confirm whether a disqualified node is disqualified for life or not?

And it’s a big issue: if I can’t know that I’m suspended/disqualified until I’ve restarted the node, that’s a big problem!

As I’ve stated, my online, suspension and audit scores were perfect, and after restarting, only the online score went to zero (the restart took 30 min, not 30 days…). So there’s a very big issue there!!

So you’re saying we have to restart our nodes every day to find out whether our online score got hurt? That’s a no-go for me. And other nodes clearly show online score updates without being restarted.

Please, have a look at what went wrong, I can send you the full Node ID in a private message if needed.

Drop in online score != disqualification.
If your node is disqualified, it will not recover. Disqualification is permanent and not reversible; it has nothing in common with downtime (unless your node was offline for more than 30 days).
Disqualification usually happens when your node has lost or corrupted data.

The suspension != disqualification.
The suspension can be applied in two cases:

  1. Your suspension score is lower than 60%
  2. Your online score is lower than 60%

The suspension score is affected when your node is online and answers audit requests, but returns unknown errors instead of pieces. Once your node starts answering audit requests (GET_AUDIT and GET_REPAIR) normally, the suspension score recovers quickly with each passed audit. If your node answers with known errors like “file not found” or “disk i/o”, or with corrupted pieces, it affects the audit score instead.

If the audit score drops below 60%, your node will be disqualified.

The online score is affected when your node doesn’t answer audit requests at all. As soon as your node starts answering audit requests, the online score slowly recovers. To recover fully, the node needs to be online for 30 days. Each downtime requires another 30 days to recover.

All scores are received from the satellites. If your node has been offline long enough to reset your online score to zero, the score is updated as soon as you bring the node back online: the node then receives updated scores from the satellites. In your case, the online score is zero.

All three metrics are independent. Your node can be disqualified without the online score or suspension score being affected: if the audit score falls below 60%, the node is disqualified.
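The rules above can be summarised in a few lines. This is only a simplified reading of this post, not the satellite code: it uses the 60% thresholds quoted above, ignores disqualification after more than 30 days offline, and the function name `node_status` is hypothetical.

```python
THRESHOLD = 0.6  # 60%, the limit quoted above for all three scores


def node_status(audit: float, suspension: float, online: float) -> str:
    """Classify a node on one satellite from its three independent scores."""
    # A low audit score is the only score-based path to disqualification.
    if audit < THRESHOLD:
        return "disqualified"
    reasons = []
    if suspension < THRESHOLD:
        reasons.append("suspension score")
    if online < THRESHOLD:
        reasons.append("online score")
    if reasons:
        return "suspended: " + ", ".join(reasons)
    return "ok"
```

With scores like those posted earlier in the thread (auditScore 1, suspensionScore 1, onlineScore 0), this yields a suspension because of the online score, not a disqualification.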

@Alexey thanks for the clarifications. But that doesn’t help at all, since my audit and suspension scores have NEVER gone below 95% (I do monitor these every single day). As of now, it states:

"12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"
{
  "auditScore": 1,
  "suspensionScore": 1,
  "onlineScore": 0,
  "satelliteName": "us2.storj.io:7777"
}

So you’re saying that audit is working (as it’s 100% OK), but online is 0%… How can I have 0% online if audit is 100%?!

I’ll recreate the node as it has been disqualified. But there’s a HUGE issue on your end, you MUST fix it. Or audit and suspension scores are useless…

Node information:

ID     1E4A9cAijMMaL1DfD7VP6oYkqEoiHMeBskkuBeYG7ZMjukmMRt
Status ONLINE
Uptime 73h2m11s

                   Available          Used     Egress     Ingress
     Bandwidth           N/A           0 B        0 B         0 B (since Mar 1)
          Disk     134.71 GB     415.29 GB

Again. All three metrics are independent.
An online score of 0 means that your node has not answered a single audit in the last 30 days.
You can check this: