Hello, I’ve been running nodes for a while now, and have had monitoring of «audit» in place for months. I check that «onlineScore» stays above 0.95, as I once had a faulty node, and it has worked great so far.
But I restarted all my servers, and hence my nodes, on Sunday (the 26th). Out of 11 nodes, 3 just went from onlineScore > 95% (per my monitoring) to 0 today. How is this possible?
The only thing I’ve spotted so far is that I was setting an empty email address (typo in my bash restart script).
A node’s online score is not supposed to drop to 0 in a couple of days, as it is a 30-day moving average. So that’s definitely weird. Maybe your nodes were actually off the grid, preventing them from getting score updates from satellites?
It’d be a good idea to check your nodes’ logs and look for errors from the past couple of days.
How are your other scores looking?
Are your nodes sending and receiving data?
Are your disks keeping up? (iowait?)
How are you checking connectivity with uptimerobot?
You should check whether your ports are open.
Double-check your port forwarding; but if uptimerobot is able to reach all your nodes’ ports, I’m not sure what’s wrong, really.
Thanks for your input. Clearly it’s an old issue that only became visible once I fixed my email and restarted the nodes.
Tests:
uptimerobot checks that TCP port 28967 is open. When the servers restart, it reports them down as expected, so that test seems to be working.
my local test extracts the audit data and checks that:
there are at least 10 metrics containing «Score»
none of them is below 0.95
=> this test was passing until my reboot on Sunday
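To make it concrete, here is roughly what that local check does, as a simplified Python sketch (my real script is a bash one; the file name audit_scores.txt and the exact line format are placeholders for whatever the extraction step produces):

```python
# Simplified sketch of the local audit check (illustrative only).
# Assumes the extracted audit data has been dumped to a text file where each
# relevant line contains the word "Score" followed by a numeric value,
# e.g. "onlineScore: 0.998". The real extraction format may differ.
import re
import sys

THRESHOLD = 0.95
MIN_METRICS = 10

def check_scores(path: str) -> bool:
    scores = []
    with open(path) as f:
        for line in f:
            if "Score" in line:
                match = re.search(r"([01](?:\.\d+)?)\s*$", line.strip())
                if match:
                    scores.append(float(match.group(1)))
    if len(scores) < MIN_METRICS:
        print(f"FAIL: only {len(scores)} score metrics found (need {MIN_METRICS})")
        return False
    low = [s for s in scores if s < THRESHOLD]
    if low:
        print(f"FAIL: {len(low)} score(s) below {THRESHOLD}: {low}")
        return False
    print(f"OK: {len(scores)} scores, all >= {THRESHOLD}")
    return True

if __name__ == "__main__":
    sys.exit(0 if check_scores("audit_scores.txt") else 1)
```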
Nothing useful in logs:
2021-12-27T19:36:56.732Z INFO Configuration loaded {"Location": "/app/config/config.yaml"}
2021-12-27T19:36:56.733Z INFO Operator email {"Address": "xxx"}
2021-12-27T19:36:56.733Z INFO Operator wallet {"Address": "xxx"}
2021-12-27T19:36:57.334Z INFO Telemetry enabled {"instance ID": "xxx"}
2021-12-27T19:36:58.735Z INFO db.migration Database Version {"version": 53}
2021-12-27T19:36:59.579Z INFO preflight:localtime start checking local system clock with trusted satellites' system clock.
2021-12-27T19:37:00.448Z INFO preflight:localtime local system clock is in sync with trusted satellites' system clock.
2021-12-27T19:37:00.448Z INFO Node xxx started
2021-12-27T19:37:00.448Z INFO Public server started on [::]:28967
2021-12-27T19:37:00.448Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T19:37:00.448Z INFO Private server started on 127.0.0.1:7778
2021-12-27T19:37:00.448Z INFO trust Scheduling next refresh {"after": "6h7m24.495681573s"}
2021-12-27T19:52:55.790Z WARN console:service unable to get Satellite URL {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "console: trust: satellite "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW" is untrusted", "errorVerbose": "console: trust: satellite "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW" is untrusted\n\tstorj.io/storj/storagenode/trust.(*Pool).getInfo:238\n\tstorj.io/storj/storagenode/trust.(*Pool).GetNodeURL:177\n\tstorj.io/storj/storagenode/console.(*Service).GetDashboardData:174\n\tstorj.io/storj/storagenode/console/consoleapi.(*StorageNode).StorageNode:45\n\tnet/http.HandlerFunc.ServeHTTP:2047\n\tgithub.com/gorilla/mux.(*Router).ServeHTTP:210\n\tnet/http.serverHandler.ServeHTTP:2879\n\tnet/http.(*conn).serve:1930"}
2021-12-27T20:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T21:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T22:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T23:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T00:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T01:37:00.448Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T01:44:25.022Z INFO trust Scheduling next refresh {"after": "6h18m21.048507951s"}
2021-12-28T02:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T03:00:06.242Z WARN console:service unable to get Satellite URL {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "console: trust: satellite "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW" is untrusted", "errorVerbose": "console: trust: satellite "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW" is untrusted\n\tstorj.io/storj/storagenode/trust.(*Pool).getInfo:238\n\tstorj.io/storj/storagenode/trust.(*Pool).GetNodeURL:177\n\tstorj.io/storj/storagenode/console.(*Service).GetDashboardData:174\n\tstorj.io/storj/storagenode/console/consoleapi.(*StorageNode).StorageNode:45\n\tnet/http.HandlerFunc.ServeHTTP:2047\n\tgithub.com/gorilla/mux.(*Router).ServeHTTP:210\n\tnet/http.serverHandler.ServeHTTP:2879\n\tnet/http.(*conn).serve:1930"}
2021-12-28T03:37:00.448Z INFO bandwidth Performing bandwidth usage rollups
The only warning is about a dead external satellite; I’ve found mentions of it on Google too.
My guess:
the misconfigured email prevented me from getting warnings (even though my tests were added in July and all the suspensions were dated in October):
Unfortunately, email warnings about suspension are sent by the satellites when your node comes back online. This is a complicated issue, and we suggest not relying on our email warnings; use the [Tech Preview] Email alerts with Grafana and Prometheus instead.
Regarding the online score: it drops when a satellite could not contact your node for any reason. Each satellite tracks this independently, so your node may be online for one satellite and offline for another (for example, your firewall or your ISP’s firewall blocking incoming traffic from that satellite).
So I would recommend checking your firewall and your ISP’s firewall settings and logs, and making sure that they do not block traffic to/from your node for specific IPs or anything like that.
Thanks Alexey. The point about the email is just to warn that no email was set: my code was using $EMAIL but I set MAIL in the script, so I was sending a blank email address, which should probably be rejected. Of course, I could also have misspelled my email… But it’s a niche issue, as very few people will have written a startup script the way I did.
For the online score, I can guarantee you that it was over 95% for weeks after being banned. So it’s the combination of rebooting the node and the ban that led to it restarting at 0%. That’s when I found out.
Hello @Alexey & @Pac, I think there’s an issue with the online score returned by satellites. It’s somewhat related to my previous issue, so I’m using the same thread.
History of events:
no monitoring issues on audits (so all nodes report an onlineScore of at least 0.95); the checks run at 10:10 and 20:10 every day.
January 22 10:10: test of audits OK (at least 0.95)
January 22 20:51: server goes down (electrical issue at hosting company)
January 22 21:22: server is back online
January 22 21:40: external Storj test sends OK signal (Storj is up and running on the standard Storj port)
So, all in all, the downtime was less than one hour.
January 23 10:10: audit check is not OK: onlineScore 0.13269756959185705
January 24 10:11: onlineScore = 0.13269756959185705 (same?!)
January 25 10:10: onlineScore = 0.13269756959185705 (same!!)
January 26 10:10: onlineScore = 0.13269756959185705 (same!!)
January 27 10:10: onlineScore = 0.13627450980392156 (got 0.4% increase)
January 28 10:10: onlineScore = 0.16960784313725488 (got 3.3% increase)
For your information: these values are from europe-north-1.tardigrade.io:7777 (I always check all satellites, so the last one with a score below 0.95 is the one whose onlineScore gets reported).
One satellite was at 100% onlineScore the whole time: us2.storj.io:7777, which seems wrong too…
So what is going on? Can I provide more information about my node? It clearly went from 100% to 13% for 1 hour of downtime…
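Just to put that in perspective, here is a back-of-the-envelope estimate (it naively treats the online score as time-online over the 30-day window; the real score is based on audit contact windows, so this is only a rough approximation):

```python
# Naive estimate: treat the online score as time-online / total-time over
# the 30-day window. The real score is computed from audit contact windows,
# so this is only an order-of-magnitude sanity check.
WINDOW_HOURS = 30 * 24   # ~720 hours in the 30-day window
DOWNTIME_HOURS = 1       # roughly the outage on January 22

expected = (WINDOW_HOURS - DOWNTIME_HOURS) / WINDOW_HOURS
print(f"Naively expected online score: {expected:.4f}")   # ~0.9986
print("Observed online score:         0.1327")
```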
The online score is calculated over a 30-day window, so your node needs to get the offline window 30 days behind it; then it will recover back to 100%.
Every offline event will require the next 30 days to be online to recover.
I think it would be nice for node operators to have feedback faster than after 30 days.
Would it be possible to add an additional metric, maybe something simple like «Time since last failed online check»? Or, let’s say, keeping the score convention, compute the score itself from slightly weighted daily scores?
To show details of the reputation.
What I want to say is that these metrics were not made to be reactive.
Even usage is updated only once every 12 hours. The online score is calculated over a 30-day window; it’s designed that way.
See
It really could help if you submitted your own version of the blueprint. The team will review it, and then you could make the change.
Judging from this document’s level of detail, I’d have to know much more about the internals of the satellite code than I do now to suggest anything reasonable, sorry.
What I have in mind would essentially be to have:
By granting each individual window the same weight in the calculation of the overall average, the effect of any particularly unlucky period can be minimized while still allowing us to take the failures into account over a longer period.
modified to allow some small weighting by the window’s age. I don’t know how big the specific windows are, so I can’t suggest exact figures. If they’re daily, weights like 1 - (age_in_days / 30 / 10) (making the oldest window weigh 0.9) would already make recovery visible early on in the fractional part of the online score, while not changing the score values significantly enough to affect the suspension logic.
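To illustrate the idea, here is a small sketch comparing the current equal-weight average with the slightly age-weighted variant (it assumes daily windows and a single online ratio per window, which is certainly a simplification of what the satellites actually track):

```python
# Sketch of the suggested age-weighting (assumes 30 daily windows, each with
# an online ratio between 0 and 1; the real implementation tracks audit
# contact windows, so this only illustrates the shape of the idea).

def equal_weight_score(windows):
    """Current behaviour: plain average over the last 30 daily windows."""
    return sum(windows) / len(windows)

def age_weighted_score(windows):
    """Suggested variant: weight = 1 - (age_in_days / 30 / 10), so the
    oldest window counts for 0.9 and the newest for 1.0."""
    ages = range(len(windows) - 1, -1, -1)          # windows are oldest-first
    weights = [1 - (age / 30 / 10) for age in ages]
    return sum(w * v for w, v in zip(weights, windows)) / sum(weights)

# One fully-offline day, everything else online; watch the score as that
# bad window ages towards the edge of the 30-day horizon.
for age in (1, 10, 20, 29):
    windows = [1.0] * 30
    windows[29 - age] = 0.0
    print(f"offline day {age:2d} days old -> "
          f"equal: {equal_weight_score(windows):.4f}, "
          f"weighted: {age_weighted_score(windows):.4f}")
```

With equal weights the score stays flat at ~0.9667 until the offline day leaves the window entirely, whereas the weighted version creeps up a little every day, so recovery is visible well before day 30.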
My node was disqualified too, so I think I can’t get it back online as you said. Can you confirm that a disqualified node is disqualified for life or not?
And it’s a big issue: if I can’t know that I’m suspended/disqualified until I’ve restarted the node, that’s big trouble!
As I’ve stated, my online, suspension and audit scores were perfect, and after restarting, only the online score went to zero (the restart took 30 minutes, not 30 days…). So there’s a very big issue there!!
So you’re saying we have to restart our nodes every day to know whether our online score has taken a hit? That’s a no-go for me. And other nodes clearly show online score updates without being restarted.
Please, have a look at what went wrong, I can send you the full Node ID in a private message if needed.
Drop in online score != disqualification.
If your node is disqualified, it will not recover. Disqualification is permanent and not reversible, and it has nothing to do with downtime (unless your node was offline for more than 30 days).
Disqualification usually happens when your node has lost or corrupted data.
The suspension != disqualification.
The suspension can be applied in two cases:
Your suspension score is lower than 60%
Your online score is lower than 60%
The suspension score is affected when your node is online and answers audit requests, but returns unknown errors instead of pieces. If your node starts answering audit requests (GET_AUDIT and GET_REPAIR) normally, the suspension score will quickly recover with each passed audit. If your node answers with known errors like "file not found" or "disk i/o", or with corrupted pieces, it will affect the audit score instead.
If the audit score drops below 60%, your node will be disqualified.
The online score is affected when your node doesn’t answer audit requests at all. As soon as your node starts answering audit requests again, the online score will slowly recover. Full recovery requires 30 days online, and each downtime requires another 30 days to recover.
All scores are received from the satellites. If your node has been offline long enough to reset your online score to zero, it will be updated as soon as you manage to bring the node online again - it will receive the updated scores from the satellites. In your case, the online score is zero.
All three metrics are independent. Your node can be disqualified without its online score or suspension score being affected: if the audit score falls below 60%, the node will be disqualified.
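To summarize the rules described above in one place (a rough sketch only; the satellites evaluate these thresholds themselves, and the 60% values are the ones stated above):

```python
# Rough summary of the rules described above (illustrative only).
DQ_THRESHOLD = 0.60          # audit score below this -> disqualification (permanent)
SUSPENSION_THRESHOLD = 0.60  # suspension or online score below this -> suspension

def node_status(audit: float, suspension: float, online: float) -> str:
    if audit < DQ_THRESHOLD:
        return "DISQUALIFIED (permanent, not reversible)"
    if suspension < SUSPENSION_THRESHOLD or online < SUSPENSION_THRESHOLD:
        return "SUSPENDED (recoverable once the failing score climbs back)"
    return "OK"

# The three scores are independent; for example, a node can keep a perfect
# audit score while its online score is near zero:
print(node_status(audit=1.00, suspension=1.00, online=0.13))  # SUSPENDED ...
print(node_status(audit=0.55, suspension=1.00, online=1.00))  # DISQUALIFIED ...
print(node_status(audit=1.00, suspension=1.00, online=0.99))  # OK
```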
@Alexey thanks for the clarifications. But that doesn’t help at all, since my audit and suspension scores have NEVER gone below 95% (I monitor these every single day). As of now, it states:
So you’re saying that audit is working (as it’s 100% OK), but online is 0%… How can I have 0% online if audit is 100%?!
I’ll recreate the node since it has been disqualified. But there’s a HUGE issue on your end and you MUST fix it, or the audit and suspension scores are useless…
Node information:
ID 1E4A9cAijMMaL1DfD7VP6oYkqEoiHMeBskkuBeYG7ZMjukmMRt
Status ONLINE
Uptime 73h2m11s
            Available        Used      Egress     Ingress
Bandwidth         N/A         0 B         0 B         0 B (since Mar 1)
Disk        134.71 GB   415.29 GB