Hello, I’ve been running nodes for a while now, and have had monitoring of «audit» in place for months. I check that «onlineScore» stays above 0.95, as I once had a faulty node, and it has worked great so far.
But I restarted all my servers, and hence my nodes, on Sunday (the 26th). Out of 11 nodes, 3 just went from onlineScore > 95% (per my monitoring) to 0 today. How is this possible?
The only thing I’ve spotted so far is that I was setting an empty email address (typo in my bash restart script).
A node’s online score is not supposed to drop to 0 in a couple of days, as it is a 30-day moving average. So that’s definitely weird. Maybe your nodes were actually off the grid, preventing them from getting score updates from satellites?
It’d be a good idea to check your nodes’ logs and look for errors from the past couple of days.
How are your other scores looking?
Are your nodes sending and receiving data?
Are your disks keeping up? (iowait?)
How are you checking connectivity with uptimerobot?
You should check whether your ports are open.
Double-check your port forwarding; but if uptimerobot is able to reach all your nodes’ ports, I’m not sure what’s wrong, really.
Thanks for your input. Clearly it’s an old issue that only became visible once I fixed my email and restarted the nodes.
Tests:
uptimerobot checks that TCP port 28967 is open. When the servers restart, it reports them down as expected, so that test seems to be working.
my local test extracts the audit data and checks that:
there are at least 10 metrics containing «Score»
none of them is below 0.95
=> this test was passing until my reboot on Sunday
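To make it concrete, here is roughly what that local check does, as a simplified Python sketch (my real script is a bash one; the file name audit_scores.txt and the exact line format are placeholders for whatever the extraction step produces):

```python
# Simplified sketch of the local audit check (illustrative only).
# Assumes the extracted audit data has been dumped to a text file where each
# relevant line contains the word "Score" followed by a numeric value,
# e.g. "onlineScore: 0.998". The real extraction format may differ.
import re
import sys

THRESHOLD = 0.95
MIN_METRICS = 10

def check_scores(path: str) -> bool:
    scores = []
    with open(path) as f:
        for line in f:
            if "Score" in line:
                match = re.search(r"([01](?:\.\d+)?)\s*$", line.strip())
                if match:
                    scores.append(float(match.group(1)))
    if len(scores) < MIN_METRICS:
        print(f"FAIL: only {len(scores)} score metrics found (need {MIN_METRICS})")
        return False
    low = [s for s in scores if s < THRESHOLD]
    if low:
        print(f"FAIL: {len(low)} score(s) below {THRESHOLD}: {low}")
        return False
    print(f"OK: {len(scores)} scores, all >= {THRESHOLD}")
    return True

if __name__ == "__main__":
    sys.exit(0 if check_scores("audit_scores.txt") else 1)
```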
Nothing useful in logs:
2021-12-27T19:36:56.732Z INFO Configuration loaded {"Location": "/app/config/config.yaml"}
2021-12-27T19:36:56.733Z INFO Operator email {"Address": "xxx"}
2021-12-27T19:36:56.733Z INFO Operator wallet {"Address": "xxx"}
2021-12-27T19:36:57.334Z INFO Telemetry enabled {"instance ID": "xxx"}
2021-12-27T19:36:58.735Z INFO db.migration Database Version {"version": 53}
2021-12-27T19:36:59.579Z INFO preflight:localtime start checking local system clock with trusted satellites' system clock.
2021-12-27T19:37:00.448Z INFO preflight:localtime local system clock is in sync with trusted satellites' system clock.
2021-12-27T19:37:00.448Z INFO Node xxx started
2021-12-27T19:37:00.448Z INFO Public server started on [::]:28967
2021-12-27T19:37:00.448Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T19:37:00.448Z INFO Private server started on 127.0.0.1:7778
2021-12-27T19:37:00.448Z INFO trust Scheduling next refresh {"after": "6h7m24.495681573s"}
2021-12-27T19:52:55.790Z WARN console:service unable to get Satellite URL {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "console: trust: satellite "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW" is untrusted", "errorVerbose": "console: trust: satellite "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW" is untrusted\n\tstorj.io/storj/storagenode/trust.(*Pool).getInfo:238\n\tstorj.io/storj/storagenode/trust.(*Pool).GetNodeURL:177\n\tstorj.io/storj/storagenode/console.(*Service).GetDashboardData:174\n\tstorj.io/storj/storagenode/console/consoleapi.(*StorageNode).StorageNode:45\n\tnet/http.HandlerFunc.ServeHTTP:2047\n\tgithub.com/gorilla/mux.(*Router).ServeHTTP:210\n\tnet/http.serverHandler.ServeHTTP:2879\n\tnet/http.(*conn).serve:1930"}
2021-12-27T20:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T21:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T22:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-27T23:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T00:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T01:37:00.448Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T01:44:25.022Z INFO trust Scheduling next refresh {"after": "6h18m21.048507951s"}
2021-12-28T02:37:00.449Z INFO bandwidth Performing bandwidth usage rollups
2021-12-28T03:00:06.242Z WARN console:service unable to get Satellite URL {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "console: trust: satellite "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW" is untrusted", "errorVerbose": "console: trust: satellite "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW" is untrusted\n\tstorj.io/storj/storagenode/trust.(*Pool).getInfo:238\n\tstorj.io/storj/storagenode/trust.(*Pool).GetNodeURL:177\n\tstorj.io/storj/storagenode/console.(*Service).GetDashboardData:174\n\tstorj.io/storj/storagenode/console/consoleapi.(*StorageNode).StorageNode:45\n\tnet/http.HandlerFunc.ServeHTTP:2047\n\tgithub.com/gorilla/mux.(*Router).ServeHTTP:210\n\tnet/http.serverHandler.ServeHTTP:2879\n\tnet/http.(*conn).serve:1930"}
2021-12-28T03:37:00.448Z INFO bandwidth Performing bandwidth usage rollups
The only warning is about a dead external satellite; I’ve found mentions of it on Google too.
My guess:
the misconfigured email prevented me from getting warnings (even though my tests were added in July and all the suspensions were dated in October):
Unfortunately, email warnings about suspension are sent by the satellites when your node comes back online. This is a complicated issue, and we suggest not relying on our email warnings; use the [Tech Preview] Email alerts with Grafana and Prometheus instead.
Regarding the online score: it drops when a satellite could not contact your node for any reason. Each satellite tracks this independently, so your node may be online for one satellite and offline for another (for example, your firewall or your ISP’s firewall blocking incoming traffic from that satellite).
So I would recommend checking your firewall and your ISP’s firewall settings and logs, and making sure that they do not block traffic to/from your node for specific IPs or anything like that.
Thanks Alexey. The point about the email is just to warn that no email was set: my code was using $EMAIL but I set MAIL in the script, so I was sending a blank email address, which should probably be rejected. Of course, I could also have misspelled my email… But it’s a niche issue, as very few people will have written a startup script the way I did.
For the online score, I can guarantee you that it was over 95% for weeks after being banned. So it’s the combination of rebooting the node and the ban that led to it restarting at 0%. That’s when I found out.
Hello @Alexey & @Pac, I think there’s an issue with the online score returned by satellites. It’s somewhat related to my previous issue, so I’m using the same thread.
History of events:
no monitoring issues on audits (so all nodes report an onlineScore of at least 0.95); the checks run at 10:10 and 20:10 every day.
January 22 10:10: test of audits OK (at least 0.95)
January 22 20:51: server goes down (electrical issue at hosting company)
January 22 21:22: server is back online
January 22 21:40: external Storj test sends OK signal (Storj is up and running on the standard Storj port)
So, all in all, the downtime was less than one hour.
January 23 10:10: audit check is not OK: onlineScore 0.13269756959185705
January 24 10:11: onlineScore = 0.13269756959185705 (same?!)
January 25 10:10: onlineScore = 0.13269756959185705 (same!!)
January 26 10:10: onlineScore = 0.13269756959185705 (same!!)
January 27 10:10: onlineScore = 0.13627450980392156 (got 0.4% increase)
January 28 10:10: onlineScore = 0.16960784313725488 (got 3.3% increase)
For your information: these values are from europe-north-1.tardigrade.io:7777 (I always check all satellites, so the last one with a score below 0.95 is the one whose onlineScore gets reported).
One satellite was at 100% onlineScore the whole time: us2.storj.io:7777, which seems wrong too…
So what is going on? Can I provide more information about my node? It clearly went from 100% to 13% for 1 hour of downtime…
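Just to put that in perspective, here is a back-of-the-envelope estimate (it naively treats the online score as time-online over the 30-day window; the real score is based on audit contact windows, so this is only a rough approximation):

```python
# Naive estimate: treat the online score as time-online / total-time over
# the 30-day window. The real score is computed from audit contact windows,
# so this is only an order-of-magnitude sanity check.
WINDOW_HOURS = 30 * 24   # ~720 hours in the 30-day window
DOWNTIME_HOURS = 1       # roughly the outage on January 22

expected = (WINDOW_HOURS - DOWNTIME_HOURS) / WINDOW_HOURS
print(f"Naively expected online score: {expected:.4f}")   # ~0.9986
print("Observed online score:         0.1327")
```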
The online score is calculated over a 30-day window, so your node needs to get the offline window 30 days behind it; then it will recover back to 100%.
Every offline event will require the next 30 days to be online to recover.
I think it would be nice for node operators to have feedback faster than after 30 days.
Would it be possible to add an additional metric, maybe something simple like «Time since last failed online check»? Or, let’s say, keeping the score convention, compute the score itself from slightly weighted daily scores?
To show details of the reputation.
What I want to say is that these metrics were not made to be reactive.
Even usage is updated only once every 12 hours. The online score is calculated over a 30-day window; it’s designed that way.
See
It really could help if you submitted your own version of the blueprint. The team will review it, and then you could make the change.
Judging from this document’s level of detail, I’d have to know much more about the internals of the satellite code than I do now to suggest anything reasonable, sorry.
What I have in mind would essentially be to have:
By granting each individual window the same weight in the calculation of the overall average, the effect of any particularly unlucky period can be minimized while still allowing us to take the failures into account over a longer period.
modified to allow some small weighting by the window’s age. I don’t know how big the specific windows are, so I can’t suggest exact figures. If they’re daily, weights like 1 - (age_in_days / 30 / 10) (making the oldest window weigh 0.9) would already make recovery visible early on in the fractional part of the online score, while not changing the score values significantly enough to affect the suspension logic.
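To illustrate the idea, here is a small sketch comparing the current equal-weight average with the slightly age-weighted variant (it assumes daily windows and a single online ratio per window, which is certainly a simplification of what the satellites actually track):

```python
# Sketch of the suggested age-weighting (assumes 30 daily windows, each with
# an online ratio between 0 and 1; the real implementation tracks audit
# contact windows, so this only illustrates the shape of the idea).

def equal_weight_score(windows):
    """Current behaviour: plain average over the last 30 daily windows."""
    return sum(windows) / len(windows)

def age_weighted_score(windows):
    """Suggested variant: weight = 1 - (age_in_days / 30 / 10), so the
    oldest window counts for 0.9 and the newest for 1.0."""
    ages = range(len(windows) - 1, -1, -1)          # windows are oldest-first
    weights = [1 - (age / 30 / 10) for age in ages]
    return sum(w * v for w, v in zip(weights, windows)) / sum(weights)

# One fully-offline day, everything else online; watch the score as that
# bad window ages towards the edge of the 30-day horizon.
for age in (1, 10, 20, 29):
    windows = [1.0] * 30
    windows[29 - age] = 0.0
    print(f"offline day {age:2d} days old -> "
          f"equal: {equal_weight_score(windows):.4f}, "
          f"weighted: {age_weighted_score(windows):.4f}")
```

With equal weights the score stays flat at ~0.9667 until the offline day leaves the window entirely, whereas the weighted version creeps up a little every day, so recovery is visible well before day 30.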
My node was disqualified too, so I think I can’t get it back online as you said. Can you confirm that a disqualified node is disqualified for life or not?
And it’s a big issue: if I can’t know that I’m suspended/disqualified until I’ve restarted the node, that’s big trouble!
As I’ve stated, my online, suspension and audit scores were perfect, and after restarting, only the online score went to zero (the restart took 30 minutes, not 30 days…). So there’s a very big issue there!!
So you’re saying we have to restart our nodes every day to know whether our online score has taken a hit? That’s a no-go for me. And other nodes clearly show online score updates without being restarted.
Please, have a look at what went wrong, I can send you the full Node ID in a private message if needed.
Drop in online score != disqualification.
If your node is disqualified, it will not recover. Disqualification is permanent and not reversible, and it has nothing to do with downtime (unless your node was offline for more than 30 days).
Disqualification usually happens when your node has lost or corrupted data.
The suspension != disqualification.
The suspension can be applied in two cases:
Your suspension score is lower than 60%
Your online score is lower than 60%
The suspension score is affected when your node is online and answers audit requests, but returns unknown errors instead of pieces. If your node starts answering audit requests (GET_AUDIT and GET_REPAIR) normally, the suspension score will quickly recover with each passed audit. If your node answers with known errors like "file not found" or "disk i/o", or with corrupted pieces, it will affect the audit score instead.
If the audit score drops below 60%, your node will be disqualified.
The online score is affected when your node doesn’t answer audit requests at all. As soon as your node starts answering audit requests again, the online score will slowly recover. Full recovery requires 30 days online, and each downtime requires another 30 days to recover.
All scores are received from the satellites. If your node has been offline long enough to reset your online score to zero, it will be updated as soon as you manage to bring the node online again - it will receive the updated scores from the satellites. In your case, the online score is zero.
All three metrics are independent. Your node can be disqualified without its online score or suspension score being affected: if the audit score falls below 60%, the node will be disqualified.
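To summarize the rules described above in one place (a rough sketch only; the satellites evaluate these thresholds themselves, and the 60% values are the ones stated above):

```python
# Rough summary of the rules described above (illustrative only).
DQ_THRESHOLD = 0.60          # audit score below this -> disqualification (permanent)
SUSPENSION_THRESHOLD = 0.60  # suspension or online score below this -> suspension

def node_status(audit: float, suspension: float, online: float) -> str:
    if audit < DQ_THRESHOLD:
        return "DISQUALIFIED (permanent, not reversible)"
    if suspension < SUSPENSION_THRESHOLD or online < SUSPENSION_THRESHOLD:
        return "SUSPENDED (recoverable once the failing score climbs back)"
    return "OK"

# The three scores are independent; for example, a node can keep a perfect
# audit score while its online score is near zero:
print(node_status(audit=1.00, suspension=1.00, online=0.13))  # SUSPENDED ...
print(node_status(audit=0.55, suspension=1.00, online=1.00))  # DISQUALIFIED ...
print(node_status(audit=1.00, suspension=1.00, online=0.99))  # OK
```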
@Alexey thanks for the clarifications. But that doesn’t help at all, since my audit and suspension scores have NEVER gone below 95% (I monitor these every single day). As of now, it states:
So you’re saying that audit is working (as it’s 100% OK), but online is 0%… How can I have 0% online if audit is 100%?!
I’ll recreate the node since it has been disqualified. But there’s a HUGE issue on your end and you MUST fix it, or the audit and suspension scores are useless…
Node information:
ID 1E4A9cAijMMaL1DfD7VP6oYkqEoiHMeBskkuBeYG7ZMjukmMRt
Status ONLINE
Uptime 73h2m11s
            Available        Used      Egress     Ingress
Bandwidth         N/A         0 B         0 B         0 B (since Mar 1)
Disk        134.71 GB   415.29 GB