"Online" indicators suddenly dropped

Hi Team,
On Friday I was checking my node and everything seemed to be going quite well. My audit / suspension / online indicators were almost all at 100%, and some at 99.xx%.
Later the same day I looked again and they were at 0%. I assumed it was some kind of glitch and left it to check again the next day. Then it turned out to be only barely recovering, as if the previous day's drop had actually taken effect:


The node has certainly not been offline, as I was accessing it remotely the whole time, and as far as I can tell there were no incidents on that network.
Any idea?
Something about scheduled downtimes? Should I worry? I don’t really know how a low “online” indicator can affect the node’s reputation.


It should take 288 hours offline for the online score to drop anywhere even close to that…
and it shouldn’t be able to happen instantly, afaik.

I would start by checking your logs to see if the storagenode is actually working; this could be a sign of something else being wrong rather than the internet connection being bad.

Post some of the logs so we can review them. If this is an internet issue the logs won’t tell us much about it… but it’s a good place to start, as your numbers are very weird afaik,
so we want to make sure there isn’t anything really bad going on.

What system are you on?
Maybe try running successrate.sh, or something like grep to look for errors; can’t remember where the storagenode default log location is tho… and this is only applicable on Linux I think.

But check your logs and post some of them, especially the areas with errors if there are any…
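
For example (a sketch; this assumes you run the node in docker with the default container name storagenode), something like this would pull the most recent error lines:

# the node logs go to stderr, so merge the streams before grepping
docker logs storagenode 2>&1 | grep -i error | tail -n 20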

In case you are using docker you can export the log with something like this:

docker logs --since "$(date -d "$date -1 days" +"%Y-%m-%d")" --until "$(date +"%Y-%m-%d")" storagenode >& /tmp/"$(date -d "$date -1 days" +"%Y-%m-%d")"-storagenode.log

The log file will be placed in /tmp and covers the last full day; you might want to process more of it.

Then you simply remove the --since and --until parts but leave
storagenode >& /tmp/"$(date -d "$date -1 days" +"%Y-%m-%d")"-storagenode.log

maybe call it something else like

"$(date -d "$date -1 days" +"%Y-%m-%d")"-storagenode-full.log

so it doesn’t overwrite the previous one … ofc only relevant if you are using docker…
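
Putting those pieces together, a full-log export would look something like this (just a sketch, assuming docker and the default container name; here the file is simply dated with today's date):

# export everything docker still holds for the container into a dated file under /tmp
docker logs storagenode >& /tmp/"$(date +"%Y-%m-%d")"-storagenode-full.log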

This score is calculated from audit checks over a 30-day window; if your node is old enough, it should take 288 hours of offline time before the score drops below 60%.
But if your node is new, any offline time will affect the online score greatly.
So, I think your node is new and was offline for a few hours.
Keep it online, and over the next 30 days online the online score will recover.
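
As a rough back-of-the-envelope check (my own sketch of the idea, not the exact satellite formula): with a 30-day window of 720 hours, the score is roughly the share of that window the node was reachable, so 288 hours offline lands right at 60%:

# rough illustration only: score ~ online hours / total hours in the 30-day window
offline_hours=288
window_hours=$((30 * 24))                        # 720
online_hours=$((window_hours - offline_hours))   # 432
echo "scale=1; 100 * $online_hours / $window_hours" | bc   # prints 60.0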

Thanks for your comments…
My node is actually the docker version on Raspbian.
It is a 1-year-old node, so it is not new.
I can’t see anything weird in the logs; in fact I can’t see anything in the logs for Friday at all. Maybe that is because I think I restarted it (that shouldn’t mean a flush of the logs though). I was definitely checking the logs when it happened but I couldn’t figure it out; I don’t really know what I should look for.
successrate.sh simply cannot be found on my system as a command.
My network and node were always up, so there is no way it was a network issue.
I think it is an odd error, and somewhere those counters were accidentally set to zero.

A restart doesn’t clear the logs; they will only be cleared if you do a docker rm storagenode or when watchtower updates your storagenode.

So if your logs are empty your storagenode doesn’t have a connection… maybe the port forwarding in your router got lost, your DDNS doesn’t work, or something…

I would do a
docker stop storagenode
docker start storagenode
docker logs --tail 20 storagenode --follow

That should give you the boot sequence of the storagenode.
The last command shows you the log live; --tail 20 makes it go back 20 lines, and you can set it higher if need be…

ctrl+c to exit

And then I would check anything that’s simple for you to check.
Verify the setup basically… if the boot sequence doesn’t give you any good hints.
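
One quick sanity check for the DDNS + port forwarding part (a sketch; your.ddns.example.com is a placeholder for your own address, and 28967 is the usual default external port, so change it if yours differs):

# does the DDNS name still resolve?
nslookup your.ddns.example.com
# is the node port reachable? best run from a machine outside your LAN
nc -zv your.ddns.example.com 28967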

No, my logs were not empty on Friday.
It is impossible that it was without a connection, as it was online and at or near 100% for all the indicators. Dropping from 100% to 0% doesn’t look like a loss of connectivity; it happened in minutes, while I was connected remotely to the node (so the network was fine).
The logs can now only be seen back to yesterday evening. Maybe that is because I had to restart Docker itself (I don’t really remember). It is a bit difficult because the start / stop sequence often doesn’t work, but the node seems to be working, as the scores have been high for months. It is absolutely confusing to me.
A few times this year I have received a storm of emails telling me that my 2 nodes, on different and remote networks, were suspended. The solution was to ignore them and do nothing, as no reason could be figured out, and later I would receive emails telling me that I was OK again. I know it is a completely different matter, but what I mean is that as a farmer I usually don’t know what is happening.

Please note: if I restarted Docker, it was AFTER the drop from 100% to 0%. At the moment it happened, no drastic actions were taken, there were no cuts in connectivity, the indicators were excellent… it just simply happened.

But it isn’t zero… so let’s say it did drop; because the dashboard doesn’t update unless you ask it to, it could look like it dropped instantly.

It’s most likely been without a connection for a while… Alexey seems to think it should be fine… but let’s see if we can track down the problem… that stuff can be rather tricky, as you already seem to have figured out.

You should really have information in your logs; without information we cannot really help.
The node will keep creating logs while online… ofc if you redirected your logs then the docker logs will be empty.
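
If you are not sure whether the logs were redirected, you can check the node’s config for that (a sketch; the config.yaml path is a placeholder for wherever you mapped the node’s config directory in your docker run command):

# if log.output points at a file instead of stderr, docker logs will stay empty
grep -n "log.output" /path/to/storagenode/config.yaml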

I doubt this is related… maybe if you only do manual updates.

The online score falls when the satellite comes to audit your node but the node does not respond; such an audit is considered as offline and affects the online score. The satellite reports back to the node only after a while (up to 12 hours later), so the node could have been offline or unresponsive some time ago.

The suspension score drops when your node answers the audit request but returns an error instead of the piece.
I would recommend checking the logs to figure out why it can’t pass the audit normally: https://support.storj.io/hc/en-us/articles/360042257912-Suspension-mode
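
To look specifically for failed audits, something along these lines should work (a sketch; it assumes the docker setup with the default container name, and that audit transfers are tagged GET_AUDIT in the node log, which is my assumption about the log format):

# show recent audit lines that mention an error or failure
docker logs storagenode 2>&1 | grep GET_AUDIT | grep -iE "error|failed" | tail -n 20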

@SGC I showed you a picture from the day after; that’s why you don’t see zero, but it was 0% the day before.
Yes, I know the dashboard will only show fresh data if it is refreshed manually, but I was refreshing it normally beforehand; it wasn’t an old screen I had open for ages and then refreshed. The 100% indicators were from a refresh right before the 0%.
I agree with you that you can’t do much without log info, but as I said before, I checked the logs that day right after the mysterious episode and I didn’t see anything that caught my attention or was obviously wrong.


But the online score shouldn’t start rising again until a month after the event that dropped it…

ofc unless something else is wrong…

So let’s assume it’s some sort of connection issue…
hmmmm, now I’m getting ideas for log analysis scripts :smiley:

Anyway, I think we should get you that successrate.sh script up and running…

It should be explained there… basically download it, do a chmod +x successrate.sh on it,
and run it with ./successrate.sh

https://forum.storj.io/t/success-rate-script-now-updated-for-new-terminology-in-logs-after-update-to-0-34-6-or-later

Run that and post the results, then let’s take it from there… it will tell us if there are any errors in your logs… or how many… there always seem to be some errors…
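
For reference, the whole sequence is roughly this (a sketch: the wget URL below is a placeholder, the real raw-file link is in the thread above, and the log-file variant matches how the script is used later in this thread):

# download the script (use the raw link from the thread above), make it executable, run it
wget -O successrate.sh <raw-url-of-successrate.sh-from-the-thread-above>
chmod +x successrate.sh
./successrate.sh
# or point it at an exported log file:
./successrate.sh /tmp/2020-12-05-storagenode.log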

It will slowly rise back over the next 30 days online. If you have another offline event, it will need another 30 days from that event to fully recover.

If that’s audit based and I get audited at least every minute, wouldn’t that mean that I basically cannot update my storagenode, and my uptime score would never recover, because I would always break at least one window when updating…

I mean, when it’s busy it’s maybe less than 15 seconds between audits.

From how I understand that, it would be exceedingly difficult to recover without skipping updates or something… and the larger the node, the more frequent the audits…

It’s possible, but there are 9000 nodes to audit, so it can audit each node no more frequently than about 9.6 times a day; to audit more frequently we use several workers. However, auditing every minute is too frequent.
The current design uses a 1-hour window.
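
For what it’s worth, the 9.6 figure looks like it could simply be seconds-in-a-day divided by node count, i.e. roughly one audit completed per second per worker; that derivation is my own guess, not something stated above:

# my assumption only: ~1 audit per second per worker, spread over ~9000 nodes
seconds_per_day=86400
nodes=9000
echo "scale=1; $seconds_per_day / $nodes" | bc   # prints 9.6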

This was a semi-active day with 100 GB transferred in a 24-hour period.
That’s 1759 audits / 24 = about 73 audits an hour… and some people have 1/3 more data than me, and thus would see a 1/3 increase in audits, making it something like 35-40 seconds between audits on average.
And that’s on a semi-active day; if we get back to the days when we had like 300 GB transferred in a day, that would give 3 times the audits and push it down to maybe 10 seconds for the biggest nodes around…

I don’t really understand much about the uptime score tracking, but from how you are explaining it, it sure sounds like there could be an issue, judging from the numbers I see on my node…

 ./successrate.sh sn1-2020-12-05.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            1759
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                1
Fail Rate:             0.003%
Canceled:              7
Cancel Rate:           0.020%
Successful:            35479
Success Rate:          99.978%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              44
Cancel Rate:           2.008%
Successful:            2147
Success Rate:          97.992%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            28282
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              1
Cancel Rate:           0.033%
Successful:            3044
Success Rate:          99.967%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            6562
Success Rate:          100.000%

ofc not really related to this guy’s issues… but I have difficulty imagining how one can get to zero :smiley:

and the one error is on stefanbenten btw :smiley: in case you were wondering

The results of the script don’t say much…

========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== DOWNLOAD ===========
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 0.000%
---------- accepted -----------
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== REPAIR UPLOAD ======
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 0
Success Rate: 0.000%

Should I execute it from a particular path?

Try to run the script with sudo.

Thanks for that

========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 209
Success Rate: 100.000%
========== DOWNLOAD ===========
Failed: 12
Fail Rate: 0.260%
Canceled: 233
Cancel Rate: 5.049%
Successful: 4370
Success Rate: 94.691%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 100.000%
---------- accepted -----------
Failed: 4
Fail Rate: 0.037%
Canceled: 154
Cancel Rate: 1.441%
Successful: 10526
Success Rate: 98.521%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 576
Success Rate: 100.000%
========== REPAIR UPLOAD ======
Failed: 4
Fail Rate: 0.030%
Canceled: 33
Cancel Rate: 0.245%
Successful: 13408
Success Rate: 99.725%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 3056
Success Rate: 100.000%


Well, that looks fine. It does kinda make me wonder how long a period the log covers…
Since you got about 200 audits, and I had close to 1800 yesterday in a 24-hour period… your full log has like 1/9 the amount of audits… which seems kinda low if it covers a 24-hour period, and in theory it should be much higher, because your log should span a much longer period than 24 hours…

But no matter, I’m sure we will figure out why that is…

I don’t suppose you have any sort of monitoring running for your connection(s) to verify that it’s actually stable?

If you go onto the node and just do a ping www.google.com
and leave it running for a few hours, a day, or two days… usually I start with around the 1-hour mark, and if it still looks stable I go for the longer run.

Then ctrl + c to stop it, and it should print a summary of how many packets were lost, what the avg ping was, and such…

The ping should be below 80-100 ms… maybe even down in the 10-15 ms range depending on what kind of connection; it should be stable and not dropping any packets.

But yeah, do a ping and see if it tells us something… it must run for at least 1 hour. Sometimes one can see the problem right away tho… and 1 hour should usually be enough… it isn’t always… but it does become tricky to use ping well for very extended tests, since it’s more of a rough measure.

But it’s a place to start without too much fuss.
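
If you want a record of the run rather than just the final summary, something like this keeps the output in a file while still printing the summary on ctrl + c (a sketch; the file names are just examples):

# log the whole ping run so packet loss / latency can be reviewed afterwards
ping www.google.com | tee /tmp/ping-$(date +%F).log
# or a fixed ~1 hour test (one probe per second):
ping -c 3600 www.google.com | tee /tmp/ping-1h.log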

It is a connection that is not professionally monitored, but it is used continuously, so any malfunction would surely have been noticed.
But I will try the ping for one day or more and get back to you.
Thanks!