"Online" indicators suddenly dropped

No, my logs were not empty on Friday.
It is impossible that it was without a connection, as it was online and at or near 100% for all the indicators. Dropping from 100% to 0% doesn’t look like a loss of connectivity: it happened within minutes, while I was connected remotely to the node (so the network was fine).
The logs can now only be seen back to yesterday evening, maybe because I had to restart Docker itself (I don’t really remember). It is a bit difficult because the start/stop sequence often doesn’t work, but the node seems to be working, as scores have been high for months. It is absolutely confusing to me.
Several times this year I have received a storm of emails telling me that my 2 nodes, in different and remote networks, were suspended. The solution was to ignore them and do nothing, as no reason could be figured out, and then receive emails telling me that I was OK again. I know it is a completely different matter, but what I mean is that as a farmer I usually don’t know what is happening.

Please note: if I restarted Docker, it was AFTER the drop from 100% to 0%. At the moment it happened, no drastic actions were taken, there were no cuts in connectivity, indicators were excellent… it just simply happened.

but it isn’t zero… so let’s say it dropped. because the dashboard doesn’t update unless you ask it to, it could look like it dropped instantly.

it’s most likely been without a connection for a while… alexey seems to think it should be fine… but let’s see if we can track down the problem… that stuff can be rather tricky, as you already seem to have figured out.

you should really have information in your logs, without information we cannot really help
the node will keep creating logs while online … ofc if you redirected your logs then the docker logs will be empty.

i doubt this is related… maybe if you do only manual updates

The online score falls when the satellite comes to audit your node but the node does not respond; such an audit is counted as offline and affects the online score. The satellite reports back to the node after a while (up to 12 hours later), so the node could have been offline or unresponsive some time ago.

The suspension score drops when your node answers an audit request but returns an error instead of the piece.
I would recommend checking the logs to figure out why it can’t pass the audit normally: https://support.storj.io/hc/en-us/articles/360042257912-Suspension-mode
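As a rough way to check the suspension side of this from the node's own log, here is a minimal sketch. It assumes audit transfers are logged with the action name `GET_AUDIT` (an assumption; the log lines quoted in this thread only show `GET` and `PUT_REPAIR`), and `audit_summary` is a made-up helper name. An audit request that never reached the node presumably leaves nothing in the node's log, so this is unlikely to explain a falling online score.

```shell
# Hypothetical helper: count audit downloads that succeeded or failed.
# Assumes audit lines carry "Action": "GET_AUDIT" (not confirmed in this thread).
audit_summary() {
  awk '
    /GET_AUDIT/ && /download failed/ { failed++ }
    /GET_AUDIT/ && /downloaded/      { ok++ }
    END { printf "audits ok: %d, failed: %d\n", ok, failed }
  '
}

# Usage (container name as used elsewhere in this thread):
#   sudo docker logs storagenode 2>&1 | audit_summary
```

If the failed count is not zero, the matching `ERROR` lines are the ones worth reading in full.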

@SGC I showed you a picture from the day after, that is why you don’t see zero, but it was 0% the day before.
Yes, I know the dashboard only shows fresh data if it is refreshed manually, but I had been refreshing it normally before; it wasn’t an old screen I had left open for ages and then refreshed. The 100% indicators were refreshed right before the 0%.
I agree with you that you can’t do much without log info, but as I said before, I checked the logs that day right after the mysterious episode and I didn’t see anything that caught my attention or was absolutely obvious.


but the online score shouldn’t recover until about a month after the event that dropped it…

ofc unless something else is wrong…

so let’s assume it’s a sort of connection issue…
hmmmm now i’m getting ideas for log analysis scripts :smiley:

anyways i think we should get you that successrate.sh script up and running…

it should be explained there… basically download it, do a chmod +x successrate.sh on it
and run it with ./successrate.sh

https://forum.storj.io/t/success-rate-script-now-updated-for-new-terminology-in-logs-after-update-to-0-34-6-or-later

run that and post the results, then let’s take it from there… it will tell us if there are any errors in your logs… and how many… there always seem to be some errors…

It will slowly rise back during the next 30 days online. If you have another offline event, it will need another 30 days from that event to fully recover.

if that’s audit based and i get audited at least every minute, wouldn’t that mean that i basically cannot update my storagenode, and my uptime score would never recover because i would always break at least one window when updating…

i mean when it’s busy it’s maybe less than 15 seconds between audits

from how i understand that, it would be exceedingly difficult to recover without skipping updates or something… and the larger the node the faster the audits…

It’s possible, but there are 9000 nodes to audit. It can audit each node no more than 9.6 times a day; to audit more frequently we use several workers. However, auditing every minute is too frequent.
The current design uses a 1-hour window.
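The 9.6 figure is consistent with one audit per second spread over 9000 nodes, since a day has 86400 seconds. The one-audit-per-second rate is an assumption here, not something stated above, and `audits_per_node_per_day` is a made-up name. A small sketch of the arithmetic:

```shell
# audits_per_node_per_day NODES AUDITS_PER_SECOND
# The per-second rate is an assumed input, not a documented satellite setting.
audits_per_node_per_day() {
  awk -v nodes="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", 86400 * rate / nodes }'
}

audits_per_node_per_day 9000 1   # prints 9.6
audits_per_node_per_day 9000 4   # prints 38.4, e.g. four workers at the same rate
```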

this was a semi-active day with 100 GB transferred in a 24 hour period.
doing 1759 audits / 24 = 73 audits an hour… and some people have 1/3 more data than me, and thus would see a 1/3 increase in audits, making it like 30 sec avg between audits.
and that’s on a semi-active day. if we get back to the days when we had like 300 GB transferred in a day, that would give 3 times the audits and thus push it down to maybe 10 sec for the biggest nodes around…
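The gaps quoted above are rough estimates. A short sketch that works them out from the 1759 audits in 24 hours, assuming audits are spread evenly over the day (`avg_gap_seconds` is a made-up name):

```shell
# avg_gap_seconds AUDITS HOURS: average seconds between audits,
# assuming they are spread evenly over the period.
avg_gap_seconds() {
  awk -v n="$1" -v h="$2" 'BEGIN { printf "%.1f\n", h * 3600 / n }'
}

avg_gap_seconds 1759 24                 # prints 49.1
avg_gap_seconds $((1759 * 4 / 3)) 24    # a node with 1/3 more audits
avg_gap_seconds $((1759 * 3)) 24        # three times the audits
```

The last two come out at roughly 37 and 16 seconds, so the estimates above are somewhat low but in the same range.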

i don’t really understand much about the uptime score tracking, but from how you are explaining it, then it sure sounds like there is an issue from what i see in the numbers on my node…

 ./successrate.sh sn1-2020-12-05.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            1759
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                1
Fail Rate:             0.003%
Canceled:              7
Cancel Rate:           0.020%
Successful:            35479
Success Rate:          99.978%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              44
Cancel Rate:           2.008%
Successful:            2147
Success Rate:          97.992%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            28282
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              1
Cancel Rate:           0.033%
Successful:            3044
Success Rate:          99.967%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            6562
Success Rate:          100.000%

ofc not really related to this guy’s issues… but i have difficulty imagining how one can get to zero :smiley:

and the error is stefanbenten btw :smiley: in case you were wondering

The results of the script don’t say much…

========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== DOWNLOAD ===========
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 0.000%
---------- accepted -----------
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== REPAIR UPLOAD ======
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 0
Success Rate: 0.000%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 0
Success Rate: 0.000%

Should I execute it in a particular path?

Try to run the script with sudo.

Thanks for that

========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 209
Success Rate: 100.000%
========== DOWNLOAD ===========
Failed: 12
Fail Rate: 0.260%
Canceled: 233
Cancel Rate: 5.049%
Successful: 4370
Success Rate: 94.691%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 100.000%
---------- accepted -----------
Failed: 4
Fail Rate: 0.037%
Canceled: 154
Cancel Rate: 1.441%
Successful: 10526
Success Rate: 98.521%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 0
Cancel Rate: 0.000%
Successful: 576
Success Rate: 100.000%
========== REPAIR UPLOAD ======
Failed: 4
Fail Rate: 0.030%
Canceled: 33
Cancel Rate: 0.245%
Successful: 13408
Success Rate: 99.725%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 3056
Success Rate: 100.000%


well that looks fine, though it does kinda make me wonder how long a period the log covers…
since you got 200 audits, and i had close to 1800 yesterday in a 24 hour period… so your full log has like 1/9 the audits… which seems kinda low if it was a 24 hour period, and in theory your count should be… well, much higher, because the log would cover a much longer period than 24 hours…
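To see what period a log actually covers, one option is to print its first and last timestamps. A minimal sketch, assuming every log entry starts with an ISO-style timestamp like the entries quoted in this thread; `log_span` is a made-up helper name:

```shell
# Hypothetical helper: print the first and last timestamp seen on stdin.
# Lines that do not start with a YYYY-MM-DDT timestamp are ignored.
log_span() {
  awk '
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ {
      if (first == "") first = $1
      last = $1
    }
    END { if (first != "") print first, last }
  '
}

# Usage:
#   sudo docker logs storagenode 2>&1 | log_span
```

Dividing the audit count by the number of hours between the two timestamps gives a rate that can be compared with the 73 audits an hour mentioned earlier.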

but no matter i’m sure we will figure out why that is…

i don’t suppose you’ve got any sort of monitoring running for your connection / connections to verify that it’s actually stable?

if you go into the node and just do a ping www.google.com
and leave it running for a day or a few hours or two days… usually i start around the 1hr mark, if it still looks stable i go for the long one.

then ctrl + c to stop it, and it should print a summary of how many packets were lost and what the avg ping was and such…

the ping should be below 80-100… maybe even down into the 10-15ms ranges depending on what kind of connection, it should be stable and not dropping any packets.

but yeah do a ping and see if it tells us something… it must run at least 1 hour. sometimes one can see it right away tho… and though 1 hour should be enough… it isn’t always… but it does become tricky to use ping for very extended tests… since it’s more of a rough measure.

but it’s a place to start without too much fuss
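For a bounded run instead of stopping by hand, `ping -c 3600` sends 3600 probes, which is about an hour at ping's default one-second interval. A small sketch for pulling the loss figure out of the summary, assuming the summary contains text of the form `N% packet loss` (as Linux iputils prints it); `packet_loss` is a made-up helper name:

```shell
# Hypothetical helper: extract the packet-loss percentage from ping's summary.
# Assumes a summary line containing "<number>% packet loss".
packet_loss() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage:
#   ping -c 3600 www.google.com | packet_loss
```

Anything other than 0 here would be worth a longer test.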

It is a connection that is not professionally monitored but is used continuously, so any malfunction would have been noticed for sure.
But I will try the ping for one day or more and get back to you.
Thanks!

a working connection can still be unstable. some things might barely notice it, while others can be very sensitive to it… live stuff like video calls or VoIP is where people usually notice minute issues with their connections… so if you do extended calls like that without any issues, then it’s most likely stable…

at least your node seems in good health and the success rates seem good

Hi again

I come with some more information… but even more confused than before :slight_smile:
After leaving it running for one day, the dashboard shows “offline” status.
But the logs show healthy traffic any time I look. Can it be offline but still exchanging pieces like this?

2020-12-08T16:56:28.933Z	INFO	piecestore	downloaded	{"Piece ID": "A727EZTULHDASQPJ35B4Q4VZY7AYNNRTWENUTXWRCHK37P6IAEYQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}

At the same time I find some errors if I try this command:

sudo docker logs storagenode 2>&1 | grep "error"

Like, for instance, this one:

2020-12-06T01:25:38.884Z	ERROR	piecestore	upload failed	{"Piece ID": "UBBHFYXPBTP7EGLFALVBGYHXGNF7P33RRXDIDQNHYI6CKWAXRSXA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT_REPAIR", "error": "tls: use of closed connection", "errorVerbose": "tls: use of closed connection\n\tstorj.io/drpc/drpcstream.(*Stream).RawFlush:287\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:325\n\tstorj.io/common/pb.(*drpcPiecestoreUploadStream).SendAndClose:1064\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:407\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

or

2020-12-06T11:19:27.988Z	ERROR	piecestore	download failed	{"Piece ID": "DKXXYPTSOKOKZBU66PVT4YIPVEGGKAGELTKHXDIAXKMTU7MTU6EQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "error": "write tcp ((some-ip)):28967->((some-other-ip)):37192: use of closed network connection", "errorVerbose": "write tcp ((some-ip)):28967->((some-other-ip)):37192: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).pollWrite:228\n\tstorj.io/drpc/drpcwire.SplitN:29\n\tstorj.io/drpc/drpcstream.(*Stream).RawWrite:276\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:322\n\tstorj.io/common/pb.(*drpcPiecestoreDownloadStream).Send:1089\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:580\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
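With many such lines, a per-action tally can be easier to read than the raw grep output. A minimal sketch, assuming the log format of the two entries above (a level column containing `ERROR` and an `"Action"` field in the JSON part); `error_actions` is a made-up helper name:

```shell
# Hypothetical helper: count ERROR lines per "Action" value.
# Prints one "ACTION COUNT" pair per line, sorted by action name.
error_actions() {
  grep -E '[[:space:]]ERROR[[:space:]]' \
    | sed -n 's/.*"Action": "\([A-Z_]*\)".*/\1/p' \
    | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage:
#   sudo docker logs storagenode 2>&1 | error_actions
```

A handful of failed `GET` or `PUT_REPAIR` transfers is in line with the small fail rates the success rate script reported earlier in the thread.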

In general, during this year, the node sometimes seems to be offline, then I restart it, but the indicators are always near 100%… really, it doesn’t make any sense to me.

EDIT: 10min after what I told you, dashboard shows “online” ¿?

EDIT2: Now I see some suspension notification…

You have 11 pages; that isn’t a good sign. It means you have a few issues going on with your node.

From page 5 onward it is blank.
And the notifications seem to start from “yesterday” until 1 hour ago.
So problems seem to start after I started this thread… no idea what is going on…

And how is it possible that I am suspended, but online, but with 100%/100%?

I think you can be online if your node responds, but suspension isn’t tied to being online; it’s about whether the node is able to return the files or not. Either something is blocking it, or your network or internet can’t handle it.