Dashboard storagenode log indicator

Which is why I mention this in rant and reasons.

I’m aware that it cannot be a basic 1-to-1 mapping, because that would essentially drown out the errors, but the concept is pretty good… I already use it when running the colored log…
It gives me red for errors, yellow for cancelled transfers, and green for making money (downloads).
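For reference, my colored log is nothing more than something along these lines (assuming the node logs go through docker logs and the container is named storagenode; the colors and patterns are just my choices):

docker logs -f storagenode 2>&1 | awk '
  /ERROR/              { print "\033[31m" $0 "\033[0m"; next }   # red: errors
  /canceled|cancelled/ { print "\033[33m" $0 "\033[0m"; next }   # yellow: cancelled transfers
  /downloaded/         { print "\033[32m" $0 "\033[0m"; next }   # green: completed downloads
                       { print }'                                # everything else: default color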

Sure, it’s not 100% accurate, but it’s still accurate enough that I have spotted the stefan-benten satellite code remnant many times, which is pretty rare, to be fair… Also, many known errors such as context errors could be filtered out…

It’s an indicator; it just needs to indicate that there is something one should look at. It might not mean anything, aside from the light being broken or you being low on oil… but if you check it when it lights up, then once in a while it will save your engine.

But like I said, it will need some fine-tuning to be really practical.
Just like when I look at the colored logs: most of the time I have no idea what the individual entries mean… I just see colored lights that give me a rough picture of what’s going on…
It just needs to pass that along in a way where the true errors stick a bit more on the indicator.
If one wanted to make it really simple, one would just transfer the color-log approach to begin with and then add exclusions, or simply count errors over time…
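Counting errors over time is already a one-liner if you have the logs in a file (assuming the default log layout where the timestamp is the first field and the level the second; node.log is just a stand-in for wherever your log actually lives):

awk '$2 == "ERROR" { split($1, t, ":"); print t[1] }' node.log | sort | uniq -c   # errors per hour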

I think the right approach is to make a highly configurable indicator in the dashboard, most likely one that isn’t even wired into the dashboard logic itself: an indicator that shows yellow by default when nothing (a script or whatever controls it) is connected to it.

A control script would then read a configuration file containing log exclusions, a log coloring scheme, or severity numbers, so that it’s basically 100% configurable after it has been added to the dashboard… because for the first year or so of using this, there will be many modifications to cover all the oddities people see.

And so that people can reconfigure the indicator to help it find their particular issue…
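As a rough sketch of what such a control script could look like (the config format, file names, severity numbers, and patterns here are all made up for illustration, not anything that exists today):

cat > indicator.conf <<'EOF'
0 context canceled
0 use of closed network connection
2 file does not exist
1 ERROR
EOF

tail -F node.log | awk '
  NR == FNR { sev[substr($0, 3)] = substr($0, 1, 1) + 0; next }   # rules file: "<severity> <pattern>", 0 = exclude, 1 = yellow, 2 = red
  {
    worst = 0; excluded = 0
    for (p in sev) {
      if ($0 !~ p) continue
      if (sev[p] == 0) excluded = 1          # an exclusion rule matched this line
      else if (sev[p] > worst) worst = sev[p]
    }
    if (!excluded && worst > 0) {
      print worst > "/tmp/indicator.state"   # 1 = yellow, 2 = red; whatever drives the dashboard widget can poll this file
      close("/tmp/indicator.state")
    }
  }' indicator.conf -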
Hell, I could most likely make one myself… I should just take the Storj logo, make a few colored versions, and run an awk script that monitors the logs and switches the Storj logo image in the dashboard…

lol, that would be awesome… I might try to do that…
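Something as simple as this would probably do as a first pass (the image file names and the dashboard path are made up, and the patterns are just the ones from my colored log):

tail -F node.log | while read -r line; do
  case "$line" in
    *ERROR*)      img=logo_red.png ;;     # errors
    *canceled*)   img=logo_yellow.png ;;  # cancelled transfers
    *downloaded*) img=logo_green.png ;;   # successful downloads
    *)            continue ;;
  esac
  cp "$img" /var/www/dashboard/logo.png   # overwrite whatever image the dashboard actually serves
done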

Isn’t this error also reported if the problem is on the customer side, or if the client dropped the connection because it already got enough stripes from other nodes?

When the customer drops the network connection, it’s reported a little differently (“write: broken pipe”, “write: connection reset by peer”, and a few other similar messages).
But I think you could be right, because I see the same IP 176.9.121.114, and this is a test worker; maybe my Pi is not fast enough.

Can you please check: do you have both “network” and “i/o” in the same error?
My nodes do not have such errors, but I have seen them in the logs of other operators.

cat docker.00042.log | grep ERROR | grep i/o
empty result

In the same log (node uptime 134h) I have 339059 “downloaded” entries and 291 “download failed” entries (0.086%), all of which are one of the network error types. Some IPs are mentioned more often than others:

      3 5.12.184.95
      4 188.26.15.38
      5 5.12.167.105
     10 5.12.175.79
     14 5.12.146.130
     72 46.4.33.240
    163 176.9.121.114

The remaining 8 IPs are mentioned once each.
There are also 13 “tls: use of closed connection” errors and one “trust: rpc: context canceled” error.
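For reference, a per-IP count like the one above can be produced with something along these lines; the log file name is the same as before, and the assumption is that the remote address shows up after “->” in the error text:

grep 'download failed' docker.00042.log | grep -o -- '->[0-9.]*' | cut -c3- | sort | uniq -c | sort -n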

So I’d say it is more likely that the problem is at the other end, or I would have encountered more of these errors. I’m pretty sure my network gear and ISP can handle 10-12 Mbps of Storj traffic.

In case it’s useful, here is data from my logs (about half a year of logs from several nodes):

# <logs/* grep -a download.failed | sed -e 's/^.*error": "//;s/".*$//;s/[0-9]\+/X/g' | sort | uniq -c | sort -nr
  12061 write tcp X.X.X.X:X->X.X.X.X:X: use of closed network connection
  11761 write tcp X.X.X.X:X->X.X.X.X:X: write: broken pipe
  10064 tls: use of closed connection
    821 write tcp X.X.X.X:X->X.X.X.X:X: write: connection reset by peer
    225 trust: rpccompat: context canceled
     62 write tcp X.X.X.X:X->X.X.X.X:X: write: connection timed out
     56 context deadline exceeded
     51 trust: rpc: context canceled
     47 untrusted: unable to get signee: trust: rpc: dial tcp X.X.X.X:X: operation was canceled
     15 untrusted: unable to get signee: trust: rpc: dial tcp: operation was canceled
     12 file does not exist
      7 usedserialsdb error: context canceled
      5 order created too long ago: OrderCreation X-X-X X:X:X.X +X UTC < SystemClock X-X-X X:X:X.X +X UTC m=+X.X

And the same thing, limited to December:

 76 write tcp X.X.X.X:X->X.X.X.X:X: use of closed network connection
 64 write tcp X.X.X.X:X->X.X.X.X:X: write: connection reset by peer
 36 tls: use of closed connection
 26 write tcp X.X.X.X:X->X.X.X.X:X: write: broken pipe
 11 untrusted: unable to get signee: trust: rpc: dial tcp: operation was canceled
  9 trust: rpc: context canceled
  5 untrusted: unable to get signee: trust: rpc: dial tcp X.X.X.X:X: operation was canceled
  5 context deadline exceeded
  4 write tcp X.X.X.X:X->X.X.X.X:X: write: connection timed out

Yeah, and most of those don’t look overly important, and the counts are pretty low. Sure, there are a few that could be signs of something… but there seem to be so many errors that can pop up in the logs that it’s difficult for the uninitiated to tell.

So it’s better just to do an estimate…
Maybe, say, when a storagenode reboots, the health indicator is a bit more sensitive until it decides whether things are going well or badly… sort of disregarding or down-rating the pre-reboot health value.

So if it gets over a certain number of errors within, say, the first minute, it goes yellow, and if that persists it goes red. Oh wait, I forgot this was supposed to be a % score…
Maybe just copy the audit/online-score approach with windows… if the average for a window has more than a certain number of errors, it decreases the health score.

That would help give an accurate health indication after updates, which is often where issues start to pop up, even if they have existed for a while, lurking in the background waiting to be triggered.

It might actually be a pretty good approach to just copy the online-score window model… it makes the concept of the scores easier to learn in detail, and it should produce a fairly accurate score both short- and long-term…
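A minimal sketch of what that could look like, assuming the default storagenode log layout (timestamp first, level second); the window size, window count, and weighting are made-up numbers and have nothing to do with how the real audit/online scores are computed:

awk '
  {
    split($1, t, ":"); w = t[1]               # window key = date + hour, e.g. 2020-12-10T06
    total[w]++
    if ($2 == "ERROR") errs[w]++
    if (!(w in seen)) { order[++n] = w; seen[w] = 1 }
  }
  END {
    windows = 12                              # average the 12 most recent hourly windows
    start = (n > windows) ? n - windows + 1 : 1
    for (i = start; i <= n; i++) {
      w = order[i]
      sum += (total[w] - errs[w]) / total[w]  # fraction of clean lines in this window
      cnt++
    }
    if (cnt) printf "health score: %.1f%%\n", 100 * sum / cnt
  }' node.log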

These you may be able to fix by enabling connection reuse in your network configuration; then it shouldn’t really close connections, just recycle them once they have been idle for long… which helps reduce the maximum number of open connections and reduces network I/O.

I guess that also goes for the top-ranking one for December… it might even be more relevant there, because that one doesn’t seem to involve TLS; I suppose the TLS session may time out or whatever, which would make a connection null and void after a while… but dunno, just guessing.

I certainly like to run reuse in my TCP configuration; not sure there is any real detrimental effect to it.
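In case anyone wants to try it: I assume this refers to the Linux TIME-WAIT reuse setting; the sysctl itself is real, but whether it actually reduces these particular errors is just my guess, and the file name below is arbitrary:

sysctl -w net.ipv4.tcp_tw_reuse=1                                    # reuse TIME-WAIT sockets for new outgoing connections
echo 'net.ipv4.tcp_tw_reuse = 1' > /etc/sysctl.d/99-tcp-reuse.conf   # make it persistent across reboots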