Dashboard storagenode log indicator

Suggestion short form: a health indicator in the dashboard which shows a score out of 100% based on an estimate of log errors over time. Nothing fancy, nothing complex, just a simple way to scan a full day's worth of logs in the blink of an eye.

Suggested Extension for Detailed Log Occurrence Overview
When clicking the log score, a log overview page is accessed.
This page would contain a counter for each type of log occurrence, with associated tips, fixes, or hyperlinks to forum posts dealing with the particular log type in question.

This could be further extended into a collective online database of full totals of log type occurrences and a count of how many nodes are affected by / posting the individual log type occurrences, so that Storj Labs engineers and interested SNOs can better problem-solve and compare issues across many/all nodes.
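
As a rough illustration of the occurrence counter (not actual storagenode code; the tab-separated log format and the grouping by logger + message are just assumptions on my part), something like this could produce the counts behind such a page:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Count ERROR lines grouped by "logger: message", so the overview page
	// could show "error type -> number of occurrences".
	counts := map[string]int{}
	scanner := bufio.NewScanner(os.Stdin) // e.g. `docker logs storagenode 2>&1 | logcount`
	for scanner.Scan() {
		fields := strings.SplitN(scanner.Text(), "\t", 5)
		if len(fields) < 4 || fields[1] != "ERROR" {
			continue
		}
		counts[fields[2]+": "+fields[3]]++ // e.g. "piecestore: download failed"
	}
	for msg, n := range counts {
		fmt.Printf("%6d  %s\n", n, msg)
	}
}
```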


rant and reasons below


I think most of us want to keep proper track of our storagenodes, but that’s far from easy…

As recently demonstrated by the whole orders.db debacle, a segment of SNOs ran their storagenodes with errors without noticing it,

an error that, if quickly solved, wouldn't have put Storj Labs nor SNOs in a difficult position. So to alleviate such issues in the future, I suggest what is basically a log indicator…

A simple 100% indicator, changing depending on how many errors there are in the log. I know this isn't very accurate and will require some tuning to work, but I'm fairly confident that with the correct exclusions and a very basic addition/subtraction model of cumulative errors vs. no or low errors over time (let's say a day), it will work as a great indicator of node health.
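
Just to make the model concrete, here is a minimal sketch of what I mean by addition/subtraction over time; every number in it is a placeholder that would need tuning:

```go
// Sketch of the score model: start at 100, subtract a bit for every counted
// error, and let the score recover during clean periods. All numbers are
// placeholders, not tuned values.
const (
	penaltyPerError = 0.5 // percentage points lost per counted error
	recoveryPerHour = 2.0 // percentage points regained per error-free hour
)

func updateLogScore(score float64, errorsThisHour int) float64 {
	if errorsThisHour == 0 {
		score += recoveryPerHour
	} else {
		score -= penaltyPerError * float64(errorsThisHour)
	}
	if score > 100 {
		score = 100
	}
	if score < 0 {
		score = 0
	}
	return score
}
```

The point is just that a bad day drags the score visibly below 100%, while a clean day slowly pulls it back up.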

I initially went with a colored indicator, but I figured it would make much more sense to group it with the other 100% scores (audit, suspension, uptime)… by simply adding one called "log".

P.S.
I initially had it configurable and color coded (loosely inspired by the colored log),
but after trying to write it down and thinking it through, I figured it would be better if all nodes ran on the same default; adding configuration to it would make the scores subjective or just plain wrong.
So it's better to spend some time making it work in the beginning, and then it will just work "across the board for all time".
And well, the 100% was obvious when I thought about it…

haven’t added detailed descriptions of the suggested extensions, but i think the conceptual idea is well covered in the short form at the beginning of this suggestion post.

I think an indicator is a good idea, but a percentage is not a good thing, because you either have errors in your log and need to fix something, or you don't.
So if the node throws an error, it should show up on the dashboard. And the easiest and most effective way to do that would be to just have a text field showing the last error messages (plus a "clear" button next to it so you can reset that field after fixing the errors).


Maybe you are right… but I was just thinking of using the errors in the logs over a day. There is always some number of errors, even if only a few each day, and maybe some stuff that shouldn't be an error is logged as one…

I dunno how else to make it simple; the more complex the feature, the longer it will take to build and the longer it will take to adjust…

I think the option of getting a list of the types of errors and how many times they have occurred, in a sub-menu/page accessed when clicking on the log score or something, might work great.

i’ve been running the color log for many many days and it’s often that there is red in it… like say i duno if the stefan benten satellite still pops up in there… did recently … it’s been a while since i’ve been using the color log, but when i did last like a little month or so ago it did… only rarely but it was still throwing an error here and there… even tho they did remove most of them, last time when people complained :smiley:

I think the face of the log overview should be very basic and something everybody already understands… so keeping it in line with the existing approach would, I think, work well.

A log score would also not be perfectly accurate, at least not at first… because stuff like lost connections also causes errors in the logs, which should either be excluded or simply absorbed by not being enough to tip the scale… but lost connections vary wildly between SNOs. Personally I get a 0.01% error ratio, but that's the best I've had thus far, and it's mainly lost connections… I suppose those should be excluded…

I like your idea though, it makes good sense; I just don't think it needs to be in-your-face info unless one clicks on the log score or such to access more detailed information.


If that’s still the case, it should be changed. An Error is something that indicates a problem that needs to be solved. Everything else is at most a warning.


Yeah, a lost connection, though in some cases essentially an error, might not be useful to log as an error since it gets in the way of actual errors… but I really can't say whether it needs to be logged like that or not… that's mostly a question of who needs to use the logs for what… it could maybe be moved to debug if it has to be registered as an "error", or, like you say, maybe logged as a warning…

but i’m sure the storj team has some ideas about why different stuff needs to be categorized in certain a certain way, and they could just be excluded from the count… but yeah maybe it would be better to just move them away.

I don’t think it will help that much. Or, at least, I think another approach might be easier for Storj engineers to implement (no need for new UX—from my experience it is a surprisingly difficult thing to get right) and more valuable: collecting some sampling of log entries weighted by severity (let say, random 10% of ERROR entries, 1% of WARN entries, 0.1% of INFO entries, etc.) from storage nodes in some central location, maybe satellites? Then Storj engineers can monitor the problems themselves, and react, potentially faster and more precisely than what an average SNO is capable of.

In other words, some simple telemetry.
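
As a sketch of what I mean (the sample rates are the ones above; everything else, including the function name, is made up rather than existing Storj code):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Sample rates per severity, matching the example rates above.
var sampleRate = map[string]float64{
	"ERROR": 0.10,  // 10% of errors
	"WARN":  0.01,  // 1% of warnings
	"INFO":  0.001, // 0.1% of info entries
}

// keepForTelemetry decides whether a single log entry of the given level
// would be included in the sample shipped to the central collector.
func keepForTelemetry(level string) bool {
	rate, ok := sampleRate[level]
	if !ok {
		return false // unknown levels are never sampled
	}
	return rand.Float64() < rate
}

func main() {
	// Roughly 1 in 10 ERROR entries would be kept.
	fmt.Println(keepForTelemetry("ERROR"))
}
```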


8000+ nodes. Telemetry. Manual monitoring by engineers.
I don’t think it’s possible.
The average stats are available already; such overhead doesn't make any sense.
The errors should remain at the source (the nodes) and be fixed, if possible, by the operator.


There are known fixes for certain errors. Notification of those events could be automated. The dashboard could give recommended tips to SNOs that are less technically inclined.

If you’ve ever used netdata, you know what i’m talking about.


This is an extension of the suggested indicator. @SGC, would you like to add it to the short part of the description?

Well, you can either train 2000+ storage node operators to set up monitoring and/or manually watch the dashboard/logs/etc., or have one Storj engineer set up proper monitoring with logstash/netdata/whatever.

A happy side effect will be that Storj itself will start feeling internal pressure on improving logs.

i don’t follow what you mean

@Toyoo suggested an extension to your log indicator.

So basically an extension of the extension kevink already suggested… I kinda thought about that also, but starting to have databases, hooking up to what comes out of the log overview, writing tips/info behind the errors…

Sure, in a sub-page that could be accessed when clicking on the indicator/log score,
it just seems like a lot of bells and whistles for an "indicator".

If it can write tips in regard to problems found in the log, it's not a big step from that to giving it the option to correct some problems where the problem has a known and manageable solution.
:smiley: Maybe add some GPT-3 deep learning while we're at it…

hehe i’ll do a rewrite for it, but i still hold what should be on the front dashboard should be a simple indicator, like those already existing…

What one gets access to when one clicks it, though cool… is kinda secondary, because the purpose of my idea was to make sure people don't miss errors in their logs for months…

i suppose it’s possible that an indicator cannot work, depending on how frequently problems arise and maybe a log occurrence list makes better sense, but i guess we will find out…

@KernelPanick yeah, netdata is pretty cool when stable; it always seems to crash for me, but I suppose I run it too long and too passively for it to be stable.

I think the idea here is basically what we also ended up with; I kinda didn't understand it at first…

As I see it, the idea seems to be sorting the log entries by occurrences, using an online database to get tips and such, and uploading the processed log data in simple form to that database, so that engineers at Storj Labs can go into the database and see… oh, we had 41 gazillion lost connections, 5 million gazillion transferred pieces, 3000 orders.db errors across 150 nodes… I guess we should look into that…

I kinda like that idea actually: having a collective database keeping track of log occurrences and the number of nodes, maybe even making it possible to find which nodes reported it, in case there is some kind of error that needs to be backtracked in the future.
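
Conceptually the collective database wouldn't need much per node; this is roughly the kind of record I imagine (field names are purely illustrative, not any existing Storj schema):

```go
// Illustrative shape of what a node could report and what the collective
// database would aggregate; this is not any existing Storj schema.
type LogTypeReport struct {
	NodeID      string // which node saw it, so issues can be backtracked later
	ErrorType   string // e.g. the message of an orders.db error
	Occurrences int    // how many times this node logged it in the period
}

type LogTypeAggregate struct {
	ErrorType        string
	TotalOccurrences int64 // summed across all reporting nodes
	AffectedNodes    int   // how many distinct nodes reported it
}
```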

But yeah… I'll most definitely rewrite my suggestion, but I still hold that the front page should just be a 100% log score, which could then be a link to the page with more options.

I don’t know if that would have helped with the orders issue.
AFAIK the orders get sent once every hour? That would make 1 error log entry in relation to hundreds of other successful log entries.
So a percentage in that case would have shown something like 0.01% errors and 99.99% successful. I doubt that would make SNOs look into their logs.
Also, there are many types of errors. A simple "download failed" error does not mean a lot, even if it occurs multiple times.

I think that the severities of some log entries should be adjusted.
For example (log entry shortened):
ERROR piecestore download failed … use of closed network connection", “errorVerbose”: "write tcp 172.17.0.2:28967->176.9.121.114:53398: use of closed network connection

This probably means that there was some kind of problem with internet connectivity between me and the customer which caused the TCP connection to be dropped. So, not really a big problem; also, not much I can do about it.
However, this has the same severity as "audit failed, file not found" and "database corrupt". If I made it so that "ERROR" log entries got emailed to me or something, I would be spammed by these and would probably just ignore them.

I guess I can make my script ignore these particular errors, but there are more of a similar type ("connection reset by peer", "tls: use of closed connection", "write: broken pipe", etc.). If I filter out all of those, I'm left with a single instance of "trust: rpc: context canceled", which probably also is not that bad.
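
For what it's worth, a minimal sketch of such a filter, assuming one just matches substrings of the error message (the pattern list is only the examples mentioned above):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Substrings that mark connection-related errors I would not want mailed to
// me; the list contains only the examples mentioned in this thread.
var benignPatterns = []string{
	"use of closed network connection",
	"connection reset by peer",
	"tls: use of closed connection",
	"write: broken pipe",
}

func isBenign(line string) bool {
	for _, p := range benignPatterns {
		if strings.Contains(line, p) {
			return true
		}
	}
	return false
}

func main() {
	// Pass only the remaining ERROR lines through, e.g. before mailing them.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "ERROR") && !isBenign(line) {
			fmt.Println(line)
		}
	}
}
```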

OK, I can do this, but I think that the severities should maybe be adjusted a bit :slight_smile:


Yeah, Storj Labs developers have a strange understanding of log levels. But at least it has improved, so there's hope for the future :slight_smile:

I think this would help enormously (and you mostly only need the UI team for that):


It would already help if it were possible to access the logs via the dashboard.


The files would be too big, with too much "noise" in there. People would never go through those lines and find the errors. You'd need a log file with only error level and above for that to make sense.


At least for Docker, shouldn't there be ready-made solutions already? I don't think the Storj container is the only one that requires some sort of log analysis.

This means that your router or your ISP's hardware is not good enough. It also means that you will have problems if you would like to start GE: if you have more than a 10% failure rate, the GE would likely fail.
The other problem: you have several missed potential customers for your node. It's better not to have such errors, especially if you see a lot of them (for example, my 3 nodes have had 4 such errors in total over the last two months).
For comparison, the raspi3 has 27 such errors.