ERROR MESSAGES! for the love of god.. simple error messages... please

SGC · January 26, 2021, 7:24pm

Error messages for the love of god, simple error messages please…
it would be wonderful if we could get some sort of log translation tool or clear error messages in log lines.

usually it’s just an insanely long stream of confusing text which, even with some time spent, we basically cannot decode…

it would be very practical if there simply was an error message… so like it would tell us if it’s a real problem or just a connection being dropped.

doesn’t even have to be advanced, just good enough that one can sort out the errors one is supposed to just ignore… like dropped connections or latency or whatever causes them.

not sure if anyone has suggested something like this yet.

Edit by @kevink:
These error messages should have a different log level:

download failed & upload failed with closed networks and similar → INFO should be enough since it’s part of the normal operation of a node, WARN at most.
graceful exit satellite not found error if you GEed on stefan-benten → WARN. It’s understandable that this is an error because of how it is supposed to work but it makes looking for real errors difficult. So until this is fixed, the error message should be a WARN
Unexpected shutdown of runner → WARN since that is just due to stopping the container. It could be argued that this is an acceptable error message; however, it doesn’t indicate that something is broken or needs to be fixed. It just informs that the runner was shut down unexpectedly, but it will continue fine after a restart. I understand that the runner probably doesn’t know the difference between being killed and the node being stopped/restarted.
fatal error message you sometimes get when restarting/stopping/updating your node. Shouldn’t happen (as discussed in a different thread) but still does happen. And therefore it is annoying, because a fatal error is something that breaks your node but in this case you actually have to ignore a fatal error because it is “normal”.
I’m not sure what log level “upload rejected” is but it should be a WARN.
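
Until the levels are changed in the node itself, the list above could be applied as a post-processing step on the log. This is only a rough sketch: it assumes the level appears in each line as a bare word such as ERROR, and the substrings in `RULES` are guesses at the wording, not the node's actual messages. The fatal error on restart is left out because no target level was suggested for it.

```python
import re

# (substrings that must all appear in the line, suggested level).
# These substrings are assumptions about the message wording.
RULES = [
    (("download failed", "closed network"), "INFO"),
    (("upload failed", "closed network"), "INFO"),
    (("satellite not found",), "WARN"),
    (("unexpected shutdown", "runner"), "WARN"),
    (("upload rejected",), "WARN"),
]

# Assumes the log level shows up as a standalone upper-case word.
ERROR_TOKEN = re.compile(r"\bERROR\b")


def reclassify(line: str) -> str:
    """Swap ERROR for the suggested level when a rule matches; otherwise return the line unchanged."""
    if not ERROR_TOKEN.search(line):
        return line
    lowered = line.lower()
    for needles, level in RULES:
        if all(needle in lowered for needle in needles):
            # Replace only the first ERROR token so the message text stays intact.
            return ERROR_TOKEN.sub(level, line, count=1)
    return line
```

Lines that match no rule, or that are not ERROR to begin with, pass through untouched, so real errors stay visible.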

Appreciate that the new “Setup” option is there to prevent nodes starting when pointed at the wrong location, but it seems to be creating its own issues, with SNOs not expecting to see this new command when expanding their number of nodes.

Would be great if the error generated:

Error: Error starting master database on storagenode: group:
— stat config / storage / blobs: no such file or directory
— stat config / storage / temp: no such file or directory
— stat config / storage / garbage: no such file or directory
— stat config / storage / trash: no such file or directory

could also include something like:
Error: Error starting master database on storagenode: group: - If this is a new node, make sure you have run the setup command first to create config.
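
As a rough illustration of the idea, a hint could be attached whenever the error text mentions a missing file or directory. This is a sketch under assumptions: `with_setup_hint` is a made-up helper, not something the node provides, and it simply appends the hint on a new line.

```python
# Hint text taken from the suggestion above.
SETUP_HINT = (
    "If this is a new node, make sure you have run the setup command "
    "first to create config."
)


def with_setup_hint(error_text: str) -> str:
    """Append the setup hint when the error looks like a missing directory."""
    if "no such file or directory" in error_text:
        return f"{error_text}\n{SETUP_HINT}"
    return error_text
```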

kevink · January 26, 2021, 7:55pm

There are a few log messages that shouldn’t be ERRORs but otherwise they are rather clear to me. The end of each message is a function traceback, nothing that is relevant to a normal SNO and you certainly don’t need to understand that part if you’re not going to look at the code.

In my log exporter dashboard I even excluded the wrongly classified errors from the error count (download failed, upload failed, etc).
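
Roughly, that exclusion amounts to something like the following. It is a simplified sketch and not the exporter's actual code: it assumes the level is a whitespace-separated word in each line, and the ignored substrings are just the two examples mentioned here.

```python
from typing import Iterable

# Messages logged as ERROR that are treated as normal operation.
IGNORED_MESSAGES = ("download failed", "upload failed")


def count_relevant_errors(lines: Iterable[str]) -> int:
    """Count ERROR lines, skipping the ones known to be harmless."""
    count = 0
    for line in lines:
        if "ERROR" not in line.split():
            continue  # not an ERROR line at all
        if any(message in line for message in IGNORED_MESSAGES):
            continue  # wrongly classified, ignore it
        count += 1
    return count
```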

Toyoo · January 26, 2021, 8:02pm

Me, I think twice. Always happy to upvote though!

SGC · January 26, 2021, 8:15pm

had one upon booting my nodes after update, something untrusted something.
i don’t get why there just can’t be some basic classification, so one doesn’t have to understand them, but simply can understand if it’s important or just something that is not responding or new / semi new lines that bug out because it’s not fully implemented yet.

feels like such a big waste of time that every time i see a new type of closed connection type i have to go to the forum and start digging for what it means and usually it doesn’t matter…

sure i can sort of guesstimate that it’s not important since it does it at boot a few times and then runs smoothly with no errors after that.
but what if it was actually a dangerous error i should pay attention to…

it just pisses me off and it’s been like this forever, granted i could learn exactly how to read the log lines, which i sort of feel i can… but every now and then something new pops up…

i know the developers read this stuff with ease… but the rest of us don’t and to be fair it’s wasting our collective time every time 500 of us have to go track down an error that means nothing just because it’s not clearly marked.

kevink · January 26, 2021, 8:19pm

There is. It’s called log levels. And it would work great if all log messages had been classified correctly.

They did correct some wrongly classified error messages but there are still too many…

SGC · January 26, 2021, 8:22pm

not sure they even tried… i mean it’s so terrible, i cannot believe much effort was put towards it.
yeah log levels are great… when they work.

kevink · January 26, 2021, 8:33pm

Probably not much work but still some work. In the old days (maybe you weren’t around at that time yet) there were a lot of ERROR messages. It was completely worthless imho but they changed it. Now you mainly need to ignore download failed & upload failed with closed networks, and the graceful exit satellite not found error if you GEed on stefan-benten. Also unexpected shutdown of runner, since that is just due to stopping the container, but it could be argued that this is an acceptable error message.
Then there is the fatal error message you sometimes get when restarting/stopping/updating your node. That one shouldn’t be.
Maybe there are a few other minor messages but that’s the majority of messages you need to ignore.

SGC · January 26, 2021, 8:46pm

it’s like traffic signs… there is a reason they don’t have much text on them…
some people don’t know how to read or have trouble doing so…
there should be clear markers in the logs.

Alexey · January 27, 2021, 6:43am

You also can search here, on the forum.
The Community even tried to collect them:

SGC · January 27, 2021, 10:02am

you say that like it’s a good thing, it just goes to prove how absurd the issue is.
basically needs a translator to see if the log says the node is okay or not…

Toyoo · January 27, 2021, 8:27pm

Sorry, but this is missing the point. It’s about not having to waste human time on decoding log entries, not to waste more time doing so.

Storgeez · January 27, 2021, 10:36pm

I think Storj is one of the companies that is winning by having verbose error messages - nothing worse than something like “Error 0x00000181B”. Error messages should always have the actual errors in them - what is the purpose of saving space? Do you want somebody to bring your program to a service center to fix it? They CAN have an error code in front of the message for example, can’t say that would hurt anything, could be helpful. They should not omit the actual error message though. I think the current system is great.

Perhaps somebody with more experience would have a better system to do this, but anything that doesn’t tell you what is wrong without additional information/effort is bad.

SGC · January 27, 2021, 10:52pm

that’s a good point

i suppose i wasn’t really thinking error codes so much as the clustering of errors types they bring.
should rephrase it a bit.

yeah i was thinking something like the log level system, just that it was actually used.
or like you say a code in the beginning of the error definition that just classifies them into groups so people can actually easily read if an error can be safely ignored or if it’s an OMG IT’S KEALING ME!!! CRITICAL! CRITICAL!

the sad thing is that it’s basically already existing, but it’s basically not used correctly.
maybe they think it’s difficult to put stuff into it, so adding some simple little error message to the front might be an easier sell.

there fixed it, and added a quote to explain the problem a bit better

Storgeez · January 28, 2021, 1:38am

Yes, definitely the message levels should be adjusted. They’ve changed almost all of these error messages before, but there’s more left, basically everything you commonly see should be an info message - cancellations (due to tail end), successes, notes about node status (or those could be status messages). It should be refactored to be more logically structured. But the text part of the message is good.
Also the download successful entry could be for example “0x3A download successful”, by all means, add codes, just keep the verbose part. Helps everybody, easier to handle by software, easy to be read by less technical humans.
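
For the software side, a code in front of the text is trivial to split off. A small sketch, assuming the code is a hex number followed by a space as in the “0x3A download successful” example; the format itself is hypothetical, nothing like it exists in the logs today.

```python
import re
from typing import Optional, Tuple

# Hypothetical format: a hex code, whitespace, then the verbose text.
CODE_PREFIX = re.compile(r"^(0x[0-9A-Fa-f]+)\s+(.*)$")


def split_code(message: str) -> Tuple[Optional[int], str]:
    """Return (numeric code, verbose text); the code is None when there is no prefix."""
    match = CODE_PREFIX.match(message)
    if match is None:
        return None, message
    return int(match.group(1), 16), match.group(2)
```

Software can group on the number while a human still reads the verbose part, and messages without a code keep working as before.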

Also this: View logs directly in the dashboard

Alexey · January 28, 2021, 4:00am

Could you please post the errors which you think are at the wrong log level, to make a more precise feature request? Examples would be helpful too.

kevink · January 28, 2021, 6:26am

Those are wrongly classified and happen very often. I don’t know about other ones as those are kind of rare. I even manually excluded those error messages on my dashboard so I don’t get alerted by an ever-increasing number of error messages that are completely harmless, irrelevant, and part of the normal operation of a node. Error messages are there to alert me of a problem, which these don’t.

Not sure what the “upload rejected” is classified as because I don’t get those.

Alexey · January 28, 2021, 7:11am

@SGC Could you please add those errors which could be WARN level instead?
Or @kevink, I converted the top post to wiki

kevink · January 28, 2021, 8:35am

Added the first post with the log level changes I see.

SGC · January 28, 2021, 8:59am

not quite sure exactly, lots of messages i don’t know in the logs at times.

if we start with the most fundamental concept i think, the dropped connections which are classified as errors should be something like, ofc then people cannot see it on default log levels, which would also be a consideration.

stuff like failed audits should be like CRITICAL log level, so that one can set the log level on critical and just see if there are any physical problems with the node.

version warning when the node is getting out of date might also belong in CRITICAL log level since it can get one DQ… not totally sure about that…might just belong in WARN.
because let’s face it, the version can be behind for a fairly long time… so in most cases not a CRITICAL issue, more like a fix me tomorrow… not a GET OUT OF BED issue

would be great if we could adjust the log levels that would be so much more useful.
it’s about people being able to basically put the log level at critical and then they will only see if the node is breaking.

and that people who don’t want to or can’t read very well can just look at the simple description of the log level to determine if it’s okay…
just because someone might not be able to read well, doesn’t mean they cannot read stuff like INFO, WARN, CRITICAL, ERROR or DEBUG and they can even easily follow a pattern of like say INFO down the side of the log.
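
to show what i mean by putting the log level at critical, here is a small sketch using Python’s standard `logging` module. the messages and the levels assigned to them are made up purely for illustration, and the node’s own logger may well work differently.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Collects formatted records in memory so the result can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(f"{record.levelname} {record.getMessage()}")


def visible_lines(threshold: int) -> List[str]:
    """Log a few made-up messages and return the ones that pass the threshold."""
    logger = logging.getLogger("node-demo")
    logger.setLevel(threshold)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        logger.info("download started")
        logger.warning("version is behind")
        logger.error("download failed: connection dropped")
        logger.critical("storage directory missing")
    finally:
        logger.removeHandler(handler)
    return handler.lines
```

with `visible_lines(logging.CRITICAL)` only the last message gets through, while `visible_lines(logging.WARNING)` lets the warning and the error through as well.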

so first error sticks out like a sore thumb… so they start decoding the error message, which is basically the length of the first chapter of hamlet, only to find out it’s just a dropped connection,
also since we don’t get DQ for downtime, then a bad connection or even no connection is sort of an error, but at the highest level a WARN

i think @kevink seems to be the one with the best hang of this, but i still think we should take a while to be sure that all the different logs are placed in the right places…

debug is also a useful log level that is disabled by default, i forget what’s in there, not much in general if memory serves, debug should ofc be disabled, and its main purpose is…

i’ll do the inverse of what you ask me to… i’ll write definitions of the log levels
so people can understand why, stuff belongs on the different levels.

kevink · January 28, 2021, 9:06am

I think this explains well enough what log levels should be used for: https://www.ibm.com/support/knowledgecenter/en/SSEP7J_10.2.2/com.ibm.swg.ba.cognos.ug_rtm_wb.10.2.2.doc/c_n30e74.html

A CRITICAL error is like a FATAL error and should never be used for what you suggest because that log level is for problems that kill your program. A missed audit doesn’t kill your program, too many do kill your node but that’s a different problem. It’s about what kills a program.