ERROR MESSAGES! For the love of god… simple error messages… please.

Very valid argument.

So a fatal error would be when the node software doesn’t shut down correctly.

Not sure I would fully agree that critical errors are like fatal errors… fatal is lethal, while a critical error might be recoverable… but that is sort of semantics.

What are the log levels?

DEBUG - (to save space, remove clutter, and of course make debugging much easier)

INFO - (most stuff is here… not sure if that should be in debug, but that might confuse people)

WARN - (stuff like broken connections; these can essentially be errors, just unimportant ones)

ERROR - (something isn’t working)

CRITICAL - (potentially fatal issues in the long term, or stuff a SNO would always have to know; remember the log levels are also about being able to sort out clutter)

FATAL - (software / storagenode killing)

Oh, isn’t there also a LOG log level, rather than just INFO?
INFO seems to be above LOG, so it goes DEBUG, LOG, INFO… like above.

There are five log levels:

  • None - nothing logged
  • Error - includes only errors and fatal errors
  • Warn - warnings, plus everything above
  • Info - information messages, plus everything above
  • Debug - some debug information, plus everything above
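
For completeness, here is roughly how a node operator would pick one of those levels. This is only a sketch: I’m assuming the standard log.level key in the node’s config.yaml (the same value can also be passed as a --log.level flag when starting the node), so check it against your own config file.

```yaml
# config.yaml (sketch; the key is assumed to be log.level)
# pick one of the levels discussed above, e.g. error, warn, info or debug
log.level: info
```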

I guess that’s enough; no point in making it more complicated than it has to be.

Debug also seems rather useless in how the log levels are currently used on the storage nodes…
Maybe that’s where all those closed connections and such belong.

@kevink

Any thoughts on whether those should be placed under debug instead? I think it’s a pity debug isn’t used more, and this is a perfect place to put these… I mean, should people really see them? Of course, then we would partially break the successrate.sh script, and everybody that wants to use that would have to run debug.

Maybe it’s not worth it…
It’s a shame, though, that debug isn’t used for more stuff… it would be a perfect place to put all sorts of confusing messages. Of course, then one would have to set the default log level to debug and let people reduce logging if they want. Since the INFO messages are like 99% of the log anyway, it doesn’t seem to matter much if the log is on debug by default… unless we could move that 95% or whatever of the logging into debug, since that’s the uploads and downloads…

Debug seems like the place where the devs put the confusing stuff, or where the majority of the logging data belongs.

Let’s get @BrightSilence, @Pentium100 and @littleskunk in here;
they will without a doubt have some opinions and good ideas on this.

Nooo, debug is what is usually not shown in the logs at all. It’s only for extra-verbose logging when looking for the cause of a very specific issue. While I agree that those closed connections shouldn’t be errors, they seem like the perfect example of a warning-level message. Meaning, a couple of them every once in a while is fine, but a log full of them is probably a signal that something is wrong. Putting these in debug would mean that nodes by default won’t log them at all, and you’d be missing a very important signal that something might be off.

Storj currently doesn’t use debug logging a lot. Normally that would be used for a lot of the sub-processes. Say, for an upload it could log the information exchange to start the upload, things like signature verification checks, every incremental order limit. In most software this would make the logs explode, so you would only turn it on when you are investigating. It should absolutely be off by default. If it’s not, Storj would never have the option to actually add more real debug logging in the future.

Additionally, there is an order of severity. Debug is less severe than info. So you should never include something going wrong in debug logging.

Btw, I think it may not be as simple as just reducing a few to warning. The problem is that depending on when a specific error appears it may be considered a warning or an error. Like a closed connection for a normal upload or download might not be so bad and should probably be a warning. But if it happens during an audit, it should probably be considered an error.
That said, I do wonder why Storj so far hasn’t used warnings at all. There must be some reasoning behind that. I would love to make a few suggestions, but if there is a philosophical reason not to use warnings at all, that would be kind of wasted effort.
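
To illustrate the severity ordering, here is a minimal sketch using the zap logging library (which the node’s log output appears to come from); this is not actual storagenode code. A logger configured at the default INFO level silently drops everything logged at DEBUG, which is exactly why putting operational signals there would hide them from node operators by default:

```go
package main

import "go.uber.org/zap"

func main() {
	// Sketch: the production zap config defaults to InfoLevel, so the Debug
	// line below is dropped entirely while the other three are written.
	cfg := zap.NewProductionConfig()
	cfg.Level = zap.NewAtomicLevelAt(zap.InfoLevel)

	logger, err := cfg.Build()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	logger.Debug("order limit details")             // invisible at the default level
	logger.Info("download started")                 // normal operation
	logger.Warn("connection closed by remote peer") // worth noticing if frequent
	logger.Error("file does not exist")             // operator should investigate
}
```

Keeping the default at INFO also means real debug logging can be added later without cost: those lines simply never reach the log unless someone opts in.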


IMO it should depend on the cause:
closed connection → INFO, as it’s not really a problem.
file not found → ERROR, as this is probably a big deal.

A closed connection should be WARN at most. Yeah, if these happen a lot there may be a problem, but once in a while is normal. OTOH, “file not found” happening once could be a big problem.


The source of the error is also relevant. The closed connection is something client-side that the node can do nothing about, so it’s not an error on its end anyway; INFO is enough.
File not found is an error because that should not happen and you might need to investigate why a file is missing.

And as mentioned by BrightSilence, debug is only for debugging a certain problem; it should offer a significant amount of information, but it is irrelevant for the SNO.

From what I’ve seen so far, I don’t think they thought too much about the classification of errors. But I could be wrong, in which case I’d like to hear that philosophy too.


Yeah, same here.

Good point, so we would leave debug alone.

Yeah, that makes sense; it’s about making it so one can tell whether to worry without needing a dictionary, IMO. It might be useful to have an additional fatal or critical level… but I dunno… three log levels, if we aren’t counting debug, isn’t much… there might be things worse than errors.

Just wanted to post some examples of what I think should probably be warnings. I think these cover the majority of what still remains; honestly, a lot has already been cleaned up.

Unexpected EOF

2021-01-29T13:14:14.093Z ERROR piecestore upload failed {"Piece ID": "C3BEPFFW33RL5KPV7S3DXX4CVCXTL5SNUHVH7IR42L3HZJDQOKGQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "error": "unexpected EOF", "errorVerbose": "unexpected EOF\n\tstorj.io/common/rpc/rpcstatus.Error:82\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:325\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:996\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:29\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51", "Size": 1318912}

Context deadline exceeded

2021-01-29T04:34:14.704Z ERROR piecestore upload failed {"Piece ID": "V5B4PVK5JVRHZP7OF66HTNFEXAW22WLXMRJWIRSUWLLQO747FXTQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT_REPAIR", "error": "context deadline exceeded", "errorVerbose": "context deadline exceeded\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:327\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:996\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:29\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51", "Size": 1318912}

Context deadline exceeded, similar for download

2021-01-24T16:06:42.172Z ERROR piecestore download failed {"Piece ID": "KHMYUUB5I5VKLVSOZ7FQJTPKBVWQGTED4NPASRSDIGHZFPRFH63Q", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET", "error": "context deadline exceeded", "errorVerbose": "context deadline exceeded\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func6:628\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:646\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:29\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Use of closed network connection

2021-01-27T14:06:07.534Z ERROR piecestore download failed {"Piece ID": "SPD7CBKVBXYEZF6T247C6BJXCE2D3LURVDWNIQIQ6PM3B4DUGYPQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "error": "write tcp 172.17.0.3:28967->172.17.0.1:50572: use of closed network connection", "errorVerbose": "write tcp 172.17.0.3:28967->172.17.0.1:50572: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).pollWrite:228\n\tstorj.io/drpc/drpcwire.SplitN:29\n\tstorj.io/drpc/drpcstream.(*Stream).RawWrite:276\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:322\n\tstorj.io/common/pb.(*drpcPiecestoreDownloadStream).Send:1089\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:580\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

Use of closed network connection, slightly different

2021-01-26T11:51:11.073Z ERROR piecestore download failed {"Piece ID": "YUO5ZPFKIMBH45ASEQD3XNHWNRBN3TZHJKZLBWW3B2ODZ5OOW6AQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET", "error": "tls: use of closed connection", "errorVerbose": "tls: use of closed connection\n\tstorj.io/drpc/drpcstream.(*Stream).pollWrite:228\n\tstorj.io/drpc/drpcwire.SplitN:29\n\tstorj.io/drpc/drpcstream.(*Stream).RawWrite:276\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:322\n\tstorj.io/common/pb.(*drpcPiecestoreDownloadStream).Send:1089\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:580\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

All of these will only impact a single transfer and are likely connected either to problems on the other side or to long tail cancellation. If you only have a few of these (I had maybe 5-10 a day), it’s almost certainly not an issue.

Yet if anything like this happens during an audit it probably should be an error.

Not entirely sure about these ones:

Satellite was down when this error popped up

2021-01-28T22:10:58.867Z ERROR contact:service ping satellite failed {"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "attempts":11, "error": "ping satellite error: rpc: dial tcp 35.228.10.185:7777: connect: connection refused", "errorVerbose": "ping satellite error: rpc: dial tcp 35.228.10.185:7777: connect: connection refused\n\tstorj.io/common/rpc.TCPConnector.DialContextUnencrypted:106\n\tstorj.io/common/rpc.TCPConnector.DialContext:70\n\tstorj.io/common/rpc.Dialer.dialEncryptedConn:180\n\tstorj.io/common/rpc.Dialer.DialNodeURL.func1:101\n\tstorj.io/common/rpc/rpcpool.(*Pool).Get:87\n\tstorj.io/common/rpc.Dialer.dialPool:146\n\tstorj.io/common/rpc.Dialer.DialNodeURL:100\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Unable to get signee trust

2021-01-24T04:53:53.887Z ERROR piecestore download failed {"Piece ID": "3GDB4UEPXO7UC5DUH3QBWOLXRMHHZHKAPVFP3NZ7WKT2WPOTFXZA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "untrusted: unable to get signee: trust: rpc: dial tcp: operation was canceled", "errorVerbose": "untrusted: unable to get signee: trust: rpc: dial tcp: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:462\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:29\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

That last one can happen when you have connection issues too, so ERROR might be warranted. Depends on the context though.
Actually, I just noticed that one was preceded by this:
2021-01-24T04:53:40.775Z ERROR servers unexpected shutdown of a runner {"name": "debug", "error": "debug: http: Server closed", "errorVerbose": "debug: http: Server closed\n\tstorj.io/private/debug.(*Server).Run.func2:108\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
Which might indicate a legitimate issue… I just looked a bit further: these last two errors happened during an update, so the node was killed by the OS. It would be nice if it could do this a little more gracefully without throwing errors, but let’s focus on that later. We can ignore these for now.

As for “file does not exist”, I would say that’s usually a legitimate error, but there can also be this…

2021-01-18T16:22:06.758Z        INFO    piecestore      upload started  {"Piece ID": "ZK4UXAIHY5T4FKW4UEDPNC7FISVP7U3U7MQD6NN6RX57UHO5XIJA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 7963735671296}
2021-01-18T16:22:07.279Z        INFO    piecestore      uploaded        {"Piece ID": "ZK4UXAIHY5T4FKW4UEDPNC7FISVP7U3U7MQD6NN6RX57UHO5XIJA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT"}
2021-01-18T16:22:08.992Z        INFO    piecestore      download started        {"Piece ID": "ZK4UXAIHY5T4FKW4UEDPNC7FISVP7U3U7MQD6NN6RX57UHO5XIJA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2021-01-18T16:22:10.856Z        DEBUG   pieces  deleted piece   {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "ZK4UXAIHY5T4FKW4UEDPNC7FISVP7U3U7MQD6NN6RX57UHO5XIJA"}
2021-01-18T16:22:10.856Z        INFO    piecedeleter    deleted {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "ZK4UXAIHY5T4FKW4UEDPNC7FISVP7U3U7MQD6NN6RX57UHO5XIJA"}
2021-01-18T16:22:10.861Z        ERROR   piecestore      download failed {"Piece ID": "ZK4UXAIHY5T4FKW4UEDPNC7FISVP7U3U7MQD6NN6RX57UHO5XIJA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:506\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:29\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
On a funny side note, I had a chuckle when these lines popped up...

2021-01-24T16:04:01.897Z INFO piecestore download started {"Piece ID": "ISCX5CD4QDBC76I2MIA2TA5VQHMGGVCOY36F2IDWQDGYERROR3JQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2021-01-20T15:25:44.704Z INFO piecestore upload started {"Piece ID": "V3QSGDEYKYRCGTOYIPAAERROR7NWE6CHDA6QXXQXUWOVUHYIGYVQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Available Space": 7940842964992}

What are the odds…


Bet it’s like when one notices a new color or type of car or something and it’s all of a sudden, like, everywhere…
Can’t trust our lying eyes, lol.

Long tail cancellation too

I think you’re right in most cases, though this specific time it happened during a node restart. Either way, cancellation implies some intentionality. So yes, I think this one should probably go in the warning column as well.

According to my calculations, the chance of ERROR showing up in a random 52-digit base32 string is about 1 in 699,051 :smiley:
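
For anyone who wants to check that number: “ERROR” is 5 characters, so it can start at any of 52 − 5 + 1 = 48 positions in the piece ID, and each position matches with probability (1/32)^5, assuming all 32 base32 characters are equally likely (and ignoring the negligible correction for overlapping matches):

$$
P \approx 48 \cdot \left(\frac{1}{32}\right)^{5} = \frac{48}{33{,}554{,}432} \approx \frac{1}{699{,}051}
$$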


Meaning that for every 2 terabytes of 2.6 MB-sized pieces you’ll get, on average, one of them. Not so uncommon?

I agree with your analysis of the errors, though. Those should be INFO or DEBUG lines for the most part. Failure to contact the satellite is definitely an error worth logging.

We do use warnings. They just don’t tend to show up as much as the other severities. The reason for that is largely psychological: to call something WARN, you’d probably have a scenario like:

  1. something has gone wrong which the operator might want to know about
  2. it is a thing that goes beyond “business as usual” (otherwise you’d call it INFO)
  3. it is also a thing that is probably safe to ignore; we know how to proceed (otherwise you’d call it ERROR or CRITICAL)
  4. the thing which has gone wrong is a fairly specific condition, well understood (otherwise we couldn’t be very sure of #3)
  5. it is a condition that the programmer could predict going wrong (otherwise it wouldn’t get checked for specifically)

This might be the case for something like bitrot detected in the orders database, or an OS setting that is harming performance in some minor way.

On the other hand, something gets called ERROR when

  1. something has gone wrong which the operator definitely needs to know about
    OR
  2. something has gone wrong which the programmer didn’t predict
    OR
  3. something has gone wrong which the programmer did predict, but when the code was written it wasn’t clear what should be done about the error, or how serious it is, and the error can’t just be kicked up to a higher level

Those cases come up a lot more, especially with a fairly new codebase. As the code matures, the instances of #2 and #3 there tend to get ironed out.

Really we’re only just now getting to the point where we can put a lot of effort into polishing the SNO experience, as opposed to building the initial product, finding and removing bugs, and eliminating performance problems. You’re right about those log lines, and you should expect us to make that situation better soon.
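
To make the WARN/ERROR split above concrete, here is a rough sketch of how such a classification could look. This is a hypothetical helper, not the actual storagenode implementation: it assumes the zap logger and standard-library error values, and real code would have to map the drpc/tls errors shown earlier onto these categories.

```go
package main

import (
	"context"
	"errors"
	"io"
	"net"

	"go.uber.org/zap"
)

// logTransferResult is a hypothetical helper: conditions that only mean the
// other side hung up or lost the long-tail race are logged as warnings, while
// anything unexpected stays an error the operator should look at.
func logTransferResult(logger *zap.Logger, action string, err error) {
	switch {
	case err == nil:
		logger.Info("transfer completed", zap.String("Action", action))
	case errors.Is(err, context.Canceled),
		errors.Is(err, context.DeadlineExceeded),
		errors.Is(err, io.ErrUnexpectedEOF),
		errors.Is(err, net.ErrClosed):
		// A few of these per day are normal; a flood of them is the real signal.
		logger.Warn("transfer interrupted by remote side",
			zap.String("Action", action), zap.Error(err))
	default:
		logger.Error("transfer failed",
			zap.String("Action", action), zap.Error(err))
	}
}

func main() {
	logger, _ := zap.NewProduction() // error handling omitted in this sketch
	defer logger.Sync()

	logTransferResult(logger, "PUT", context.Canceled)                  // -> WARN
	logTransferResult(logger, "GET", errors.New("file does not exist")) // -> ERROR
}
```

Along the lines of BrightSilence’s point earlier in the thread, the same helper could keep anything with an audit-related Action (e.g. GET_AUDIT) at ERROR regardless of the cause.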


That sounds about right.

I appreciate your response and I’m happy to hear this is something you’re looking into. Looking forward to the updates!

I still think WARN might be a good fit here. All your requirements seem to apply to these log lines, especially if they pop up a lot: they might indicate a performance issue. Though I guess the same could be said of canceled transfers. But INFO would definitely be preferred over ERROR. And it may even make sense to catch these errors in code and log them as a canceled transfer to begin with.

That said, I completely understand the reasons why these still show up as errors right now. It’s not always clear which situations will pop up a lot or how bad it is when they do, especially in code that deals with network communication between different independently managed entities. And like I mentioned in my previous post, most of them have already been cleaned up, and I only looked for the common ones. I’m sure even after the next round of fixes some will keep showing up, or new ones will be introduced with updates. In general it’s definitely getting better over time, though.

I appreciate the nerdiness in actually calculating this too! Haha


Please, add it to the first topic.