Today, we rolled out Noise in production to our edge gateways.
As part of the Noise rollout, we discovered that (1) our TLS stack has not been doing clean connection shutdown, and (2) Noise was. Unfortunately, we have gotten used to some of the side effects of not doing clean shutdown, so we also rolled out this change: https://review.dev.storj.io/c/storj/common/+/9906/3/rpc/connector.go (make sure to see the comment).
This is a bummer, but in retrospect it’s unsurprising that the nodes are now reacting this way. The reason TLS didn’t produce these log messages is that the application logic was clued in that the connection was shutting down, so the connection reset that was still happening was swallowed and ignored.
What I don’t know is whether your node is part of the winning set of nodes and this is what happens even on a full success, or whether this is the behavior your node sees only when it is canceled as part of a long tail. Do you have a hint there? Is this every request or just some? How frequently is this happening?
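Roughly, the shape of that swallowing logic is something like this (a heavily simplified sketch for illustration, not the actual TLS stack code):

```go
// Simplified sketch of the error swallowing described above (an
// illustration only): once shutdown has been signalled, a read failing
// with a connection reset is expected and dropped instead of being
// surfaced as an ERROR.
package connexample

import (
	"log"
	"net"
	"sync/atomic"
)

func readLoop(conn net.Conn, closing *atomic.Bool) {
	buf := make([]byte, 4096)
	for {
		if _, err := conn.Read(buf); err != nil {
			if closing.Load() {
				return // shutdown already in progress: swallow the reset quietly
			}
			log.Printf("unexpected read error: %v", err) // this is what surfaces now
			return
		}
		// ... hand the data to the application ...
	}
}
```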
Assuming we continue the SetLinger(0) approach, which we may re-evaluate, I’ll try to think about how to clean up the logs on the node side. Your node is behaving as expected, though, as far as I can tell.
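For context on what SetLinger(0) means at the socket level, here’s a minimal sketch (an illustration, not the actual connector.go change): closing a TCP connection with a zero linger timeout sends a RST instead of the normal FIN handshake, which is exactly what the other side reports as “connection reset by peer”.

```go
// Minimal illustration of SetLinger(0) (not the real connector code):
// with linger set to 0, Close discards unsent data and resets the
// connection, so the peer's read fails with "connection reset by peer"
// instead of seeing a clean EOF.
package main

import (
	"log"
	"net"
)

func dialAndAbort(addr string) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	if tcpConn, ok := conn.(*net.TCPConn); ok {
		if err := tcpConn.SetLinger(0); err != nil { // 0 = abortive close on Close()
			return err
		}
	}
	// ... use the connection ...
	return conn.Close() // peer sees ECONNRESET rather than EOF
}

func main() {
	if err := dialAndAbort("example.com:80"); err != nil {
		log.Fatal(err)
	}
}
```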
Thanks for the quick response and transparency - it’s refreshing.
I am seeing 10-15 ERROR messages each minute. With so many ERROR messages, it becomes difficult / impossible as a node operator to check in for a quick status to make sure everything is going well. Reclassifying these as WARN/INFO would allow us to still filter / alert on ERROR.
Uploaded my full log to Deleted if you would like to review.
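In the meantime, here’s a rough helper I’ve been using to keep a quick status check readable (my own throwaway script, nothing official): it reads the log on stdin, hides the ERROR lines that are just the noiseconn “connection reset by peer” noise, and reports how many it dropped, so anything else at ERROR still stands out.

```go
// Rough helper (assumed usage: `go run filter.go < node.log`): hides the
// noisy "connection reset by peer" ERROR lines coming from noiseconn so
// other ERRORs remain visible, and prints a count of hidden lines at the end.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
	hidden := 0
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "ERROR") &&
			strings.Contains(line, "connection reset by peer") &&
			strings.Contains(line, "noiseconn") {
			hidden++
			continue
		}
		fmt.Println(line)
	}
	fmt.Fprintf(os.Stderr, "hidden %d noiseconn reset errors\n", hidden)
}
```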
Every few seconds (or more often) I see errors like this in the storagenode logs:
...
connection reset by peer\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:183\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:143\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:223"}
Why? The dev provided a clear explanation of what happened - they have not been doing a clean shutdown for some time; we’re just seeing the ERROR messages now. The dev also confirmed that although these messages are appearing in our logs, they are not problematic and were likely suppressed prior to this release.
Is there an Enhancement Release needed to suppress the errors? Yes.
Is the presence of these errors indicative of a configuration problem? No.
OK, as you can see above in my sed-processed live log, some (most?) downloads (greened out) don’t have this error, so it can only be happening for canceled / losing-race requests, certainly not for all of them.
The frequency varies a lot. It is disturbing that, until now, SNOs have not known about the number of canceled downloads.
- Why do I suddenly hear so many gunshots?
- Relax, they’ll just be shooting without silencers for a while. There is nothing to worry about.
- Oh really? Until now I thought it was a quiet neighborhood.
IMO it isn’t a simple error to ignore. It appears that all this time SNOs were unaware of what was going on, and that the success rate is much lower than they thought. The errors were simply hidden, and it is to be expected that they will be hidden again in the future.
Until now I had maybe one “connection reset by peer” error every few minutes; now it’s on average one every few seconds.
Thank you for speaking up. I’m wondering why those messages were hidden in the past, and why they won’t be visible again in the future - are the stats being tweaked, or should cancelled requests simply be ignored altogether anyway?
Sorry, I’m confused in the same way as you, @Dunc4n1d4h0, haha. Maybe @jtolio can shed some light on it, or @Alexey? This might also be interesting for @BrightSilence, with regard to the successrate.sh script (?).
Yeah, we’ve known the success rates we see can’t be correct, but we’ve never known the exact cause. I’m not sure this would explain it completely though. But if the solution leads to better logging of long tail cancellation, that would be great.
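For anyone unfamiliar with the term, my rough understanding of long tail cancellation (a simplified sketch with made-up numbers, not the actual uplink code) is: the client starts more piece transfers than it needs and cancels the slowest ones as soon as enough have finished, and with an abortive close those cancelled nodes now see a reset instead of silence.

```go
// Minimal sketch of long tail cancellation (assumed behaviour, numbers
// invented): start transfers to n nodes, keep the first `needed`
// successes, and cancel the rest. The cancelled connections are what
// the losing nodes see being torn down.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

func transfer(ctx context.Context, node int) error {
	select {
	case <-time.After(time.Duration(rand.Intn(300)) * time.Millisecond):
		return nil // this node finished in time
	case <-ctx.Done():
		return ctx.Err() // long tail: cancelled after enough others finished
	}
}

func main() {
	const n, needed = 10, 6
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	results := make(chan error, n)
	for i := 0; i < n; i++ {
		go func(node int) { results <- transfer(ctx, node) }(i)
	}

	succeeded := 0
	for i := 0; i < n; i++ {
		if err := <-results; err == nil {
			succeeded++
			if succeeded == needed {
				cancel() // abort the slowest transfers - the "long tail"
			}
		}
	}
	fmt.Printf("kept %d transfers, cancelled the rest as long tail\n", succeeded)
}
```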
I have seen a lot more cancelled transfers in the logs lately. Can this be connected? I haven’t noticed these errors yet, though I haven’t looked for them specifically and I’m on my phone right now, so I can’t easily check atm.
I have not noticed a significant increase in failed transfers on my end, just in cancelled ones. But I should add, that’s just based on casual glances at my logs scrolling by when I do other stuff.