Connection reset by peer errors

Today, we rolled out Noise in production to our edge gateways.

As part of the Noise rollout, we discovered that (1) our TLS stack has not been doing a clean connection shutdown, and (2) our Noise stack was. Unfortunately, we have come to rely on some of the side effects of not shutting down cleanly, so we also rolled out this change: https://review.dev.storj.io/c/storj/common/+/9906/3/rpc/connector.go (make sure to see the comment).

It’s a bummer, but in retrospect it’s unsurprising that the nodes are now reacting this way. TLS didn’t produce these log messages because the application logic was clued in that the connection was shutting down, so the connection reset, which was still happening, was swallowed and ignored.

What I don’t know is whether your node is in the winning set of nodes and this happens even on a full success, or whether your node only sees this when it is canceled as part of the long tail. Do you have a hint there? Is it every request or just some? How frequently does it happen?

Assuming we continue with the SetLinger(0) approach, which we may re-evaluate, I’ll try to think about how to clean up the logs on the node side. As far as I can tell, though, your node is behaving as expected.
