Trouble with UDP

Hello, everyone.

I have had massive problems with my success rate: it has been falling for a few days, and I am also seeing a lot of errors in the log (connection reset by peer).

At first I suspected my new Mikrotik router and/or my new DSL line and tried for days to find a solution.

Today, after several attempts and changes to the Mikrotik, I came up with the idea of deactivating the UDP port of the nodes.

Man … I should have figured it out earlier. That would have saved me hours. :expressionless:

I have now deactivated all UDP ports and the success rate immediately went back up. The many connection resets also no longer appear in the log.

I guess I lost a lot of data as a result; at least my trash folders are relatively large.

Now my question: what can I do to improve the UDP connection? As long as I don’t know the solution, I’ll keep the UDP ports closed. Do I have to restart the container, or can I just close the port? I don’t want my nodes to lose their online status afterwards because of the closed port.

Unfortunately, not all routers handle UDP well, and yours seems to be in that category. At the moment the solution would be either to replace the router or to disable the UDP rule.

I very much doubt that my Mikrotik cannot handle UDP :face_with_monocle:

But well, I’ll leave UDP disabled for now, as long as it doesn’t affect my nodes.

Can

-p 28971:28967/udp \

still remain in the container or does it have to be removed?

If you are just dropping all UDP traffic at the router/firewall level, you can either remove the docker port mapping or keep it. The node will function the same. Is it possible your router is treating the traffic on UDP as a DoS attack and dropping packets?
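Just to illustrate (a sketch only: the container name, image and remaining flags are placeholders for whatever you already run with, and the usual mounts and environment variables are omitted), the two options would look roughly like this:

# Option 1: keep the UDP mapping; harmless while UDP is dropped upstream at the router
docker run -d --name storagenode \
  -p 28971:28967/tcp \
  -p 28971:28967/udp \
  storjlabs/storagenode:latest

# Option 2: drop the UDP mapping entirely; the node still works over TCP
docker run -d --name storagenode \
  -p 28971:28967/tcp \
  storjlabs/storagenode:latest

Either way the node behaves the same as long as the router isn’t passing UDP through.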

I also expect your router isn’t the problem. I may be able to help diagnose, but first let me clarify: what do you mean by “success rate”? Is that the online percentage as reported on the storagenode console?

UDP was giving me problems too, so I took out the ports that were supposedly passing it. I’m pretty sure it wasn’t the router, since it’s a higher-end model, but the problems are gone. I still get the log warnings mentioning the QUIC protocol, but I’m unsure if that matters. Ever since I disabled UDP, my nodes have been steadily getting their reputation back.

It’s a community thing. :slight_smile: A while back we started looking at what percentage of transfers finished successfully, and eventually a few of us created tools to parse the logs for those numbers. Changes to the node code impact these results significantly from time to time, though, so success rates aren’t really comparable across releases. More here: Success rate script - Now updated for new delete terminology since v1.29.3
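The rough idea is just counting log lines, something along these lines (not the actual script; I’m assuming a docker container named storagenode and the usual “downloaded” / “download canceled” / “download failed” log messages, which can vary between versions):

# Rough idea only, not the real success rate script
logs() { docker logs storagenode 2>&1; }
ok=$(logs | grep -c 'downloaded')
canceled=$(logs | grep -c 'download canceled')
failed=$(logs | grep -c 'download failed')
echo "GET success rate: $ok successful out of $((ok + canceled + failed)) download attempts"

The real script breaks this down further (uploads, audits, repair traffic), but the principle is the same.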

Oh! Right, I’ve seen that script. Hmm. I have a feeling that what gets shown in the log when a UDP connection is terminated/canceled/timed out is not quite the same as what gets shown in the log when those events happen to a TCP connection. If that is the case, then everything on OP’s node could have been working perfectly fine.

@baker I don’t think so. I can’t see a drop in the firewall log.

@thepaul This error appears in the node log when UDP is activated:

2021-09-14T17:09:53.355Z        ERROR   piecestore      download failed {"Piece ID": "6S4Y3HL6DZGJ6SECPAA3SDUOSX2QMHWXFDDOVMUDOCAXC6IPRGTA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET", "error": "write quic 172.17.0.4:28967->135.181.29.50:52136: connection reset by peer", "errorVerbose": "write quic 172.17.0.4:28967->135.181.29.50:52136: connection reset by peer\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:315\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:390\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:302\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

Perfect, that was the log line I needed. I looked through the code and the success rate script, and I think the script should be changed (cc @BrightSilence) to count connection reset by peer as a Canceled result instead of Failed (in both the UDP and TCP cases). When an uplink drops the connection to your node because it already has enough pieces, that could show up as either download canceled or download failed ... connection reset by peer, depending on whether your node is waiting on an ACK from the uplink or is still trying to send packets (and also on a whole heap of other considerations, like various timeout values in the node config, the QUIC config, or the TCP stack).

Now, if the percentage of canceled connections goes up significantly when you allow UDP, there might still be a problem and we can diagnose from there.

I’m very happy you said that, because I completely agree that these are in fact canceled transfers and not failed ones. But let me change that slightly.

I looked through the storagenode code and the success rate script, and I think the ~~script~~ storagenode code should be changed to count connection reset by peer as a Canceled result instead of Failed (in both the UDP and TCP cases).

For the success rate script this fortunately doesn’t matter much, because the intention is in the name: we were really looking at what percentage of transfers were successful. The only place where the errors really matter is audits, and cancelations don’t play into that, because audits aren’t canceled before they time out after 5 minutes, and hitting that timeout would always result in an error anyway.

The problem is that this is far from the only error that actually represents a canceled transfer.

There is “use of closed network connection”:

2021-09-16T13:43:19.147Z        ERROR   piecestore      download failed {"Piece ID": "SH3MFVIOJRD3GWVDWRTRIFOU5GFHGS3FURPZ7J5JYATXLJGDTOUQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "write tcp 172.17.0.3:28967->172.17.0.1:48157: use of closed network connection", "errorVerbose": "write tcp 172.17.0.3:28967->172.17.0.1:48157: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:315\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:390\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:302\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

and “use of closed connection”:

2021-09-16T12:53:59.931Z        ERROR   piecestore      download failed {"Piece ID": "SH3MFVIOJRD3GWVDWRTRIFOU5GFHGS3FURPZ7J5JYATXLJGDTOUQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "tls: use of closed connection", "errorVerbose": "tls: use of closed connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:315\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:390\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:302\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

and “context deadline exceeded”:

2021-09-16T11:11:13.481Z        ERROR   piecestore      download failed {"Piece ID": "5PREQBB23F6NQPBWHCCIDCASW64SWFFVORUKZ66G752ZO36KJ5XA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "context deadline exceeded", "errorVerbose": "context deadline exceeded\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func6:656\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:674\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

and “connection reset by peer”:

2021-09-16T10:07:30.258Z        ERROR   piecestore      download failed {"Piece ID": "VYV46TFATCN6ETNZXE2RL3P6PIVOVRBVBZGGWJHY2D6XR4ACDQ5Q", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET", "error": "write quic 172.17.0.3:28967->172.17.0.1:33156: connection reset by peer", "errorVerbose": "write quic 172.17.0.3:28967->172.17.0.1:33156: connection reset by peer\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:315\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:390\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:302\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

and “unexpected EOF”:

2021-09-16T13:19:16.627Z        ERROR   piecestore      upload failed   {"Piece ID": "FMWVGFES6JTQ72ZUKO7CEATUI2LEMFFUIUKFSQHHYBUDTNFKHD3Q", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "error": "unexpected EOF", "errorVerbose": "unexpected EOF\n\tstorj.io/common/rpc/rpcstatus.Error:82\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:342\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:209\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52", "Size": 0}

All of these are almost certainly caused by canceled transfers, in a scenario similar to the one you described.
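If you want to see how much of this is in your own log, a quick check like this counts how many of the “failed” lines are just these cancel-like errors (assuming a docker container named storagenode):

docker logs storagenode 2>&1 \
  | grep -E '(download|upload) failed' \
  | grep -cE 'connection reset by peer|use of closed (network )?connection|context deadline exceeded|unexpected EOF'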

I could of course build in exceptions for all of those in the script, but the thing is… the success rate script is the least of the problems this causes. Every time a SNO wants to debug something, they ask about these errors, and we forum regulars have to tell them they can ignore these. We’re happy to do that, but it causes an extra cycle of finding something, asking about it, and having to go back searching again. It’s a massive distraction, and it also teaches SNOs to ignore errors, which isn’t exactly the message you want to send out there.

And so I decided to not bother adding in exceptions for every possible way a canceled transfer could generate an error. The node calls them errors, so I’m sticking to that. I have previously brought this subject up in a discussion about using the WARN level instead of ERROR for these. But in fact they should probably just be correctly logged as canceled transfers to begin with.

Generally when someone is complaining about problems with success rate, they refer to exactly that, the rate of success. I’m pretty sure @LinuxNet was referring to that as well, which means that the total of canceled + failed went up. I think that would still be something to look into.

Haha, I take your point: the code needs to change to make this more clear. I agree entirely. Most of the cases you specify should not be logged as errors.

However, it will take time for that change to get made, reviewed, merged, and deployed, and in the meantime it might be useful for the script to do the necessary transformations (but, of course, it’s your script, you get to make the decisions!).
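If it helps, that transformation could be as small as a filter along these lines (a sketch only, not a patch to the real script, and assuming the log messages quoted above and a docker container named storagenode):

# Sketch: reclassify the cancel-like "failed" downloads as canceled
docker logs storagenode 2>&1 | awk '
  /download canceled/ { canceled++; next }
  /download failed/ {
    if (/connection reset by peer|use of closed (network )?connection|context deadline exceeded|unexpected EOF/) { canceled++ } else { failed++ }
    next
  }
  /downloaded/ { success++ }
  END { printf "success=%d canceled=%d failed=%d\n", success, canceled, failed }
'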

I’m not sure that these correlate to canceled connections. These might instead be logical bugs.

You’re probably right. Although, given that the main complaint seems to have been about the existence of “connection reset by peer” log lines, I’m not sure.


Sorry for the late reply. I don’t find much time at the moment to check the forum regularly.

If I have understood correctly, everything is okay even with UDP activated; the script just records these errors incorrectly, which is why the success rate falls?

Yes, of course I looked at the script, then saw the error messages in the log and wasn’t sure whether they were harmless. When I’m unsure, I’d rather ask for advice.

I think the open question was whether you actually saw the success rate drop, or just a shift from canceled to failed. If the success rate dropped, that may still signify an underlying issue. If it’s only a shift from canceled to failed, it’s likely just a difference in how these cancelations are handled.