I’m very happy you said that, because I completely agree that these are in fact canceled transfers and not failed ones. But let me amend that slightly.
I looked through the storagenode code and the success rate script, and I think the storagenode code should be changed to count “connection reset by peer” as a Canceled result instead of Failed (in both the UDP and TCP cases).
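To make that concrete, here’s a minimal sketch of the kind of check I mean. None of this is the actual storagenode code: classifyTransferError is a hypothetical helper, and where exactly such a check would live in storj.io/storj/storagenode/piecestore is an open question. A reset on the TCP path normally surfaces as syscall.ECONNRESET, and the QUIC/UDP write in the log below reports the same “connection reset by peer” text, so a combined check covers both:

```go
// Minimal sketch (not the actual storagenode code) of counting a
// "connection reset by peer" as a canceled rather than failed transfer.
package main

import (
	"errors"
	"fmt"
	"strings"
	"syscall"
)

// classifyTransferError is a hypothetical helper for illustration only.
func classifyTransferError(err error) string {
	switch {
	case err == nil:
		return "successful"
	case errors.Is(err, syscall.ECONNRESET),
		strings.Contains(err.Error(), "connection reset by peer"):
		// The client hung up mid-transfer: on TCP this arrives as ECONNRESET,
		// and the QUIC/UDP path reports the same "connection reset by peer"
		// text, so both end up counted as canceled here.
		return "canceled"
	default:
		return "failed"
	}
}

func main() {
	fmt.Println(classifyTransferError(syscall.ECONNRESET))           // canceled
	fmt.Println(classifyTransferError(errors.New("disk I/O error"))) // failed
}
```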
For the success rate script this fortunately doesn’t matter much, because the intention is in the name: we were really looking for what percentage of transfers were successful. The only place where the errors themselves matter is audits, and cancelations don’t play into it there, since audit transfers aren’t canceled before the 5-minute timeout, and hitting that timeout would always result in an error anyway.
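To spell out why the canceled-versus-failed label doesn’t move the script’s headline number, here’s the bare arithmetic with made-up counts (this is not the actual success rate script):

```go
// The rate of success only depends on the successful count versus the total,
// so relabeling part of the remainder from failed to canceled changes nothing.
package main

import "fmt"

func main() {
	successful, canceled, failed := 9500, 400, 100 // hypothetical counts
	total := successful + canceled + failed
	fmt.Printf("success rate: %.2f%%\n", 100*float64(successful)/float64(total)) // 95.00%
}
```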
The problem is that this is far from the only error that is actually a canceled transfer.
There is “use of closed network connection”:
2021-09-16T13:43:19.147Z ERROR piecestore download failed {"Piece ID": "SH3MFVIOJRD3GWVDWRTRIFOU5GFHGS3FURPZ7J5JYATXLJGDTOUQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "write tcp 172.17.0.3:28967->172.17.0.1:48157: use of closed network connection", "errorVerbose": "write tcp 172.17.0.3:28967->172.17.0.1:48157: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:315\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:390\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:302\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
and “use of closed connection”:
2021-09-16T12:53:59.931Z ERROR piecestore download failed {"Piece ID": "SH3MFVIOJRD3GWVDWRTRIFOU5GFHGS3FURPZ7J5JYATXLJGDTOUQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "tls: use of closed connection", "errorVerbose": "tls: use of closed connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:315\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:390\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:302\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
and “context deadline exceeded”:
2021-09-16T11:11:13.481Z ERROR piecestore download failed {"Piece ID": "5PREQBB23F6NQPBWHCCIDCASW64SWFFVORUKZ66G752ZO36KJ5XA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "context deadline exceeded", "errorVerbose": "context deadline exceeded\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func6:656\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:674\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
and “connection reset by peer”:
2021-09-16T10:07:30.258Z ERROR piecestore download failed {"Piece ID": "VYV46TFATCN6ETNZXE2RL3P6PIVOVRBVBZGGWJHY2D6XR4ACDQ5Q", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET", "error": "write quic 172.17.0.3:28967->172.17.0.1:33156: connection reset by peer", "errorVerbose": "write quic 172.17.0.3:28967->172.17.0.1:33156: connection reset by peer\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:315\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:390\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:302\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:608\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
and “unexpected EOF”:
2021-09-16T13:19:16.627Z ERROR piecestore upload failed {"Piece ID": "FMWVGFES6JTQ72ZUKO7CEATUI2LEMFFUIUKFSQHHYBUDTNFKHD3Q", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "error": "unexpected EOF", "errorVerbose": "unexpected EOF\n\tstorj.io/common/rpc/rpcstatus.Error:82\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:342\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:209\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52", "Size": 0}
All of these are almost certainly caused by canceled transfers in a scenario similar to the one you described.
I could of course build in exceptions for all of those in the script, but the thing is… the success rate script is the least of the problems this causes. Every time a SNO wants to debug something, they ask about these errors, and we forum regulars have to tell them they can ignore these. Which we’re happy to do, but it causes an extra cycle of finding something, asking about it, and having to go back and search again. It’s a massive distraction, and furthermore it teaches SNOs to ignore errors, which isn’t exactly the message you want to send out there.
And so I decided not to bother adding exceptions for every possible way a canceled transfer could generate an error. The node calls them errors, so I’m sticking to that. I have previously brought this subject up in a discussion about using the WARN level instead of ERROR for these. But in fact they should probably just be logged correctly as canceled transfers to begin with.
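For what it’s worth, here’s a rough sketch of what “logged correctly as canceled transfers to begin with” could look like. It’s hypothetical: the substrings come straight from the log excerpts above, the log format is only an approximation, and the real change would have to live somewhere in storj.io/storj/storagenode/piecestore:

```go
// Hypothetical sketch: treat known client-disconnect errors as canceled
// transfers at log time instead of reporting them as download failures.
package main

import (
	"fmt"
	"strings"
)

// clientDisconnects lists error texts (taken from the log excerpts above)
// that in practice mean the uplink went away, not that the node misbehaved.
var clientDisconnects = []string{
	"use of closed network connection",
	"use of closed connection",
	"context deadline exceeded",
	"connection reset by peer",
	"unexpected EOF",
}

func logDownloadResult(err error) {
	if err == nil {
		fmt.Println("INFO  piecestore downloaded")
		return
	}
	for _, s := range clientDisconnects {
		if strings.Contains(err.Error(), s) {
			// The client going away is a canceled transfer, not a failure.
			fmt.Printf("INFO  piecestore download canceled {\"error\": %q}\n", err)
			return
		}
	}
	fmt.Printf("ERROR piecestore download failed {\"error\": %q}\n", err)
}

func main() {
	logDownloadResult(fmt.Errorf("write tcp 172.17.0.3:28967->172.17.0.1:48157: use of closed network connection"))
	logDownloadResult(fmt.Errorf("disk I/O error"))
}
```

Matching on error text is only for illustration; the real code would presumably check the underlying error types (context cancelation, ECONNRESET and friends) rather than strings.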
Generally, when someone is complaining about problems with success rate, they are referring to exactly that: the rate of success. I’m pretty sure @LinuxNet was referring to that as well, which means that the total of canceled + failed went up. I think that would still be something to look into.