GEs failing for no apparent reason?

I’m pretty sure that is the receiving node rejecting the transfer due to their request limit. Not on @SGC’s end.

3 Likes

Did you figure anything out…?
Still getting DQ'd on sats doing GE… this is just the most recent batch.
[screenshot of the most recent batch of failed graceful exits]

Nothing very definite. What I see recorded in the satellite is that this node (1PfrHppTf6o89mECFm84xiN9eHfTYJAhwfdTgKzK1roBWAzFXG) successfully transferred 56761 pieces and failed to transfer 31481 pieces, about a 36% failure rate. Since that is more than the 10% limit, the graceful exit was marked as failed.
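
(Just to spell out the arithmetic, here is a rough sketch in Python, assuming the check is simply the overall failure fraction compared against that 10% limit; the names below are made up for illustration, not the satellite's actual config or code.)

    # Rough sketch of the check described above; illustrative only.
    transferred = 56761
    failed = 31481

    failure_rate = failed / (failed + transferred)
    print(f"failure rate: {failure_rate:.1%}")   # ~35.7%, i.e. about 36%

    OVERALL_FAILURE_LIMIT = 0.10                 # assumed from the 10% limit above
    if failure_rate > OVERALL_FAILURE_LIMIT:
        print("graceful exit marked as failed")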

I don’t see that in your logs of the last few days, however. What I see over those last 3 days is closer to a 3% failure rate. If there really were 31481 failures, then most of them must have happened earlier, sometime over the preceding month.

If you want to go as far as to send me all of the logs starting from 2023-02-09, or at least grep them all for '1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE', I can find where the bulk of the failures happened (or, maybe, establish that there weren’t actually that many failures, and diagnose what went wrong on the satellite side).
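
(If grep across all the rotated files is awkward, a rough Python equivalent along these lines would do the same filtering before sending; the paths are just placeholders.)

    # Pull only the lines for that satellite out of the node logs and
    # write them compressed. Adjust the paths to wherever your logs live.
    import glob
    import gzip

    SAT_ID = "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"

    with gzip.open("saltlake-ge.log.gz", "wt") as out:
        for path in sorted(glob.glob("/path/to/logs/node.log*")):
            with open(path, errors="replace") as f:
                for line in f:
                    if SAT_ID in line:
                        out.write(line)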

2 Likes

Sure, the DQs don't really matter much to me, the held amounts are pretty limited.
I just want to help improve the GE feature, mainly because I was already complaining about it when I read the blueprint years ago lol… GE's specs are rather strict lol.

I'll send you those logs today or tomorrow. They might be a bit big… but if I compress them all it shouldn't be too bad… they are logs, after all.

2 Likes

It may help if I share my experience with GE on this satellite last weekend:

Node ID: 1MP7YijXPMuTeBedVBsw4gNyTW5J8xaCkYhGuJrHyTeyjbckdQ

2023-03-10T07:58:59.167Z        INFO
piecetransfer   piece transferred to new storagenode
{"Process": "storagenode",
"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
"Piece ID": "YO6ZYQAHZQ5DD2X3W6XYNG2CKXTECYSYUQVHAV72VN2AIHSKAVMA",
"Storagenode ID": "122H6s6LxwcMaT7QbFGRsZoVZDZ7Dj1wQzesUQSwLRQxxzf1w6C"}

2023-03-10T07:58:59.167Z        ERROR
gracefulexit:chore.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE@saltlake.tardigrade.io:7777
failed to send notification about piece transfer.
{"Process": "storagenode",
"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
"error": "write tcp 172.17.0.5:39374->34.94.153.46:7777: write: connection reset by peer",
"errorVerbose": "write tcp 172.17.0.5:39374->34.94.153.46:7777: write: connection reset by peer
        storj.io/drpc/drpcstream.(*Stream).rawFlushLocked:365
        storj.io/drpc/drpcstream.(*Stream).MsgSend:424
        storj.io/common/pb.(*drpcSatelliteGracefulExit_ProcessClient).Send:79
        storj.io/storj/storagenode/gracefulexit.(*Worker).Run.func3:101
        storj.io/common/sync2.(*Limiter).Go.func1:49"}

I had thousands of these errors in my logs.

This could be happening because the piece transfer took too long. If I remember right, the node will be asked to transfer that piece again, so it’s usually fine.

Ok, your logs show 7,941 piece transfer failures and 244,693 successes. The real count isn’t as straightforward as that because the server can request a piece to be transferred more than once, and transfers are usually retried after failures. But that indicates a maximum failure rate of 3%, not 10% or 36%.
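
(If anyone wants to repeat that tally on their own logs, a rough sketch follows; exactly which error lines count as real transfer failures is an assumption here, so treat the printed rate as an upper bound.)

    # Rough tally of graceful-exit outcomes from a node log. The success
    # message is the one quoted earlier in the thread; counting every
    # gracefulexit ERROR line as a failure is an assumption.
    success = failure = 0
    with open("node.log", errors="replace") as f:   # placeholder path
        for line in f:
            if "piece transferred to new storagenode" in line:
                success += 1
            elif "gracefulexit" in line and "ERROR" in line:
                failure += 1

    total = success + failure
    if total:
        print(f"{failure} failures out of {total} lines = {failure / total:.1%}")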

I’m trying to correlate this with the final counts stored in the satellite and with the logs available from the satellite, but all three are telling wildly different stories. At this point I’m pretty convinced that the satellite is at fault, counting failures incorrectly, and possibly wasting a lot of node bandwidth transferring pieces multiple times.

I’m going to have to get some work scheduled to get to the bottom of this.

3 Likes

I've got more nodes seeing the issue, and some of them also have logs… sadly I rolled my own custom logging solution, which didn't work at scale, so I ended up only having logs for 10 nodes.

But let me know if you need more.
Good to hear that it's not a problem on my side. It did seem very weird, as I'm not really seeing any signs of issues and have been running my nodes for just over 4 years without ever having anything get DQ'd.

My youngest nodes are 2 years old, and my setup hasn't really changed that much; if anything it's gotten a lot better, faster, and more reliable.

1 Like

Any update on this…?

I've got a bit more information for you.

It seems AP1's GEs are all successful.
Also, europe-north-1 and saltlake both seem to fail at around the same rate.

All GEs except one have finished or been DQ'd now.
The node that hasn't finished is kinda special because it has a bandwidth.db that takes up 7.7 GB, so maybe that's related to why it's been so slow…

It will be interesting to see if it also completes the GE successfully on AP1, because then AP1 will have a 100% success rate for GEs.

Hope it helps.

That is indeed interesting. I have a suspicion that the problem here is related to connection timeouts. It would make sense for different satellites to exhibit the problem at different levels if that is the case.

I haven’t been able to dig much deeper yet, though. I’ve got a ticket scheduled but some other work had to come first.

2 Likes