Live connections don't close for hours

@littleskunk

Can I ask you to look into this log?

curl -s localhost:7777/mon/ps|grep -c "live-request"
87
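To go beyond the raw count and see which of those requests are actually old rather than just a burst of fresh uploads, here is a minimal sketch that filters the /mon/ps output down to live-request spans older than an hour. It assumes the same debug address (localhost:7777) and the elapsed format shown in the dumps later in this thread.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"regexp"
	"time"
)

func main() {
	// Assumed debug address, same as the curl command above.
	resp, err := http.Get("http://localhost:7777/mon/ps")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Matches lines like:
	//   [123...] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 274.479502ms)
	re := regexp.MustCompile(`live-request\(\) \(elapsed: ([^)]+)\)`)
	threshold := time.Hour // anything older than this looks suspicious

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		m := re.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		elapsed, err := time.ParseDuration(m[1])
		if err != nil {
			continue
		}
		if elapsed > threshold {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}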

Maybe it's a specific test pattern, or the connections are just not closing…

I’m seeing uploads lasting up to 19h as well. Shouldn’t the storagenode eventually cancel these? And by eventually I mean way before 19h.

Here’s my /mon/ps

This could be hanging connections; I think this is why I ran into the max connections problem before.

When will it be done?

So, as I can see, it's a known issue and you already have a fix for it.

I've seen this problem during uploads recently with the new release. My temp blob files have been growing, I assume from broken connections during upload. Before, I'd only have 1-3 temporary blobs hang around for long periods of time, but on this latest release blobs are forgotten about for several hours and accumulate in the temp folder. I found a fix for this: stopping the node deleted the open files, and after a restart all temp files were gone.
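If you want to check for those leftovers without restarting, a rough sketch like the one below lists old files in the temp folder. The path is an assumption; point it at the temp directory under your own node's storage location.

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Assumed location: the "temp" folder under the node's storage directory.
	tempDir := "/mnt/storagenode/storage/temp"
	cutoff := 6 * time.Hour

	entries, err := os.ReadDir(tempDir)
	if err != nil {
		panic(err)
	}
	for _, entry := range entries {
		info, err := entry.Info()
		if err != nil {
			continue
		}
		// Report any temp file that has not been touched for longer than the cutoff.
		age := time.Since(info.ModTime())
		if age > cutoff {
			fmt.Printf("%s\t%v old\t%d bytes\n", entry.Name(), age.Truncate(time.Minute), info.Size())
		}
	}
}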

I’ve also seen error messages saying the connection was abruptly terminated by the other side, not a context cancelled message.

That commit was part of v0.28.2. I noticed the long-living connections a few releases ago and was thinking this commit would have fixed it. At the moment I don't have the time to dig deeper.

No worries, good to see you’re already on it. Good thing we have drpc, this would have been a bigger issue with the grpc limit.

No problem, I just put it on GitHub; please look into it when you have time.

It can become a problem when the system runs out of available connections, so I'm calling attention to it before that happens in the near future.

Don’t know why, but they’re all gone now on my node.

[512235751686384200] storj.io/storj/pkg/process.root() (elapsed: 52h10m52.156871271s)
 [7490332208289568717] storj.io/storj/storagenode.(*Peer).Run() (elapsed: 52h10m51.678453899s)
  [2353641729313367517] storj.io/storj/pkg/server.(*Server).Run() (elapsed: 52h10m48.149627319s)
  [2644417434057814430] storj.io/storj/private/version/checker.(*Service).Run() (elapsed: 52h10m48.150647236s)
  [2466735627193528578] storj.io/storj/storagenode/bandwidth.(*Service).Run() (elapsed: 52h10m48.15065179s)
  [5600420241766549131] storj.io/storj/storagenode/collector.(*Service).Run() (elapsed: 52h10m48.150653428s)
  [4954280923293530513] storj.io/storj/storagenode/console/consoleserver.(*Server).Run() (elapsed: 52h10m48.150230879s)
  [5245056628037977426] storj.io/storj/storagenode/contact.(*Chore).Run() (elapsed: 52h10m48.150684086s)
  [1933690206600671020] storj.io/storj/storagenode/gracefulexit.(*Chore).Run() (elapsed: 52h10m48.150743978s)
  [3888190082107815398] storj.io/storj/storagenode/monitor.(*Service).Run() (elapsed: 52h10m48.150407809s)
  [1998278115584795812] storj.io/storj/storagenode/orders.(*Service).Run() (elapsed: 52h10m48.149999436s)
  [8621010958459408624] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 52h10m48.149612051s)

[2273161994833705216] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 274.479502ms)
 [5406846609406725770] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload() (elapsed: 274.455166ms)

[6650619257456726737] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 53.420399ms)
 [560931835174971483] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload() (elapsed: 53.396706ms)

[6798739510462542581] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 1m52.821534354s)
 [709052088180787326] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload() (elapsed: 1m52.821508365s)

[9186670542452764942] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 1.052132658s)
 [3096983120171009687] storj.io/storj/storagenode/piecestore.(*Endpoint).doDownload() (elapsed: 1.05212117s)

Just a few shorter ones now, which is perfectly fine. Didn’t restart the node or anything. It just resolved itself.

The other side of the connection (most likely uplink or gateway) might have restarted.

This happened because the tests were stopped and rerun :wink:

And when I wrote that there were hanging connections, no one believed me; support all told me it was because I had a limit. But the limit can't be reached if connections don't get stuck.

I think it was a bit of both. The limit you had in the config imposed a limit on your node that shouldn’t have been there. If it weren’t there, the hanging connections would not make your node reject other uploads. Removing that setting from the config.yaml was still the right thing to do. But yes, it looks like there is also still a small issue that leaves connections active.
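To illustrate the mechanism (this is not the actual storagenode limiter, just a toy model): with a fixed cap, every request holds a slot until it finishes, so slots pinned by hanging requests never come back, and new uploads get refused even on an otherwise idle node.

package main

import (
	"errors"
	"fmt"
)

var errBusy = errors.New("upload rejected: too many concurrent requests")

// limiter is a toy concurrency cap: a buffered channel acts as a pool of slots.
type limiter struct {
	slots chan struct{}
}

func newLimiter(max int) *limiter {
	return &limiter{slots: make(chan struct{}, max)}
}

// acquire takes a slot, or fails immediately when every slot is in use.
func (l *limiter) acquire() error {
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return errBusy
	}
}

// release returns a slot; a hanging request never gets here.
func (l *limiter) release() { <-l.slots }

func main() {
	lim := newLimiter(7)

	// Seven uploads hang and never call release(), so their slots stay taken.
	for i := 0; i < 7; i++ {
		_ = lim.acquire()
	}

	// The next upload is refused even though the node is otherwise idle.
	if err := lim.acquire(); err != nil {
		fmt.Println(err)
	}
}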

There should be a connection lifetime limit, like 1h or something.

Because I had this limit, the problem just became visible. We don't see it otherwise because the number of connections is unlimited. But it is a memory and CPU leak if connections get stuck.

For a good reason, because that was what happened.

That is not correct. In fact I have to wait a full week before I am lucky to find any long-living connections. Did you hit the limit once per week or every time? That should tell you what the reason was for hitting the limit.

Negative. If an upload takes 1h and is still active we should keep it open and keep transferring data. The reason for removing the timeout in the first place was an issue with mobile apps. They have limited bandwidth but still want to upload and download large files.
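To make the distinction concrete: what would still be safe for slow mobile clients is an idle timeout that resets whenever data actually arrives, rather than an absolute lifetime limit like 1h. A rough sketch of the idea, not the storagenode's actual implementation:

package main

import (
	"fmt"
	"time"
)

// copyWithIdleTimeout drains chunks from recv and gives up only when no chunk
// arrives for the idle duration. A transfer that keeps trickling data stays
// alive no matter how long it takes; a stalled one is dropped.
func copyWithIdleTimeout(recv <-chan []byte, idle time.Duration) error {
	timer := time.NewTimer(idle)
	defer timer.Stop()
	for {
		select {
		case chunk, ok := <-recv:
			if !ok {
				return nil // sender finished normally
			}
			_ = chunk // a real node would write the chunk to disk here
			// Progress was made, so push the idle deadline forward.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(idle)
		case <-timer.C:
			return fmt.Errorf("no data for %v, dropping connection", idle)
		}
	}
}

func main() {
	recv := make(chan []byte)
	go func() {
		// A slow client: one small chunk every 2 seconds.
		for i := 0; i < 3; i++ {
			time.Sleep(2 * time.Second)
			recv <- make([]byte, 1024)
		}
		close(recv)
	}()
	// With a 5s idle timeout the slow client survives; a stuck one would not.
	fmt.Println(copyWithIdleTimeout(recv, 5*time.Second))
}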

Your limit has nothing to do with the output @Odmin provided. I don’t see the relation. Once again keep in mind that you hit your limit every day but you need to be lucky to find some open connections.

This issue is basically known since day 1 of the drpc change. It is very easy to get the /mon/ps output even if you have no limit in your config. The limit is not a requirement for this issue.

Can you show me the memory and CPU leak please? My node is keeping connections open but I don't see any impact on memory or CPU.

Today I restarted my node and there were 26 failed upload connections in the log after the shutdown request. At the same time there weren't that many open connections.
And before, when I had the limit, there were 40 stuck connections.

You can try it yourself: Guide to debug my storage node