Recently my storagenodes have been causing me some problems, and after some debugging I found the culprit: sometimes, over time, a storagenode creates ever more TCP sockets without connecting or binding them.
This goes on until hundreds of thousands of unused sockets have piled up and the whole networking infrastructure becomes sluggish and unstable.
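To illustrate what I mean, here is a minimal Go sketch of the pattern I'm seeing (hypothetical code, not storagenode's actual source): each socket is allocated at the syscall level but never connected, bound, or closed, so the file descriptor just lingers.

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Allocate raw TCP sockets the way a leak would: the kernel hands
	// back a new file descriptor each time, but nothing ever connects,
	// binds, or closes it, so the open-socket count only grows.
	for i := 0; i < 5; i++ {
		fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, 0)
		if err != nil {
			fmt.Println("socket:", err)
			return
		}
		fmt.Println("allocated (and leaked) fd:", fd) // no syscall.Close(fd)
	}
}
```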
This is only a very small subset, as the full output was over 500k lines long…
I have been a storagenode operator for several years now, and the exact same setup ran for many months without a problem. This has only been happening since around last week, so I’m guessing a recent update caused it.
Of course I’m willing to provide any further information if needed; I hope this can be fixed soon.
You’re right, it looks like a file descriptor leak. That’s a big deal.
You said this happens “sometimes”. So, not every time? Any ideas about what’s different between the times it doesn’t happen and the times it does?
When it happens, are you able to tell approximately how fast new sockets are being allocated?
Finally, if you know your debug port, could you send the output of curl 'http://your-node:debugport/debug/pprof/goroutine?debug=2'? That will dump the stacks of all existing lightweight threads (“goroutines”) and shouldn’t include any sensitive information.
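(For background, that URL is the standard Go net/http/pprof goroutine endpoint; a Go service exposes it with something like the sketch below, where the address is whatever debug address your node is configured with. This is just to show where the data comes from, not the node's actual setup code.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Hypothetical address; the storagenode uses its own configured debug port.
	log.Fatal(http.ListenAndServe("localhost:7777", nil))
}
```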
edit: we have confirmation of this on some other nodes, and we are stopping the rollout of v1.68.2 for now while we diagnose.
I’m not sure what’s different, but now that I’m watching for it, it may have always been the same node. I restarted that node around the time I wrote this post, and it already has over 31k open sockets again. The other nodes are fine for now.
Well, about 15-20k per hour.
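(I estimated that by watching the node's file descriptor count grow over time; something like this quick Linux-only loop works, though it counts all open descriptors, not just sockets. A rough sketch, with the node's PID passed on the command line:)

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: fdwatch <pid>")
		return
	}
	fdDir := "/proc/" + os.Args[1] + "/fd"
	for {
		// Each entry in /proc/<pid>/fd is one open descriptor
		// (sockets included), so the count tracks the leak.
		entries, err := os.ReadDir(fdDir)
		if err != nil {
			fmt.Println("read error:", err)
			return
		}
		fmt.Printf("%s  open fds: %d\n", time.Now().Format(time.RFC3339), len(entries))
		time.Sleep(time.Minute)
	}
}
```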
edit: updated broken link
This was taken at around 30k open sockets.
I’m afraid we hit a dead end with the lead we thought we had. We’re not even sure if the reproduction we saw is the same issue as yours. We’re still investigating, but unfortunately it won’t be fixed in the next release.