Recently my storagenodes have been causing me some problems, and after some debugging I found the culprit: sometimes, over time, one storagenode creates more and more TCP sockets without connecting or binding them.
This goes on until hundreds of thousands of unused sockets exist and all networking infrastructure becomes sluggish and unstable.
This is only a very small subset, as the output was over 500k lines long...
I have been a storagenode operator for several years now, and the exact same setup ran for many months without a problem. It's only been happening since around last week, so I'm guessing some recent update caused it.
Of course I'm willing to provide any further information if needed. I hope this can be fixed soon.
You're right, it looks like a file descriptor leak. That's a big deal.
You said this happens "sometimes". So, not every time? Any ideas about what's different between the times it doesn't happen and the times it does?
When it happens, are you able to tell approximately how fast new sockets are being allocated?
Finally, if you know your debug port, could you send the output of curl 'http://your-node:debugport/debug/pprof/goroutine?debug=2'? That will output the stacks of all existing lightweight threads ("goroutines") and shouldn't include any sensitive information.
edit: updated to add: we have confirmation of this on some other nodes, and are stopping the rollout of v1.68.2 for now while we diagnose.
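For anyone who would rather script the goroutine dump than run curl by hand, here is a minimal Go sketch of the same request. It assumes the debug endpoint is reachable over plain HTTP at the host and port you pass in. The address 127.0.0.1:7777, the output file name goroutines.txt, and the helper names goroutineDumpURL, countGoroutines and fetchDump are all made up for illustration and are not part of the storagenode itself.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strings"
	"time"
)

// goroutineDumpURL builds the pprof URL from the post above for a given
// "host:port" debug address.
func goroutineDumpURL(hostPort string) string {
	u := url.URL{
		Scheme:   "http",
		Host:     hostPort,
		Path:     "/debug/pprof/goroutine",
		RawQuery: "debug=2",
	}
	return u.String()
}

// countGoroutines counts stack headers in a debug=2 dump. Each goroutine
// starts with a line such as "goroutine 18 [IO wait, 3 minutes]:".
func countGoroutines(dump string) int {
	n := 0
	for _, line := range strings.Split(dump, "\n") {
		if strings.HasPrefix(line, "goroutine ") && strings.HasSuffix(line, "]:") {
			n++
		}
	}
	return n
}

// fetchDump downloads the dump as text, with a timeout so a sluggish node
// does not hang the tool forever.
func fetchDump(rawURL string) (string, error) {
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: dump <host:debugport>  (e.g. 127.0.0.1:7777, a hypothetical address)")
		return
	}
	dump, err := fetchDump(goroutineDumpURL(os.Args[1]))
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		return
	}
	// goroutines.txt is an arbitrary file name chosen for this sketch.
	if err := os.WriteFile("goroutines.txt", []byte(dump), 0o600); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		return
	}
	fmt.Printf("saved %d goroutine stacks to goroutines.txt\n", countGoroutines(dump))
}
```

Comparing the goroutine count from two dumps taken some time apart can hint at whether goroutines are piling up along with the sockets, though that alone does not prove where the leak is.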
I'm not sure what's different, but now that I'm observing it, it could have always been the same node. I restarted the node about when I wrote this post, and now the same node has over 31k open sockets again. The other ones are fine for now.
Well, about 15-20k per hour.
edit: updated broken link
This was taken at around 30k open sockets.
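In case it helps others measure the same thing, here is a rough Go sketch of one way to estimate a per-hour figure like the one above on Linux. It takes two snapshots of a process's file descriptors under /proc/<pid>/fd, counts the ones whose link target starts with "socket:", and extrapolates. The 10 second interval and the helper names countSocketLinks, readFDTargets and ratePerHour are assumptions made for this example. Reading another user's process usually needs matching permissions or root.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// countSocketLinks counts link targets that look like sockets. On Linux,
// readlink on /proc/<pid>/fd/<n> yields "socket:[inode]" for a socket.
func countSocketLinks(targets []string) int {
	n := 0
	for _, t := range targets {
		if strings.HasPrefix(t, "socket:") {
			n++
		}
	}
	return n
}

// readFDTargets returns the link targets of all open fds of a process.
// pid may be a numeric PID or "self". Linux only.
func readFDTargets(pid string) ([]string, error) {
	dir := filepath.Join("/proc", pid, "fd")
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	targets := make([]string, 0, len(entries))
	for _, e := range entries {
		t, err := os.Readlink(filepath.Join(dir, e.Name()))
		if err != nil {
			continue // the fd may have been closed in the meantime
		}
		targets = append(targets, t)
	}
	return targets, nil
}

// ratePerHour extrapolates the change between two counts, taken
// elapsedSeconds apart, to sockets per hour.
func ratePerHour(before, after int, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 {
		return 0
	}
	return float64(after-before) * 3600 / elapsedSeconds
}

func main() {
	pid := "self" // default: inspect this process; pass the node's PID instead
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	const interval = 10 * time.Second // arbitrary sampling interval

	start := time.Now()
	first, err := readFDTargets(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, "first snapshot:", err)
		return
	}
	time.Sleep(interval)
	second, err := readFDTargets(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, "second snapshot:", err)
		return
	}
	elapsed := time.Since(start)

	before, after := countSocketLinks(first), countSocketLinks(second)
	fmt.Printf("sockets: %d -> %d (about %.0f per hour)\n",
		before, after, ratePerHour(before, after, elapsed.Seconds()))
}
```

A short interval gives a noisy estimate, so a longer one (or several samples) is probably more trustworthy when the leak is slow.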
Has there been any progress on this? Will it get fixed in the next release?
The same node still produces lots of sockets for me, while the other ones are fine.
Transfer.sh is flagged by VirusTotal as dangerous. Even my antivirus blocked a transfer from a friend through it. I think bad actors also use it to deliver malware.
I'm afraid we hit a dead end with the lead we thought we had. We're not even sure if the reproduction we saw is the same issue as yours. We're still investigating, but unfortunately it won't be fixed in the next release.