Storagenode creates way too many TCP Sockets

Recently my storagenodes have been causing me problems, and after some debugging I found the culprit: over time, one storagenode sometimes creates ever more TCP sockets without connecting or binding them.
This goes on until hundreds of thousands of unused sockets have been created and all of my networking infrastructure becomes sluggish and unstable.
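
For anyone watching for the same symptom, the process's open-descriptor count can be compared against its limit with the standard /proc interfaces (replace <pid> with your storagenode PID):

ls /proc/<pid>/fd | wc -l             # current number of open file descriptors
grep 'open files' /proc/<pid>/limits  # soft/hard descriptor limits for the process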

Some output of lsof -p <storagenode pid>:

storageno 1026179            root 1107u  sock       0,8      0t0 353589940 protocol: TCPv6
storageno 1026179            root 1108u  sock       0,8      0t0 353588189 protocol: TCPv6
storageno 1026179            root 1109u  sock       0,8      0t0 353589177 protocol: TCPv6
storageno 1026179            root 1110u  sock       0,8      0t0 353589918 protocol: TCPv6
storageno 1026179            root 1111u  sock       0,8      0t0 353590874 protocol: TCPv6
storageno 1026179            root 1112u  sock       0,8      0t0 353591823 protocol: TCPv6
storageno 1026179            root 1113u  sock       0,8      0t0 353588203 protocol: TCPv6
storageno 1026179            root 1114u  sock       0,8      0t0 353589923 protocol: TCPv6
storageno 1026179            root 1115u  sock       0,8      0t0 353589932 protocol: TCPv6
storageno 1026179            root 1116u  sock       0,8      0t0 353589928 protocol: TCPv6
storageno 1026179            root 1117u  sock       0,8      0t0 353588208 protocol: TCPv6
storageno 1026179            root 1118u  sock       0,8      0t0 353590880 protocol: TCPv6
storageno 1026179            root 1119u  sock       0,8      0t0 353593505 protocol: TCPv6

This is only a very small subset; the full output was over 500k lines long…
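
If you just want a count rather than the full listing, filtering on the "protocol: TCP" marker visible above works (again, <pid> is the storagenode PID):

lsof -p <pid> | grep -c 'protocol: TCP'   # counts both TCPv4 and TCPv6 socket entries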

I have been a storagenode operator for several years now, and this exact setup ran for many months without a problem. It has only been happening since around last week, so I’m guessing a recent update caused it.

Of course I’m willing to provide any further information if needed. I hope this can be fixed soon 🙂


You’re right, it looks like a file descriptor leak. That’s a big deal.

You said this happens “sometimes”. So, not every time? Any ideas about what’s different between the times it doesn’t happen and the times it does?

When it happens, are you able to tell approximately how fast new sockets are being allocated?

Finally, if you know your debug port, could you send the output of curl 'http://your-node:debugport/debug/pprof/goroutine?debug=2'? That will dump the stacks of all existing lightweight threads (‘goroutines’) and shouldn’t include any sensitive information.
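
If posting it directly is unwieldy, you could save the dump to a file and sanity-check its size first. In the debug=2 format every stack begins with a "goroutine N [state]:" header, so counting those lines gives the goroutine count (host and port are placeholders, as above):

curl -s 'http://your-node:debugport/debug/pprof/goroutine?debug=2' -o goroutines.txt
grep -c '^goroutine ' goroutines.txt   # number of goroutine stacks in the dump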

edit: we have confirmation of this on some other nodes, and we are stopping the rollout of v1.68.2 for now while we diagnose.


I’m not sure what’s different, but now that I’m watching for it, it may well have always been the same node. I restarted that node around the time I wrote this post, and it already has over 31k open sockets again. The other nodes are fine for now.

As for the speed: about 15-20k new sockets per hour.
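
For anyone who wants to measure the rate on their own node, a crude sampling approach works (the PID assignment is a placeholder):

PID=<storagenode pid>                # the storagenode process id
before=$(ls /proc/$PID/fd | wc -l)   # descriptor count now
sleep 60
after=$(ls /proc/$PID/fd | wc -l)    # descriptor count a minute later
echo "$((after - before)) new descriptors in the last minute"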

edit: updated the broken link
This was taken at around 30k open sockets.

Unfortunately, that node is still on v1.67.3.


Thank you! This should help a lot. We already have some good leads from that output.

If the file sharing is a problem, you could always use https://transfer.sh/ - it’s powered by Storj!
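
For reference, uploading there is a single curl call (the filename here is just an example); the response body is the download link for the uploaded file:

curl --upload-file ./goroutines.txt https://transfer.sh/goroutines.txt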


Glad I could help 🙂
Haha, thank you, I will keep that in mind next time I need file sharing 😄

Has there been any progress on this? Will it be fixed in the next release?
The same node still produces lots of sockets for me, while the other ones are fine.

Transfer.sh is marked as dangerous by VirusTotal. Even my antivirus blocked a transfer from a friend through it. I think it is also used by bad actors to deliver malware.

Are you sure it’s .sh and not .com that is reported?

The service at transfersh.com is of unknown origin and reported as cloud malware.

I’m afraid we hit a dead end with the lead we thought we had. We’re not even sure if the reproduction we saw is the same issue as yours. We’re still investigating, but unfortunately it won’t be fixed in the next release.

Is it possible that he’s just under some kind of attack?

Yes, that’s a very distinct possibility. Even if it is, though, the node should be more resistant.

Okay, just let me know if you need any more data.