I’m seeing these messages after 6 to 48 hours of node uptime, and the node eventually crashes. This has been happening repeatedly for about the last 2-3 weeks. The node had been operating properly since storj was in beta, a few years ago.
dmesg:
[Mon Nov 6 11:11:06 2023] TCP: out of memory -- consider tuning tcp_mem
[Mon Nov 6 11:13:34 2023] TCP: out of memory -- consider tuning tcp_mem
[Mon Nov 6 11:29:25 2023] TCP: out of memory -- consider tuning tcp_mem
storj VERSION:
v1.90.2
System:
odroid-xu4
Free:
total used free shared buff/cache available
Mem: 1991 1099 142 1 749 829
Swap: 5119 73 5046
I don’t know whether this number of open sockets is normal for storj.
I have seen storj using about 99 TCP connections when I manually check on it, but I caught this burst while the errors were showing up in dmesg.
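For reference, one way to count the node’s connections from the host (assuming the container is named storagenode, which may differ on your setup; container processes share the host PID space, so ss run as root can attribute them):

# Count TCP sockets belonging to the storagenode process:
sudo ss -tanp | grep -c storagenode
# Break the count down by TCP state to spot pile-ups (CLOSE-WAIT, FIN-WAIT-2, etc.):
sudo ss -tanp | grep storagenode | awk '{print $1}' | sort | uniq -c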
This hardware has been running since beta, so a few years, without any of these changes to the tcp kernel settings. It has been using the defaults determined by the kernel all this time.
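For reference, the defaults in question can be read back on the host. tcp_mem is three thresholds in 4 KiB pages (low / pressure / high), sized by the kernel at boot and separate from overall free memory; the dmesg warning fires when TCP buffer memory hits the high mark:

# Kernel-autotuned thresholds, in pages: low / pressure / high.
cat /proc/sys/net/ipv4/tcp_mem
# Same values via sysctl:
sysctl net.ipv4.tcp_mem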
The way this looks, something may have changed recently, possibly in the storj code or in the behaviour of the network over the last few weeks, causing hung TCP connections. I don’t see any TCP-related code changes recently, but I haven’t read through every commit.
I checked my dmesg again, and it looks like my docker container host is dropping networking after a few of the TCP out-of-memory errors. The container logs say the container itself didn’t restart, but networking definitely restarted.
Here I started the container at [Tue Nov 7 09:48:55 2023].
Then about an hour and a half later, at [Tue Nov 7 11:11:29 2023], the TCP out-of-memory errors show up.
The container resets its networking at [Wed Nov 8 08:50:32 2023].
[Tue Nov 7 09:48:55 2023] docker0: port 2(veth95fe5e9) entered forwarding state
[Tue Nov 7 11:11:29 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 11:31:21 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 11:32:39 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 11:35:06 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 11:38:39 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 17:59:45 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 18:07:45 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 06:53:51 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 07:27:32 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 07:56:21 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 08:37:50 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 08:50:32 2023] docker0: port 2(veth95fe5e9) entered disabled state
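In case someone wants to catch this before the networking drops: /proc/net/sockstat reports how much buffer memory TCP has actually allocated (the mem field, also in pages), which can be compared against the high mark from tcp_mem. A minimal watch loop:

# Log TCP socket counts and memory every 60 s; compare "mem" against
# the third (high) number in /proc/sys/net/ipv4/tcp_mem.
while true; do
    echo "$(date '+%a %b %e %T') $(grep '^TCP:' /proc/net/sockstat)"
    sleep 60
done >> /var/log/tcp-sockstat.log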
I’ve seen this happen with a (cheap consumer) router misbehaving because its routing table was overflowing. In essence, it passed the initial connection packets, but at some point the connection mapping was dropped and the node never received a proper disconnection packet.
Eventually the connections were timed out by the node’s operating system; in my case I didn’t have any other adverse effects, but it still wasn’t pretty.
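If a router really is silently dropping its connection mappings, one possible mitigation on the node host is shortening the TCP keepalive timers so dead connections are reaped in minutes rather than the default two-plus hours. Caveat: this only affects sockets opened with SO_KEEPALIVE, so whether it applies to the node’s connections is an assumption; the values below are illustrative, not tested recommendations:

# Defaults are 7200 s / 75 s / 9 probes, i.e. over 2 h before a dead peer is noticed.
sysctl -w net.ipv4.tcp_keepalive_time=300    # first probe after 5 min idle
sysctl -w net.ipv4.tcp_keepalive_intvl=60    # then probe every 60 s
sysctl -w net.ipv4.tcp_keepalive_probes=5    # declare the peer dead after 5 failures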
I’m fully behind your theory. After some amount of time, uploads simply can’t complete, which overwhelms something at the network layer.
I don’t know why it’s happening. I’ve been running the same setup / routers etc. for 3 years now.
The only thing that’s different: my config.yaml had only the default values on this new node. I’m trying the tweaked config from one of my existing nodes now, and will check the result.
Anyway, my node died a few times during the night, but at least I got an exit code 137 for that (128 + SIGKILL, typically the OOM killer), instead of like earlier, when the node simply lost its network connection and hung forever.
I assume the new TCP memory settings had an effect, and my node simply ran out of memory this time.
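For anyone else hitting this, the two checks/changes discussed above look roughly like this (container name, file name and values are illustrative, not tested recommendations):

# Was the 137 the OOM killer? (137 = 128 + SIGKILL)
docker inspect -f '{{.State.OOMKilled}}' storagenode
dmesg | grep -i 'killed process'

# Raise the tcp_mem high mark, e.g. roughly doubling the autotuned values
# (pages: low / pressure / high), and persist it across reboots:
echo 'net.ipv4.tcp_mem = 45000 60000 90000' | sudo tee /etc/sysctl.d/90-tcp-mem.conf
sudo sysctl --system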