TCP out of memory

I’m seeing these messages after 6 to 48 hours of node uptime, and the node eventually crashes. This has been repeatable for about the last 2-3 weeks. The node has been operating properly since storj was in beta, a few years ago.

I could not find any open issues related to tcp at Issues · storj/storj · GitHub.

dmesg
[Mon Nov 6 11:11:06 2023] TCP: out of memory -- consider tuning tcp_mem
[Mon Nov 6 11:13:34 2023] TCP: out of memory -- consider tuning tcp_mem
[Mon Nov 6 11:29:25 2023] TCP: out of memory -- consider tuning tcp_mem


storj VERSION:
v1.90.2

System:
odroid-xu4

Free:
               total        used        free      shared  buff/cache   available
Mem:            1991        1099         142           1         749         829
Swap:           5119          73        5046

Kernel:
uname -a:
Linux storj1 5.4.256-267 #1 SMP PREEMPT Mon Sep 11 14:40:45 EDT 2023 armv7l armv7l armv7l GNU/Linux


I dug into the log messages and found it’s the storagenode process:

get_sockets.sh:

#!/bin/bash
# print each /proc/<pid>/ directory followed by its count of open socket fds
for d in /proc/*/; do
  echo "==="$d"==="
  sudo ls -l $d/fd | grep -c socket
done

./get_sockets.sh > sockets.txt

cat sockets.txt | grep -v === | sort -nr | head
699
91
42
25
23
20
16
14
13
13

sockets.txt
===/proc/11969/===
699

systemd-cgls:
│ ├─11969 /app/storagenode run --config-dir config --identity-dir identity --metrics.app-suffix=-alpha --metrics.interval=30m --version.>


After 2 crashes, I modified all the relevant tcp memory settings with the same result:

/etc/sysctl.conf
net.core.netdev_max_backlog=30000
net.core.rmem_max=134217728
net.core.wmem_max=134217728
net.ipv4.tcp_max_syn_backlog=8192
net.ipv4.tcp_rmem=4096 87380 67108864
net.ipv4.tcp_wmem=4096 87380 67108864
net.ipv4.tcp_mem = 17196 22934 34392
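
(For anyone reproducing this: the values can be reloaded and verified without a reboot, e.g.:)

sudo sysctl -p                      # reload /etc/sysctl.conf
sysctl net.ipv4.tcp_mem             # confirm the running values
cat /proc/sys/net/ipv4/tcp_mem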

Please let me know what details you need to help troubleshoot this, or if it is already a known issue.

Please try to use these values: TCP: Out of Memory — Consider Tuning TCP_Mem - DZone

Please note my sysctl.conf settings listed in my post, as they are from TCP: Out of Memory — Consider Tuning TCP_Mem - DZone.

Then I have nothing more to suggest. This is the first time I have seen such a problem.


Are the spaces here relevant?

My system has these:

net.ipv4.tcp_mem = 303354       404472  606708
net.ipv4.tcp_rmem = 4096        131072  6291456
net.ipv4.tcp_wmem = 4096        16384   4194304

and nothing in /etc/sysctl.conf

Good catch, but spaces are not relevant here. This can be checked by verifying, e.g.:

cat /proc/sys/net/ipv4/tcp_wmem
4096 87380 67108864

No time to set up a VM to verify :man_running:

I don’t know if this number of open sockets is normal for storj?

I have seen storj using about 99 tcp connections when I manually go in and check on it, but I caught this burst when the errors were showing up in dmesg.
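
(To catch the burst live, assuming a util-linux dmesg that supports -w and -T, something like this follows the kernel log with readable timestamps:)

sudo dmesg -wT | grep -i "out of memory"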

This hardware has been running since beta, so a few years, without any of these changes to the tcp kernel settings. It has been using the defaults determined by the kernel all this time.

The way this looks, it’s like something may have changed recently.

Possibly in the storj code or the behaviour of the network in the last few weeks, causing hung TCP connections? I don’t see any TCP-related code changes recently, but I haven’t read through every code check-in.

How do you measure this number of open sockets?

Above is the process I used to find the highest user of sockets on my system. Here is the process broken down in more detail for you.

Create a file called ./get_sockets.sh that looks like the following.
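
(This is the same script as at the top of the thread:)

#!/bin/bash
# print each /proc/<pid>/ directory followed by its count of open socket fds
for d in /proc/*/; do
  echo "==="$d"==="
  sudo ls -l $d/fd | grep -c socket
done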

Set the file to executable using chmod +x get_sockets.sh.

Then run it and output to a file for processing in the next step:
./get_sockets.sh > sockets.txt

Then run this to sort the file by number of tcp sockets open:
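
cat sockets.txt | grep -v === | sort -nr | head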

In my case it looked like this:
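
699
91
42
25
23
20
16
14
13
13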

Then open the file with any editor like vi, and search for the first/highest number to see what process it is. In my case it was 11969:
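
===/proc/11969/===
699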

Then correlate the process ID to the command with the following. In my case the process was storagenode:
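
(Same as in my first post, via systemd-cgls:)

│ ├─11969 /app/storagenode run --config-dir config --identity-dir identity --metrics.app-suffix=-alpha --metrics.interval=30m --version.>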

114 right now for a node with active uploads, 51 for one that is out of free space, and hence only handles downloads.

I don’t find your numbers out of the ordinary; I can imagine many hundreds of connections being made during traffic peaks.
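
(For comparison, assuming the node runs as a process named storagenode, a quick way to count its TCP connections is:)

sudo ss -tnp | grep -c storagenode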

Same issue here, and I tried the same solution. Waiting for the result at the moment.

This issue managed to freeze my docker host VM for the first time in 3 years :slight_smile:

I was watching my logs for a while and a strange thing showed up immediately:

I have many more “upload started” log entries than “uploaded” ones. As far as I remember, they are supposed to appear in roughly equal numbers.
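
(A rough way to compare the counts, assuming a docker container named storagenode and these exact log phrases:)

docker logs storagenode 2>&1 | grep -c "upload started"
docker logs storagenode 2>&1 | grep -c "uploaded"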

I checked my dmesg again, and it looks like my docker container host is dropping networking after a few of the TCP out of memory errors. The container logs say the container itself didn’t restart, but the networking definitely restarted.

Here I started the container at [Tue Nov 7 09:48:55 2023].

Then about an hour and 20 minutes later, at [Tue Nov 7 11:11:29 2023], the TCP out of memory errors show up.

The container resets its networking at [Wed Nov 8 08:50:32 2023].

[Tue Nov 7 09:48:55 2023] docker0: port 2(veth95fe5e9) entered forwarding state
[Tue Nov 7 11:11:29 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 11:31:21 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 11:32:39 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 11:35:06 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 11:38:39 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 17:59:45 2023] TCP: out of memory -- consider tuning tcp_mem
[Tue Nov 7 18:07:45 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 06:53:51 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 07:27:32 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 07:56:21 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 08:37:50 2023] TCP: out of memory -- consider tuning tcp_mem
[Wed Nov 8 08:50:32 2023] docker0: port 2(veth95fe5e9) entered disabled state

I’ve seen this happening with a (cheap consumer) router misbehaving due to its routing table overflowing. In essence, it was passing the initial connection packets, but at some point the connection mapping was dropped and the node never received a proper disconnection packet.

At some point the connections were timed out by the node’s operating system, and in my case I didn’t have any other adverse effects, but it still wasn’t pretty.

This was fun to debug.

I’m right behind your theory. After a certain point in time, the upload process simply can’t complete, which overwhelms something on the network layer.

I don’t know why it’s happening. I’ve been running the same setup / routers etc. for 3 years now.

The only thing that’s different: my config.yaml had only the original values on this new node. I’m trying the tweaked version from one of my existing nodes now, and will check the result.

Anyway, my node died a few times during the night, but at least I got exit code 137 for that, unlike earlier, where the node simply lost the network connection and hung forever.

I assume the new TCP memory settings had an effect, and my node simply ran out of memory this time.

It feels like I solved it with my old config file. But we’ll see. I’ll come back if not :slight_smile:

The only relevant setting is:
filestore.write-buffer-size: 8096.0 KiB

Interesting. My router and storj node have also been unchanged for a long time, since the beta, so a few years.

I haven’t changed any config.yaml settings at all yet.

I took the node offline to run a full e2fsck on it, so it’s out of commission for half a day.

I’ll have to take a look at that write buffer size setting.

Nope. It did not solve anything. It crashed again later.

I’m also thinking about running fsck.