Ubuntu 22 kswapd and general RAM issue

I have a server on Ubuntu 22 with some nodes on it. After weeks of normal operation (near 0% NVMe use) it starts using the NVMe for reading and writing, and kswapd uses 30-40% of the CPU.
vm.swappiness = 1 (I tried swapoff -a but nothing changed)
vm.vfs_cache_pressure = 50
I don't understand whether this is related to the storagenode logs or something else.
The nodes are installed in Docker with the default config.
PS: could setting the log level to warn help?
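
For reference, the values above and the kswapd activity can be checked with standard tools, for example:

# current values of the tunables mentioned above
sysctl vm.swappiness vm.vfs_cache_pressure vm.min_free_kbytes
# snapshot of kswapd/kcompactd CPU usage
top -b -n 1 | grep -E 'kswapd|kcompactd'
# si/so columns show swap in/out per interval, wa shows IO wait
vmstat 5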

Usually nodes consume more memory if you have slow storage, so you need to fix that issue if possible.
Do not use NTFS, zfs (a single drive without an SSD cache), exFAT or BTRFS; use only ext4 if this is Linux. Some SNOs have reported that even XFS has issues, so the best bet is to use ext4 wherever possible.
The warn log level will not help in any way.
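
If you are not sure what a given disk is formatted with, it can be checked like this (the mount point is just an example):

# filesystem type of a node's data mount (replace /mnt/node1 with your path)
findmnt -no SOURCE,FSTYPE /mnt/node1
df -T /mnt/node1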

I’m on ext4 but using an “old” JBOD connected with an even older 9200-8e. I’m going to work around it with a script in crontab that frees RAM from time to time: sync; echo 3 > /proc/sys/vm/drop_caches
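
For reference, the entry I have in mind looks roughly like this (the schedule is arbitrary):

# root crontab (crontab -e as root): drop caches once a day at 04:00
0 4 * * * sync; echo 3 > /proc/sys/vm/drop_caches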

Why would you do this? What do you think this will accomplish?

High kswapd CPU use in the presence of free RAM might be a sign of memory fragmentation; I have observed this myself. IIRC, setting the vm.min_free_kbytes sysctl to some larger number helped.

How can I measure RAM fragmentation?

PS: I have vm.min_free_kbytes = 67584. What do you suggest?

I noticed that I have this problem only after more than a week since a reboot, and the last time I manually freed up the entire RAM it resolved the high NVMe usage.
I'm here to listen to your advice. I'm not a sysadmin.

Check where the nodes write their logs, the orders and the databases. If there are several nodes and the storage is slow, it could get bottlenecked / the log files could become bloated, maybe.

Just a guess, I don't know Linux inside out.
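
One way to check that (assuming Docker's default json-file logging driver, a container named storagenode, and example data paths):

# where Docker writes the container's log, and how big it has grown
docker inspect --format '{{.LogPath}}' storagenode
du -sh "$(docker inspect --format '{{.LogPath}}' storagenode)"
# size of the orders and databases (paths are examples; adjust to your data dir)
du -sh /mnt/node1/orders /mnt/node1/storage/*.db 2>/dev/null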

Basically, the RAM acts like a cache when the drive can't keep up. So, is the drive being used for something else at the same time as running the node? Is it an SMR or a CMR drive? How fast is your connection? Some RAM use is not a big deal unless you find that it impacts your node's performance in some way.

Please give us a hint about the drives, how many nodes, and how much data.

Flushing RAM to SSD is treating the symptoms. It's impossible to say without actually measuring, but I think this is what happens:

  • some other process is leaking pages
  • pages are not paged out at a sufficient rate
  • this gradually evicts the disk cache
  • this increases the IO load on the disk
  • which results in higher response times
  • the node caches data in RAM, reducing available memory even more
  • until everything grinds to a halt.

Flushing data to swap is a band-aid and treats only secondary symptoms. It's pointless.

You need to find out which process is leaking pages, and ensure that most of the RAM is available at all times for disk caching.
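
A rough way to see who holds the memory, using standard tools (interpreting the output is still up to you):

# top 15 processes by resident memory
ps -eo pid,rss,comm --sort=-rss | head -n 15
# overall picture: free/available RAM vs. cache, anonymous pages and slab
grep -E 'MemFree|MemAvailable|Cached|AnonPages|Slab' /proc/meminfo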

Since the node works for a while after memory is made available, there is enough RAM in the system to sustain this use case.

Separately, you can take steps to reduce disk IO — disabling access time updates and write synchronicity helps a lot.
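
For the access-time part, a minimal sketch (the device and mount point below are examples; relaxing write synchronicity is filesystem-specific and trades crash safety for speed, so it is not shown here):

# /etc/fstab entry (example): ext4 data disk mounted without atime updates
/dev/sdb1  /mnt/node1  ext4  defaults,noatime  0  2
# or apply to a live mount without rebooting
mount -o remount,noatime /mnt/node1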

Memory fragmentation is not a thing on modern architectures. Memory is virtualized, and a 64-bit address space is plenty to accommodate even the most ridiculous access pattern. Or is your device 32-bit?

See this Stackoverflow article, specifically the cat /proc/buddyinfo command. For example on my NAS with 16GB RAM:

root@…:~# cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      2      2 
Node 0, zone    DMA32   3272  18652   7369   2570    625     33      1      0      0      0      0 
Node 0, zone   Normal   6557   9962  14916   8801   4556   1531     64     16      3      7      0 

Each column represents the number of contiguous free chunks of 2ⁿ pages, starting from 4 kiB chunks in the leftmost column and ending at 4 MiB chunks in the rightmost column. You can see that my node has no free 4 MiB chunks, 7 chunks of 2 MiB, etc.

It's bad if the rightmost columns are all zeros. How many of them need to be non-zero depends on your hardware and software. kswapd gets hyperactive when an allocation of a large contiguous chunk fails. Usually software does not require contiguous chunks, but hardware drivers or options like huge pages sometimes do. So, let's say your hardware drivers like allocating 64 kiB buffers; then at least the fifth column should be non-zero. Figuring out which allocations are needed in your specific case seems to be somewhat difficult, though, and I didn't bother.
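
If you want the sizes spelled out, a small helper like this can label each column (a sketch; it assumes 4 kiB pages and only looks at the Normal zone):

# print the Normal zone of /proc/buddyinfo with chunk sizes labelled
awk '/Normal/ {
    printf "%s %s zone %s:\n", $1, $2, $4
    for (i = 5; i <= NF; i++)
        printf "  %7d kiB chunks free: %s\n", 4 * 2^(i - 5), $i
}' /proc/buddyinfo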

vm.min_free_kbytes helps in the sense that it tries to keep some amount of RAM free of both software allocations and buffers/cache, so there is a higher chance that a large chunk of contiguous free memory is available. As such it is a dirty workaround, but it's good enough for me. On my NAS it is set to 1048576, that is, 1 GiB, and I've mostly stopped seeing kswapd take any nontrivial CPU time. What number is right for you, you will need to figure out on your own.
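
For example, to experiment with a larger reserve (the 1 GiB value here is just what worked on my NAS; pick your own):

# try a larger reserve at runtime (1 GiB)
sysctl -w vm.min_free_kbytes=1048576
# make it persistent across reboots
echo 'vm.min_free_kbytes = 1048576' > /etc/sysctl.d/90-min-free.conf
sysctl --system    # reload all sysctl config files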

Some details:
JBOD connected to the server at 6 Gbit (old LSI 9200-8)
1 enterprise-grade HDD (18 TB Seagate/Toshiba) per node. No other service uses them.
10/2 Gbit FTTH connection (this is not the bottleneck, I'm sure)
vm.swappiness = 1
vm.vfs_cache_pressure = 50

Swappiness is 1 and swap is always 99% free (2 GB), even when the system starts to use the NVMe intensely.
kcompactd is using 15-20% of the CPU now; when the problem grows, kswapd starts to use even more CPU than kcompactd (CPU: Ryzen 4650G).

@Toyoo
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2
Node 0, zone    DMA32  13366   8043   4156   1043    746    471    244    137     45      1      0
Node 0, zone   Normal 220998 512044 516227   2415    230    183    143     94     48     17      0

Same here, but I need to wait until the problem comes back; it usually takes some days after I flush my RAM. My vm.min_free_kbytes = 67584.

RAM (PageSize: 4 KB; High-Memory and Low-Memory not in use):

                 RAM-Memory   Swap-Space
Total (MB)         128091.7       2048.0
Free (MB)           13600.6       2046.2
Free Percent          10.6%        99.9%

Linux Kernel Internal Memory (MB):
Cached  = 17494.7   Active     =  4755.4
Buffers =  5400.8   Swapcached =     0.0   Inactive   = 30226.3
Dirty   =    85.2   Writeback  =     0.0   Mapped     =  1667.9
Slab    = 78507.7   Commit_AS  = 92092.8   PageTables =   163.9

So, it’s not memory fragmentation. Well, at least one potential cause crossed out.
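
One thing that does stand out in the memory breakdown above is Slab at roughly 78 GB out of 128 GB. A quick way to see whether it is reclaimable and which kernel caches it consists of (slabtop ships with the procps package on Ubuntu; run as root):

# reclaimable vs. unreclaimable slab memory
grep -E 'SReclaimable|SUnreclaim' /proc/meminfo
# largest kernel slab caches, sorted by cache size
slabtop -o -s c | head -n 20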