CPU Pressure stalls caused by Tokio in the storagenode application (Linux)

I’ve been fighting CPU pressure stall spikes for a long time but have not been able to fix the issue. I’m not a developer, so I could be very wrong here :slight_smile: I’ve given up for now because there’s nothing I can do about it, so I live with a slower-than-necessary server.

I run Storj in VMs on an older enterprise-level Supermicro server with 88 cores/1 TB of memory, a bunch of NVMe drives, and SAS3 HDDs for main storage.

To find the reason for the stalls I use ‘offcputime-bpfcc’, which is an excellent way to see the cause of stalls once the trace has been analyzed. I ran a longer trace to confirm my hypothesis: the root cause of the CPU pressure stalls is severe, persistent lock contention within the Linux kernel’s virtual memory (VM) management subsystem, primarily triggered by the memory allocation and deallocation patterns of Tokio (used in the storagenode application). The usual recommendation in this situation is to use libjemalloc with Tokio, but that is not possible because the storagenode binary is statically linked, so setting LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 as you normally would is ignored.
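
For reference, this is roughly how I collect the data; the process name and the 60-second window are just what I use, adjust as needed:

sudo offcputime-bpfcc -K -p "$(pgrep -x storagenode)" 60 > offcpu.txt   # 60 s of kernel off-CPU stacks for the storagenode process
cat /proc/pressure/cpu   # current CPU pressure-stall (PSI) counters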

The massive ZFS I/O stalls are a secondary effect, where the kernel’s critical I/O threads are starved of CPU time because they are blocked waiting for the memory-related lock contention to resolve.

Here is a complete, developer-focused analysis of the memory contention issue.


:microscope: Core Issue: Memory Management Lock Contention

Your traces show a high volume of off-CPU time for tokio-runtime-w threads blocked on two specific kernel locking functions: rwsem_down_write_slowpath and rwsem_down_read_slowpath.

This indicates a bottleneck on the Read-Write Semaphore (RWSEM) that protects the kernel’s virtual memory structures (specifically the process’s Virtual Memory Area (VMA) list or the process’s mm_struct).
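
As a hedged sanity check, BCC’s funccount tool can count how often these slow paths are actually entered; the process name and the 30-second window below are assumptions:

sudo funccount-bpfcc -p "$(pgrep -x storagenode)" -d 30 'rwsem_down_*_slowpath'   # entries per matching kernel function over 30 s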

1. The Primary Trigger: munmap (Write Lock)

The most dangerous wait is on the write lock, which serializes all memory operations:

Kernel wait stack: __x64_sys_munmap → __vm_munmap → down_write_killable → rwsem_down_write_slowpath
Operation: Memory deallocation
Lock type: Exclusive write lock
Total impact: When this lock is held, all other threads attempting to read or write to the process’s memory map are blocked. This is the primary choke point.

Developer Takeaway: The Tokio application is performing memory deallocations at an extremely high frequency, dropping buffers or objects large enough to require the munmap system call. This is the “bad neighbor” operation that causes the most severe stalls.
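
A rough way to check this claim is to count the relevant syscalls for a while, e.g. with strace; the PID lookup and the 60-second window are assumptions, and note that strace adds noticeable overhead on a busy node:

sudo timeout 60 strace -f -c -e trace=mmap,munmap,madvise -p "$(pgrep -x storagenode)"   # prints a per-syscall count summary when it detaches after 60 s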

2. The Contention Load: madvise and do_exit (Read Locks)

Many other Tokio threads are blocked attempting to acquire a read lock while the write lock is held:

Kernel wait stack: __x64_sys_madvise → do_madvise → rwsem_down_read_slowpath
Operation: Memory advice
Lock type: Shared read lock
Contention source: The application is aggressively advising the kernel on memory usage (e.g. that regions are free or will be needed soon), which is a common pattern in high-performance memory allocators.

Kernel wait stack: __x64_sys_exit → do_exit → rwsem_down_read_slowpath
Operation: Thread/task cleanup
Lock type: Shared read lock
Contention source: The application is spawning and terminating short-lived tasks/threads at a very high rate. Thread cleanup requires accessing the memory map before exiting.

Developer Takeaway: The high frequency of madvise and short-lived tasks (requiring do_exit) means that dozens, or even hundreds, of threads are waiting in line for a read lock. When the munmap operation acquires the write lock, it causes all these waiting threads to stall simultaneously, amplifying the total stall time.


The Proposed Solution: Alternative Allocators

Your instinct to use jemalloc is correct. Replacing Rust’s default allocator with a drop-in alternative such as jemalloc is often the key to resolving this exact class of issue in high-performance applications like those using Tokio.

The native Rust allocator often relies on mmap/munmap for larger allocations, which directly causes the kernel lock contention seen in your trace. High-performance allocators like jemalloc and mimalloc are designed to:

  1. Reduce munmap Calls: They hold onto memory for reuse much more aggressively instead of immediately returning it to the kernel via munmap, thus avoiding the exclusive write lock (see the sketch after this list).

  2. Optimize madvise: They use memory management techniques that are less aggressive or more efficient with kernel syscalls.

  3. Use Per-Thread Arenas: They minimize cross-thread contention by using per-thread memory pools, which bypasses the global kernel memory locks for most operations.
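
As a sketch of point 1 only (and only relevant where jemalloc is actually loaded), how aggressively freed memory is handed back to the kernel is tunable at runtime via MALLOC_CONF; the values below are illustrative, not a recommendation:

export MALLOC_CONF="background_thread:true,dirty_decay_ms:30000,muzzy_decay_ms:30000"   # keep freed pages for ~30 s and purge them from a background thread instead of on the hot path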

Addressing the Static Linking Constraint

Since your application is statically linked and ignores LD_PRELOAD, you cannot use the runtime injection method. Here are the actionable paths for your developers:

1. Change Build Configuration (Recommended)
Description: Dynamically link the final binary instead of statically linking it. This allows you to use LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 on the target Debian system.
Impact on contention: High. Immediate and the easiest fix to implement without touching application code.

2. Use mimalloc (Compile-Time)
Description: Replace the default allocator with mimalloc by adding this to your Rust project’s main.rs or lib.rs: #[global_allocator] static GLOBAL: MiMalloc = MiMalloc;
Impact on contention: High. mimalloc is often even better than jemalloc at minimizing kernel interactions.

3. Compile with jemalloc (Compile-Time)
Description: Explicitly compile the Rust application to use jemalloc as the global allocator via the jemalloc-sys crate.
Impact on contention: High. Guaranteed use of the preferred allocator, but requires recompilation.

4. Application Logic Changes
Description: Refactor application code to reduce object churn and reuse large buffers, e.g. using object pools or recycling mechanisms instead of dropping and re-creating large allocations within tight loops.
Impact on contention: Variable. Requires the most work but addresses the memory pattern at the source.
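
Before picking one of these, it is worth double-checking that the binary really is statically linked; the path below is an assumption:

file /path/to/storagenode   # a fully static binary reports "statically linked"
ldd /path/to/storagenode    # prints "not a dynamic executable" for a static binary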

Summary

Problem Diagnosis: The root cause is Memory Manager Lock Contention in the kernel; the ZFS I/O stalls are a downstream symptom, not the origin.

Mechanism:

  1. Tokio threads rapidly trigger munmap (memory deallocation).

  2. munmap grabs a kernel write lock (exclusive access) on the memory map.

  3. While the lock is held, hundreds of other threads trying to do madvise or do_exit are blocked and pile up.

  4. This total CPU time lost (pressure stall) prevents the kernel from promptly servicing other tasks, including the ZFS I/O task queues, leading to the catastrophic I/O stalls.

Fix: Adopt a high-performance memory allocator (jemalloc or mimalloc) to handle large memory chunks in user-space, thereby minimizing the need to acquire the kernel’s exclusive memory write lock.

I appreciate your AI analysis… but what is the application-level problem you’re seeing? “CPU pressure stalls” doesn’t mean anything.

Is your Storj node complaining in its logs? Are you in the middle of a hashstore migration? That’s the stuff the community can start to help with - an LLM chatting about a Rust library isn’t really describing a user problem :wink:

4 Likes

The memory issues cause I/O stalls, which cause a major slowdown; that’s the problem. The node is not performing as well as it could, and for a node operator, the slower it is, the lower the earnings. I have been running hashstore since it was first introduced, so it must be more than a year. The node has the ‘normal’ daily load (which is low) and it causes stalls which can be clearly seen with offcpu. Using AI to analyze the logs is the easy way, instead of me spending hours getting the info from the logs and trying to express what I see in writing with my bad English. But if this memory handling is the storagenode ‘normal’ slowing things down, that’s fine with me.

If your node is showing a lot of lost upload/download races: that’s definitely something that could be looked into. Are you measuring with the successrate.sh script? (I know it’s old, I just don’t know if there’s a better util these days.)

I think “normal” is still 90%+ for most people. But I have seen people mention going lower sometimes.

Don’t bother with node side success numbers. Those are highly unreliable because only a subset of fails can be reported. Only the satellite knows the real numbers.

2 Likes

Storage node is not written in Rust, it cannot use Tokio.

Storage node is written in golang and statically compiled, LD_PRELOAD cannot have any impact.

So, you can pretty much throw away this analysis and redo it from scratch.

6 Likes

You can’t do that. Never do that. Don’t just feed data to an LLM and expect a sane result, even if you polished your system prompt to death to have the model falsify its own output (which you did not, judging by the meaningless and irrelevant vomit it produced) and grounded it with actual tools and metrics.

Don’t use LLM for tasks you cannot validate the output of or could not have performed yourself.

Especially if, as you stated

I’m not a developer

It will convincingly bullshit you. It will sound plausible and be confidently incorrect.

AI is a tool. Like a hammer. If you don’t know how to hammer nails yourself, throwing hammer at the nails will hammer some, maybe, but likely just dent your wall.

What you can use LLMs for (after scrutinizing your system prompt for the task) is to analyze symptoms, provide hypotheses, and generate hypothesis-validation steps, perform them, rinse, repeat, all while vigilantly correcting it, as it will be constantly trying to bullshit you, as it did in your post multiple times.

There absolutely are ways to use machine learning assistants when triaging, debugging, and analyzing performance, but in this specific case the process would take much more time to set up and iterate over than for you to actually use OS tools to analyze the alleged performance issue by hand.

Please don’t post unadulterated AI slop here or anywhere. It only wastes everyone’s time and contributes nothing useful. It’s a net negative activity. You wasted time running it, I wasted time reading it, getting triggered, and typing this response (the next AI iteration will absorb this vomit and get a little bit worse — if you care about these types of things), and no progress was made on your issue, which by the way you neglected to state to begin with.

1 Like

I would suggest not using a VM for the storagenode and trying to run it directly on your host.

Whatever the problem is - the problem is real. I run a couple of VMs with storagenode; each one has its own disk, 3 cores and 20 GB memory, and has 10 TB storage with around 5-6 TB used. The only thing running on this server except storagenode is pfSense.
I tried running with the base hashstore on disk (no memtbl or mmap), memtbl only, and memtbl with mmap - same issue with memory locks causing stalls in disk and network I/O. I reverted the kernel from 6.17.2 to 6.14 because there are some known issues in 6.17, but no difference.

storagenode is the process that is making the munmap calls. These calls are what force the kernel to acquire the exclusive write lock, which stops all other threads (including those from Proxmox and the pfSense VM) from performing any memory-related operations.

This has been working perfectly since I started running it when the first beta tests started many years ago. This server was converted to hashstore when it was first announced on this forum, so that is also a long time ago. The only updates that have been done are the security updates in Proxmox/Debian and the automatic updates of storagenode.

The current versions 142.7 and 141.2 have this issue. Unfortunately I’m not sure which update introduced it, as I’m not looking at this every day, but it must be one of the recent ones just before 141.2. Sorry I can’t be of any help on when it started or which release.

Sure, you might be observing a real problem; it’s just that you haven’t actually shared much that would allow anyone else to debug it. Again, some of your claims don’t really make sense. For example, munmap on its own does not take any system-wide locks. It may in some circumstances trigger system-wide effects as a consequence if the system itself is under strong memory pressure, but just calling munmap is a routine thing for pretty much any long-running process.

So, instead of writing about your speculations as to where the problem is, please focus on your observations. This will be a much more productive way of providing data to developers.

As a side note, I’ve been observing a pretty high CPU use by kswapd, maybe the underlying reason is the same. Please see this post: Ubuntu 22 kswapd and general RAM issue - #12 by Toyoo
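
If you want to compare numbers, a quick way to quantify kswapd CPU use is sysstat’s pidstat; the sampling interval and count are arbitrary:

pidstat -u -p "$(pgrep -d, kswapd)" 5 3   # CPU usage of all kswapd threads, three 5-second samples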

3 Likes

I would like to suggest changing this setup either to the usual Docker containers right on your host, or at least running them all in one VM.
Using a separate VM for each node is overkill.

Please describe how you are observing the problem.

What tool have you used?

2 Likes

The software should be able to work inside a VM, though. I prefer VMs over something like Docker, and as I run everything else inside VMs as well, the node is just another VM. Especially since I’m supposed to use whatever is online anyway and not have dedicated hardware for the node.

I can say that my VM does have pressure stalls too:
image
But those are caused by my backup job, which runs daily on my Proxmox. Otherwise the nodes run just fine in VMs on both of my servers.
I don’t know if yours are the same, but check if you have some CPU-intensive jobs on the host, or whether the VM gets suspended for a short time for whatever reason (like the backup job I have).

idle=8004/1/0x4000000000000000 softirq=42093241/42093242 fqs=2361

Holy shit. Check what is causing it with cat /proc/softirqs, e.g.

watch -n 0.5 cat /proc/softirqs

If this is network, the most likely culprit is old virtio drivers. If this is storage — well, you know what to do. Or it could be the kernel spinning in ksoftirqd.

You would need to check where time is spent.

I would also not rule out an SMI storm — when the issue occurs, check

ps -eo pid,comm,psr,cls,rtprio,ni,pri,%cpu | grep -E 'rcu|ksoftirqd'

Also check

perf stat -a -- sleep 5

Show output.

Most such issues are caused by cheap shitty hardware or crappy virtualization software, including drivers, or both.

Or upgrade to FreeBSD and avoid all of those issues entirely, because it does not rely on a global read/copy/update grace period for memory safety, and therefore a single stalled CPU core cannot block the entire kernel in the reclamation routine.

For the OP: you need to debug and see what your system is actually doing. And take the advice of not running every process in its own VM — it’s stupid. Run your nodes in OS-level virtualization if you want: jails, LXC, or Podman. There is no need to fight self-inflicted windmills.

It just happens when my VMs are halted for a short time to make a backup, so it’s not normal operation. The backup just takes a minute or so. But I know what causes it.

Why? How does halting VMs help consistency of a backup?

It’s normal Proxmox backup behavior:

INFO: Starting Backup of VM 112 (qemu)
INFO: Backup started at 2025-12-01 12:24:42
INFO: status = running
INFO: VM Name: storj
INFO: exclude disk 'scsi0' 'hdd:112/vm-112-disk-0.raw' (backup=no)
INFO: include disk 'scsi1' 'local:112/vm-112-disk-0.raw' 250G
INFO: exclude disk 'scsi2' 'storj4:112/vm-112-disk-1.raw' (backup=no)
INFO: exclude disk 'scsi3' 'storj3:112/vm-112-disk-0.raw' (backup=no)
INFO: exclude disk 'scsi4' 'hddbig:112/vm-112-disk-2.raw' (backup=no)
INFO: exclude disk 'sata0' 'storj2:112/vm-112-disk-0.raw' (backup=no)
INFO: exclude disk 'sata1' 'storj1:112/vm-112-disk-0.raw' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/112/2025-12-01T11:24:42Z'
INFO: started backup task 'f5759b78-671c-4b04-b3d8-708df7a3cd62'
INFO: resuming VM again
INFO: scsi1: dirty-bitmap status: OK (12.6 GiB of 250.0 GiB dirty)

It just gets halted for a very short time, but long enough to let Linux think that something was wrong. But there is no crash or anything else.

It effectively forces a full sync (using the qemu guest agent) and then halts execution for a few fractions of a second (until the snapshot is created) to get the disk image into a “more consistent” state.

In theory most filesystems are metadata crash-consistent, and the freeze can be disabled if it actually harms the workload (like some sub-millisecond high-performance database), doing only an external snapshot or triggering a sync + snapshot without the halt. But XFS especially (and ext4 to a lesser degree) can lose some uncommitted data and require a mandatory read-write fsck run to get back up after a restore.

If there’s the option to have a freshly synced filesystem and exactly zero uncommitted writes, and it only costs a couple of milliseconds (and a weird-looking performance counter), that’s a worthwhile trade-off for some.
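
For what it’s worth, if the freeze ever does become a problem, recent Proxmox versions can (as far as I know) keep the guest agent enabled but skip the fs-freeze/thaw step for a given VM, something like:

qm set 112 --agent enabled=1,freeze-fs-on-backup=0   # VM 112 from the log above; check that your PVE version supports the freeze-fs-on-backup flag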

2 Likes

No. It’s not normal. It’s abhorrent. (All the bullshit QEMU-based virtualizers do it, but this does not normalize the behavior. It’s more of a Stockholm syndrome.)

Either way, stop that backup. It does you no favors. See below.

Precisely. This is CYA on Proxmox’s (or, more specifically, QEMU’s) side. These shenanigans ensure block-level consistency (crash consistency, as you pointed out). It does nothing for application consistency. If you have a database open, it is going to be in a bad state in your backup. The blocks will be intact, sure, the backup is technically consistent, but the application state it captured isn’t. So the whole premise falls apart.

In other words, you are paying the application penalty (the hiccup) and not getting application-consistent backups. So, stop doing this.

If you want application consistent backup you either work with the application specific tools inside the VM (such as db dump → backup the dump) or shut down the VM and back it up that way.

As I alluded to above, this hiccup (the need to stop the vCPUs for snapshot creation) is a QEMU quirk. Other virtualizers, such as VMware, do not require this; they don’t need a state transition at the storage level that forces a vCPU stop, instead using a redo-log mechanism that allows the VM to keep running while the snapshot metadata switch happens. But that doesn’t help either way: you get no hiccup, yes, but the fundamental issue persists: there is still no application-level consistency.

So the best solution for you would be to make one VM backup, and then keep making in-guest file-level backups on an ongoing basis, if needed, ensuring proper application-level consistency.

See, a trade-off implies you forgo something and get something in return. Here, hiccuping the VM while not getting application-level consistency is only cons, no pros. You are only paying the penalty without getting anything in return.

My VMs don’t run databases. It’s just to secure the VM in case of a hard-drive crash, so I don’t lose all the data of the VM. It doesn’t bother me, and the only VM that is complaining is the Storj one, because it has a consistent “high” load. It’s easier for me to restore a backup from my PBS server to PVE than to reconstruct a VM by hand. (And I have had to do both already.)
I switched from “in-VM backup” to PBS for that reason.
And I could do a VM shutdown, but that is of no use for me, because I want to have them always online :slight_smile: so this is a me thing.

1 Like