CPU Pressure stalls caused by Tokio in the storagenode application (Linux)

Right, but you don’t need ongoing periodic backups for that; you can back up once, so if the VM crashes you restore it, and the data it operates on can live elsewhere. For storagenode you definitely don’t need ongoing backups. Nothing changes in the VM; all the data is outside.

Well, they hiccup, so technically they are not always online… And the busier the VM, the longer the hiccup. I guess if this doesn’t cause any issues besides cosmetic log entries, nothing needs to change.

rant

If I were you I would switch to bhyve on FreeBSD on ZFS. Snapshots are sane, no hiccups, you get jails instead of bloated containers, and no GPL anywhere. Bliss.

It’s quite baffling that people keep using popular tools in spite of there being objectively better alternatives available. Stuff like Proxmox, unRaid, Synology, Docker… people choose to suffer (and in some cases pay money for it!) on purpose, of their own volition! I can never understand that.

It may work in a VM, and it does. But when it causes problems for the host, then this setup likely has issues. So you need to exclude abstraction layers one by one to figure out the root cause.

I often see here a combination of a Windows VM on a Linux host (usually VMware or Proxmox), which almost always causes issues. A Linux VM on a Linux host comes up far less often on the forum: either it’s not a popular solution, or it simply doesn’t have the problems a Windows VM does.

In Backup Settings, Advanced, enable the “fleecing” option.

ZFS snapshots work without hiccups on QEMU as well. I use them a lot; crash consistency is good enough for the vast majority of stuff, and zfs send/recv works great, even for VMs that have lots of files. It’s also good that I can just set up the script and it backs up all of my VMs (the ones that I want backed up, anyway) without having to do anything custom inside the VM.

Imagine backing up a storage node (not that it needs it, but just for the sake of argument as an example of a VM that has lots of small files) with something like rsync. It would take forever. OTOH, zfs send/recv would be fast, especially the incremental backups. If I wanted to, I could probably back the node up every 30 minutes or so.
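A minimal sketch of what such a snapshot-and-send script might look like. This is an illustration, not anyone’s actual script: the dataset name (`tank/vm-disks`), backup host, and target dataset are all made-up placeholders, and error handling is kept minimal.

```shell
#!/bin/sh
# Incremental zfs send/recv backup sketch (placeholder names throughout).
set -eu

DATASET="tank/vm-disks"
TARGET="backup/vm-disks"
NOW="$(date +%Y%m%d-%H%M)"

zfs snapshot "${DATASET}@${NOW}"

# Previous snapshot of this dataset, if any (second-to-last by creation time).
PREV="$(zfs list -t snapshot -o name -s creation -H "${DATASET}" | tail -n 2 | head -n 1)"

if [ "${PREV}" != "${DATASET}@${NOW}" ]; then
    # Incremental send: only blocks changed since ${PREV} cross the wire.
    zfs send -i "${PREV}" "${DATASET}@${NOW}" | ssh backuphost zfs recv -F "${TARGET}"
else
    # First run: full send.
    zfs send "${DATASET}@${NOW}" | ssh backuphost zfs recv "${TARGET}"
fi
```

The incremental branch is what makes frequent backups cheap: the transfer size depends on how much changed, not on how many files the VM contains.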

If the data is not on the VM, then wherever the data is (probably on another VM), it still needs to be backed up.

I use a Linux VM inside a Linux host, it works reasonably well. I had some problems with performance (mostly evident during the tests with a customer that uploaded a lot of short-lived data), though near the end of the test I seem to have solved most of them.

Thank you, I didn’t see the Advanced tab. But at the same time all drives are full at the moment, so I don’t know if this would work 😭
I have to upgrade my server and add some NVMe storage there. But RAM is so expensive; it has almost doubled in cost.

You conveniently ignored this:

You can just have the VMware behaviour of snapshotting a block dev in whatever state it’s in, with or without a sync. The halt is an optional thing that does improve consistency for basically anything except a database.
It is a trade-off. You pay for a microsecond hiccup and gain (guaranteed) guest-filesystem-level consistency, instead of a “hopefully mostly consistent” filesystem if you sync without freezing (like VMware does), or a “not at all consistent” filesystem if you just snapshot the block dev.

Neither ZFS, nor VMware nor qemu natively guarantee that the state when the sync in the VM is done == the state of the finished snapshot. This is a way to make sure there’s nothing else happening until the ZFS txg is actually closed and in the quiescing state with a new one open.

Is application aware backup better? YES, ESPECIALLY FOR A DATABASE. Can everyone afford a Veeam license? No. Does this improve the consistency of a filesystem you might be writing your pg_dumps or continuous logs to? Yes.

It can even be used to trigger a db dump before creating a snapshot with qemu-agent-command to have custom application aware backups.
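A rough sketch of that idea using the Proxmox CLI, where `qm guest exec` and `qm guest cmd` wrap the guest agent. Everything here is a placeholder or assumption: the VM id (101), the database name, the dump path, and the dataset name are invented for illustration, and the guest is assumed to run qemu-guest-agent and PostgreSQL.

```shell
#!/bin/sh
# Sketch: application-aware backup via the QEMU guest agent (Proxmox CLI).
# VM id, database, dump path, and dataset are placeholders.
set -eu
VMID=101

# 1. Ask the guest to dump the database to its own disk first.
qm guest exec "${VMID}" -- /bin/sh -c 'sudo -u postgres pg_dump mydb > /var/backups/mydb.sql'

# 2. Flush and freeze the guest filesystems, snapshot on the host, thaw.
qm guest cmd "${VMID}" fsfreeze-freeze
zfs snapshot "tank/vm-${VMID}-disk-0@backup-$(date +%s)"
qm guest cmd "${VMID}" fsfreeze-thaw
```

The dump lands inside the frozen filesystem, so the snapshot captures a clean, application-consistent copy of it without any extra tooling on the host.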

This is not what VMware does. VMware maintains block level consistency without needing VCPU halt.

Did you read my comment above at all?

Apples and oranges. Every database comes with dump tools.

Yes, this is the way.

But it has nothing to do with the hiccup we’re discussing.

Yes. Block level consistency. Just like qemu snapshots without the halt. Just because two things are (self) consistent doesn’t mean they are consistent with one another.

Qemu can create a perfectly block level consistent snapshot without halting the VM. This is about making sure the block level consistent state in the snapshot is the exact same as the VM internal disk state the instant sync() returned, with not a single I/O between sync returning and the hypervisor receiving that information and starting the snapshot process (even if that’s just 3-4 syscalls back and forth).

This is a qcow/qed limitation that required a pivot in the first place. Using QEMU on any sensible storage platform with actual snapshot support, like Ceph, ZFS, or even LVM, where the pivot happens at the storage layer (or rather not at all), isn’t affected by this. Even using qcow2 file-internal snapshots (instead of qcow/qed volume chains) doesn’t require creating a new layer.

You’re mixing three different “consistencies” here:

  1. Guest filesystem crash consistency (what sync/fsfreeze give you).
  2. Hypervisor block/qcow2 metadata consistency (what Proxmox/QEMU is pausing for).
  3. Application-level consistency (databases persisting a coherent state, etc.).

The Proxmox/QEMU “hiccup” we’re talking about is 2), not 1), and definitely not 3).

  • A guest sync() with no further writes already gives you a crash-consistent filesystem image. No vCPU halt is required for that.
  • VMware achieves block-level consistency without stopping vCPUs by doing what I described earlier: redirect-on-write / redo-log so the on-disk point-in-time is well defined while the VM keeps running.
  • QEMU, in the Proxmox backup path above, instead chooses to freeze the VM while it twiddles its own block-layer metadata. That’s the stall showing up as CPU pressure. It’s an implementation issue with QEMU, not a fundamental requirement for crash-consistent backups.

Now, even if we accept your idealized sequence:
A. Guest sync() finishes.
B. Zero extra I/O.
C. Hypervisor starts the snapshot immediately.

you still only get 1) and 2). Application state is still whatever was in RAM at that instant. Databases, queues, etc. are not magically consistent unless they participate (dump, checkpoint, what have you). That was my original point: the guest is paying the pause penalty, but the user is not actually getting application-consistent backups in return.

In the context of this thread:

  • The storagenode’s actual data lives outside the VM.
  • The VM backup is periodic, causes visible hiccups, and buys neither application-level consistency nor anything they actually need for that workload.

Pure pain, no gain.

To reiterate:

Yes, QEMU could in theory implement a non-pausing, redo-log style snapshot like VMware.

No, that’s not what Proxmox’s current QEMU backup mode is doing.

In short, the halt is pure downside: cosmetic logs plus stalls, zero real benefit to the application.

My advice stands: either take a one-time VM backup for disaster recovery and then stop, or move to host-level/ZFS-level snapshots plus in-guest, application-aware backups, which is what actually matters.

Back to the original point, sort of.

What does a CPU pressure stall even mean? And how do you detect it?
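For reference, Linux kernels since 4.20 expose this via the Pressure Stall Information (PSI) interface under `/proc/pressure`: roughly, the averages report the share of recent time in which tasks were stalled waiting for CPU. A quick sketch of reading it (the script falls back to a sample line if PSI isn’t available):

```shell
#!/bin/sh
# CPU pressure stall info (PSI) lives in /proc/pressure/cpu on kernels >= 4.20.
# A line looks like:
#   some avg10=1.23 avg60=0.87 avg300=0.50 total=123456789
# avg10 is the share (%) of the last 10s in which at least one task was
# stalled waiting for CPU; total is cumulative stall time in microseconds.

# Use the real file if present, otherwise a sample line for illustration.
line="$( [ -r /proc/pressure/cpu ] && head -n1 /proc/pressure/cpu \
         || echo 'some avg10=1.23 avg60=0.87 avg300=0.50 total=123456789' )"

# Extract the 10-second average.
avg10="$(echo "$line" | sed 's/.*avg10=\([0-9.]*\).*/\1/')"
echo "cpu pressure avg10=${avg10}%"
```

There are matching files for `memory` and `io` pressure in the same directory, with the same format.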

As for operational setup, I run in a Linux VM, and all my node Docker containers are in the same VM. I have like… 6 in there. It’s all one docker compose file. Easier to manage and start/stop. Works fine.

Hyper-V includes a memory snapshot and snapshots of all virtual disks when you take a VM snapshot, so the snapshot is fully consistent: the VM continues from the exact point where it was. It even works for databases, though it’s better to do it the way Veeam does.
But it seems QEMU works differently.
Also, I don’t think there would be “CPU pressure” (at least I didn’t see anything like that using this hypervisor for years).

Of course if

then it will not be consistent.

☝️
This is the main question. I think it’s because of VMs.

Qemu does not halt the CPU. It just instructs the VM to freeze disk IO and flush its filesystem metadata to disk. Then the snapshot is created. After that, the VM is told that disk IO may resume.

It is up to the VM if it actually does this or not.

This has the effect that the backup doesn’t need to replay the journal when it is started after a restore. You may argue that this is not needed, or perhaps good to use only if the data is journaled too, which is not the default.

Was (is?) this sometimes buggy? Oh yes. Can frozen disk I/O look like a stalled CPU? Also yes.
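The freeze/flush/thaw dance described above can also be driven by hand. A sketch with libvirt, assuming the guest runs qemu-guest-agent; the domain name and dataset are placeholders:

```shell
#!/bin/sh
# Manual freeze -> snapshot -> thaw with libvirt (placeholder names).
# Requires qemu-guest-agent running inside the guest.
set -eu

virsh domfsfreeze storagenode-vm            # guest flushes and freezes filesystems
trap 'virsh domfsthaw storagenode-vm' EXIT  # always thaw, even if the snapshot fails
zfs snapshot "tank/storagenode-vm@$(date +%s)"
```

The `trap` matters: if the snapshot command errors out while the guest is frozen, I/O inside the VM stays stuck until something issues the thaw.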

I think you are right: likely when fsfreeze blocks writes, all I/O is stalled for the duration; the guest thinks storage is stuck (which is not wrong: it cannot flush pages, journal commits take forever, etc.), and everything is blocked.

Ultimately, not backing up the storagenode VM is the right solution here. There is no reason to.

Also, my claim in the comment above that QEMU does not support journal-based incremental block tracking is seriously outdated as well; my apologies. Here is the documentation stating otherwise: Dirty Bitmaps and Incremental Backup — QEMU documentation

So yeah, no need to halt the vCPU, but the disk I/O stall feels like one: everything is stuck on a syscall.

Which usually causes no issues, except when the “duration” is too long, due to the host being under pressure or short on resources.

I feel a bit bad for OP. We’ve turned this into a “backup technologies for virtual environments” thread :squinting_face_with_tongue:

Not really.

I think their problem was/is caused by the backup running on an under-resourced server (all disks being nearly full, high CPU usage).

This causes the “freeze” to take too long, which causes weird shit to happen in the VM.

Whether you backup a VM or not is another matter.

There are additional quirks with backups in proxmox, especially with proxmox backup server.

It doesn’t do snapshots the way you’d think. Normally you’d use LVM, Ceph, or ZFS, which natively support snapshots, but that’s not what’s happening here.

Instead, there is a layer inserted into QEMU’s storage stack: during a backup, when the VM writes something, that part of the disk is first read and sent to the backup server, committed there, and only then is the write allowed to proceed. On a loaded backup server this can take a good while. Sure, it’s COW snapshots, but write performance may be heavily affected. The workaround is a so-called “fleecing” disk that stores the blocks locally on the host and transfers them to the backup server later. This works well in practice.
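If I recall the option correctly (it landed around Proxmox VE 8.2), fleecing can also be enabled per backup run on the command line; the VM id and storage ids below are placeholders:

```shell
# Sketch: one-off backup with a local fleecing disk, so guest writes during
# the backup are buffered on local storage instead of waiting on the backup
# server. "101", "pbs-store", and "local-zfs" are placeholders.
vzdump 101 --storage pbs-store --fleecing enabled=1,storage=local-zfs
```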

All in all, you can do snapshot style backups with storage that doesn’t natively support snapshots, but it may be quite slow.

Of course, there’s no need to back up a Storj node. If it fails, just replace it.

1 Like