ZFS speed and optimizations for VM

Debian 12 VM on qemu/kvm hypervisor (also Debian 12). The virtual disk is a zvol on the HV and formatted as ext4 inside the VM. The pool is made up of 3 RAIDZ2 vdevs, 6 drives each and has a SLOG (mirror of SATA SSDs). When the node VM was Debian 10, there were no load spikes. I can still go back to Debian 10 if I decide to. I upgraded to Debian 12 to see if it would change the performance (different caching etc) and it did, just not in the direction I wanted :slight_smile:
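
For reference, a layout like the one described would look roughly like this in zpool terms; the pool name and device names below are placeholders, not my actual disk IDs:

# 3x 6-wide RAIDZ2 plus a mirrored SLOG (placeholder pool/device names)
zpool create tank \
  raidz2 disk1 disk2 disk3 disk4 disk5 disk6 \
  raidz2 disk7 disk8 disk9 disk10 disk11 disk12 \
  raidz2 disk13 disk14 disk15 disk16 disk17 disk18 \
  log mirror ssd1 ssd2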

I do not want to dedicate that server to running just the node; I also want to run my own stuff on it, since it has enough free capacity for some of my VMs. Also, any trick I learn to make the VM run faster would be useful to me elsewhere, so it’s not so bad.

Cache disabled on the hypervisor? Or writeback or writethrough?
You could fill pages with that setting alone (if you don’t use “disabled”) to understand it correctly.

Ahhh, not good. Not good at all. That is at least 3 :leafy_green: leaves :wink:
What disks are we talking about? RAW?

Ahh, now it is getting even worse. But at least you got the right pool geometry with 6 drives per RAIDZ2 vdev, so you won’t suffer storage inefficiency, only read/write amplification.
I would say that the jump from mirror (50%) to 6-wide RAIDZ2 (66%) is not worth it for block storage, but you do you.

Well, that is doubly bad. First, it is a waste, since you don’t need sync writes, so there is no point in speeding them up with a SLOG. Just a waste of NAND. But you double your waste by mirroring them, even though a SLOG never gets read from unless there is a system halt. So unless you expect a system halt and your SLOG dying at the same time, it makes no sense to mirror them.
Totally not worth it for storj.

cache=“none” on the virtual disk settings

Yes, the zvol is passed to the VM as a block device, so the format is raw.
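
For context, the zvol itself was created along these lines (name, size and volblocksize here are placeholders, not my exact values):

# create a zvol and check its block size; volblocksize matters for RAIDZ write amplification
zfs create -V 500G -o volblocksize=16k tank/vm-node-disk
zfs get volsize,volblocksize tank/vm-node-disk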

Normally I prefer 6-drive RAIDZ2 over 3x mirrors because RAIDZ2 is guaranteed to survive two failures, while 3x mirrors can survive up to 3 failures but can also die after two failures if I’m unlucky (both copies in one mirror fail).
Also, until recently, Storj never had this much traffic. If I get the opportunity to borrow some hard drives I’ll remake the pool as mirrors. I have a spare drive, so I’ll be able to quickly rebuild the pool if one drive fails, hopefully before another drive fails, with no need to wait for a new drive to arrive.

Writes from VMs are sync (O_DIRECT bypasses the write cache), so they make use of the SLOG, and having a SLOG helps with that. The SLOG is a mirror because if there is a crash and the SSD fails, I would lose my data. Using a mirror for SLOG is recommended IIRC. L2ARC does not need redundancy (I’m not using it here, ARC gets a high enough hit rate).
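
To double-check how the zvol treats sync requests on the host, something like this shows the relevant properties (the dataset name is a placeholder):

# sync=standard honors flushes/O_SYNC from the guest, sync=always forces everything
# through the ZIL/SLOG, sync=disabled ignores sync requests entirely
zfs get sync,logbias tank/vm-node-disk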

I have other VMs there as well; the server is not dedicated to Storj.

Great

I think you mean a 2-way mirror, since a 3-way mirror can survive any two disks failing.

I hear that argument a lot when it comes to advantages of RAIDZ over mirrors.
Hot take, with no studies to back it up: a mirror is still safer if you use two different brands.
The reason is that the probability of 3 drives in a RAIDZ2 going bad due to a bad batch, a firmware error or whatnot is higher than two drives from two different batches in a mirror failing at the same time. Of course you could achieve the same in RAIDZ2 by using 3 different batches. The question is whether you get a good deal on 3 different brands.
But I get your point and agree.

It is used for actual sync writes. O_DIRECT alone does not force every write to be sync; that would also require O_DSYNC.

Sure, the question is how high is the chance of that actually happening? TrueNAS uses one drive for SLOG in their machines.

Hmm… I don’t know what is wrong with your setup, but something is way off. Maybe just your benchmark measurements :slight_smile:

Yes, I meant that I could arrange 6 drives as a single RAIDZ2 vdev or as three 2-drive mirror vdevs. My drives are all the same model; when I bought them (3 separate times, 3 different capacities), those drives, while still being server-grade, were cheaper than the alternatives.
Using different brand drives would be better for reliability, or at least getting drives from different batches.

It possibly was the scheduler. On Debian 10, using mq-deadline inside the VM probably helped, but using it on Debian 12 probably caused the load spikes. I also changed the scheduler to none on the host drives (apparently zfs has its own scheduler).
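
For reference, this is roughly how I check and switch schedulers (device names will differ on other systems):

# inside the VM: check and set the scheduler of the virtual disk
cat /sys/block/vda/queue/scheduler
echo mq-deadline > /sys/block/vda/queue/scheduler
# on the host: ZFS has its own I/O scheduler, so none is commonly used on the member disks
echo none > /sys/block/sda/queue/scheduler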

Or maybe not, maybe there is something else. The benchmark program shows 150-190MB/s, but it does not take delays and the selection algorithm into account. The new algorithm makes it difficult to figure out whether my node is working as it should.

Looking at iostat, the write speed that shows up on the virtual disk inside the VM also shows up on the SLOG SSDs.
So, if I understand this correctly, when the node writes data in async mode inside the VM, the data gets stored in RAM for some time and then written to the virtual disk. When this happens, the host probably sees those writes as sync, so it uses SLOG.
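
This is what I watch on the host to see the per-vdev write rates, including the log devices (the pool name is a placeholder):

# -v lists every vdev, including the logs section, refreshed every second
zpool iostat -v tank 1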

To me it looks like you have cache set to directsync (the word cache is a little bit misleading here, since it is not a cache, and I also don’t know if naked KVM calls it something different than Proxmox does).

  • When a node writes data async inside the VM, the data gets stored in TXGs in RAM, the fs reports the write as finished, and ZFS writes it to the pool. No ZIL involved.

  • When a node writes data sync inside the VM, the data gets stored in TXGs in RAM and written to the ZIL on your pool; the fs reports the write as finished, and the data gets written again to disk (CoW). So it is essentially written twice to disk.

  • When a node writes data sync inside the VM and you have a SLOG, you have just moved the ZIL from the pool to the SLOG. The data gets stored in TXGs in RAM and written to the ZIL on your SLOG; the fs reports the write as finished, and the data gets written to your pool.

For the two sync variants, if you have a crash in between, ZFS will reread the write from your ZIL to write it to your pool.

So for async writes, your SLOG (which is a ZIL, just not on the pool but on a separate device) should not be touched!
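
One way to sanity-check that on the host is the ZIL kstats that OpenZFS on Linux exposes; if the commit counters keep climbing while the guest only writes async, something is turning those writes into sync requests (counter names may vary slightly between versions):

# ZIL commit counters; they should stay flat during purely async guest writes
grep zil_commit /proc/spl/kstat/zfs/zil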

I would not worry too much about the performance itself. The bigger problem of :leafy_green: nodes (or IT systems in general) is that everything works fine until one day it doesn’t. That is why I would be very keen on finding out the issue if I were you.

That is what I don’t like about :leafy_green: IT systems. IT has become complex enough, no need to make my life even harder. This goes for everything, even simple stuff like I would never virtualize a Firewall or run Apps on TrueNAS or bother with CustomROMs or Jailbreaks. I am over 30 now, stuff like that I am so over with :wink:

<driver name="qemu" type="raw" cache="none" io="native" discard="unmap"/>

Libvirt documentation says that cache="none" opens the zvol with O_DIRECT. I know that this bypasses the cache. I sometimes use oflag=direct with dd when trying to measure speed, especially on servers with lots of RAM. Without it, the data gets written very fast to RAM and then dd freezes for a while until everything is written to the disk; even Ctrl+C does not work. It’s very annoying when I’m just trying to measure throughput.

It depends. I would not virtualize the main router/firewall, but other ones can work just fine. I work for a cloud service provider and we have a lot of VMs with pfsense or similar acting as routers for other customer VMs.

I dislike complicated things too, but sometimes it is unavoidable. Ideally, I would have separate hardware that I would use just for the node and so on, but I don’t. I’m trying to use the hardware for other things as well.

The way I understand it is this:
Node writes data to the ext4 filesystem as async. The data gets written to RAM, the fs call returns and the OS starts writing the data to the virtual disk. After some time the OS updates the filesystem metadata and syncs it to the disk. This then causes the data to be written as some form of sync in the host.

I think I found the reason for my load spikes when running the node without the sync setting. It’s very similar to this bug: 217965 – ext4(?) regression since 6.5.0 on sata hdd
However, for me, mounting the filesystem with stripe=0 does not really help.
Still, ext4_mb_regular_allocator, ext4_mb_good_group and ext4_get_group_info use a lot of CPU, so it’s really similar to what is described there. The bug looks like it was introduced in kernel version 6 (and then made worse in 6.5), so Debian 10 does not have it.
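
For anyone who wants to check the same thing, this is roughly how I look at the stripe width ext4 recorded and how the workaround from the bug report is applied (device and mountpoint are placeholders):

# show the RAID stride / stripe width ext4 picked up at mkfs time
tune2fs -l /dev/vda1 | grep -iE 'stride|stripe'
# override it at mount time, as suggested in the bug report
mount -o remount,stripe=0 /mnt/storagenode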

Since I have a backup of the system partition, I can just revert to Debian 10 to avoid this.

EDIT: or maybe not, I have to investigate this further.

The way it looks with dd, adding oflag=direct either bypasses the write cache or makes the system flush it more often.

The alternative was ZFS on top of ZFS. I thought about doing that and then decided it would be stupid. Then again, maybe it would have been better.

dd is another layer of complexity, because you could set something wrong or the program could misbehave. I recently had this problem with iperf: I could not achieve 25 Gbit/s and thought it was a network error. Later I found out that the problem was iperf being single-threaded.

ZFS on top of ZFS is definitely worse. But that is not the only alternative. Datasets over NFS works great.

Again, keep it as simple as possible. First run the benchmark directly on ZFS. Then you will see that sync can never, ever, ever be faster than async. It is impossible. You will also have established a baseline and can see how much performance you lose with KVM. If it is too much, you know something is off.
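
A rough sketch of such a baseline with fio, run directly against a test dataset on the pool before KVM is involved at all (path and sizes are placeholders):

# buffered/async write baseline
fio --name=async --directory=/tank/fiotest --rw=write --bs=1M --size=4G --ioengine=psync
# same test, but with every write fsync'd, so it goes through the ZIL/SLOG
fio --name=sync --directory=/tank/fiotest --rw=write --bs=1M --size=4G --ioengine=psync --fsync=1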

Storj said that NFS was not recommended because of problems with SQLite. Also, using NFS means one of two things: either the VM and the data are on separate servers, or I am connecting a virtual disk from the host to the VM over the network (oh, and the VM and host are on separate VLANs). So that would be even worse than ZFS on top of ZFS.

It depends on how the program behaves. Really. For example, if I copy files, use dd or something else, async will be faster, even if it means that the file copy program will freeze for a while until the dirty blocks are written to the drives.
A Storj node, however, wants low latency. If the IO system freezes for a bit while trying to flush all the dirty blocks (in my case likely due to a bug in ext4), the satellite reacts by dropping the ingress, so a freeze of 1 minute or less results in 9 additional minutes without ingress.

The Storj benchmark program shows over 100MB/s, but it does not take latency into account or how the satellite would behave in reality.

As for the freeze, try this with dd.
dd if=/big/file of=/test/path bs=1M status=progress
dd if=/big/file of=/test/path bs=1M status=progress oflag=direct
For best results, the /big/file should be larger than the amount of RAM. If the filesystem you use does not use compression or ignore zeros, you can use /dev/zero as the source.

What you should see: without oflag=direct dd writes fast (indicated speed higher than the destination disk can do), then fills the write cache up and freezes for a while, then resumes writing. If you try to Ctrl+C out of it, it will freeze for a while before exiting.
With oflag=direct dd writes slower (at the speed of the destination disk), but there are no freezes. If you try to Ctrl+C out of it, it will exit immediately.
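
While the first dd runs, the buildup and flush of the write cache can be watched from another terminal:

# dirty pages pile up quickly, then Writeback spikes while dd stalls
watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'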

Sure. That is why I would leave the DB on the local disk but put the data on a dataset.

Not really. You can also access a local NFS share. Or even better, if it is on the same machine, LXC can directly access datasets and avoid block storage entirely.

No! It really can’t. That is like saying my car can drive faster than the speed of light. AT BEST I can achieve speed of light with my car NEVER EVER faster!

That “freeze” time is the time you get with sync to begin with.

Again, my car can achieve the speed of light, never faster.

It was not possible at first, later Storj made it an option.

So let’s do a thought experiment. I start dd (or the node or any other program) and make it write in async mode. It writes as fast as it can (close to RAM speed), while the OS writes the data to the disk at some slower speed (usually the disk is slower than RAM).
Of course, since the program continues writing to RAM faster than the OS can write to disk, at some point RAM (or whatever portion of RAM is used to store dirty data) fills up. What happens then? The program cannot write anymore, since there is no space in RAM to put the data. So at that point the OS stops the program from writing until it can flush enough of the dirty data to disk (which may take a while) to free up some space in the buffer. The program freezes until that happens.
It does not really matter for something like rsync or dd. If it freezes, then it freezes, waits for a while and continues copying at whatever speed.
The problem for the node is that the satellite notices that the node is losing pretty much all races for the last few seconds and stops giving it new data. Only after 10 minutes or so, the satellite starts giving new data to the node again. So, even if the freeze lasted a few seconds, the drop in ingress lasts 10 minutes.

This can easily be seen with dd, especially if you use it to copy a file from very fast storage (or /dev/zero) to slow storage (HDD) and the file is bigger than the amount of RAM in the system.

Fine, let’s say we just look at ZFS to make stuff easier.

Not exactly. Async writes will go into the TXG first, then get written from the TXG to the pool.
zfs_txg_timeout is 5 seconds by default, so the ZIO scheduler aims for that.
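
The current value can be checked (and changed at runtime) via the module parameter; 5 is the default:

cat /sys/module/zfs/parameters/zfs_txg_timeout
# e.g. shorten the TXG window to 2 seconds for smaller, more frequent flushes
echo 2 > /sys/module/zfs/parameters/zfs_txg_timeout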

Except for the “no space in RAM” part, which I address later, pretty much yes.

And how do you think that would be different with sync?

With sync, the only thing that changes (besides the double-write ZIL, but we will leave that out for simplicity) is that you can’t profit from that write cache and get the worse speed of a full cache from the very beginning. If you have a SLOG, you can get (again, for 5 seconds) the speed of the SLOG.
Then you are comparing SLOG speed with RAM. That is why, at the very best, if your SLOG can handle the incoming sync writes, you get parity with async. Parity, never faster! This is stated multiple times in the ZFS docs.

Let’s do this with real numbers. Let us assume we have a constant 1GB/s of incoming data.
Your RAM does 40GB/s. Your drives only 100MB/s. You add a SLOG with 500MB/s.

For async: the TXG will take that 1GB/s with no problem in the beginning. But soon the cache will exceed the goal of only holding the last 5 seconds, since the HDDs are so slow.
It will then slow down to the 100MB/s the drives are capable of. But remember, we still have 5GB in cache thanks to the TXG that we would not have otherwise.

For sync: writes go to the TXG, but will only be acknowledged after the data really is on the pool, so the speed is 100MB/s from the beginning.

For sync with SLOG: writes go to the TXG, but will only be acknowledged after the data really is on the ZIL (SLOG), so the speed is 500MB/s for the first 5 seconds, then 100MB/s.

Of course this would be the absolute worst case scenario for TXG. Even better would be:

  1. Some kind of wave breaker.
    Imagine incoming data at 1GB/s for one second, with a window of exactly one second to accept it, and then nothing for 10 seconds.
    Async would get 1GB every 10 seconds, sync only 100MB. Sync with SLOG would get 500MB.
  2. Simply faster.
    Imagine incoming data at 1GB/s for one second, then nothing for 10 seconds.
    Async is faster, since the ack goes back when the data is in the TXG, while for sync the ack comes only after we copied the data from the TXG to the ZIL. With a SLOG, you have the ZIL on the SLOG, so it is slightly faster but still slower than RAM.

RAM is hopefully not the limiting factor here. If you have a 1Gbit NIC and ZIO really does a bad job so that it has to hold 10 seconds of data, that would only be 1.25GB of RAM. If you have less than 1.25GB (remember, ZFS can just shrink the ARC), you have other problems :smile:

There is similar logic behind why TrueNAS sells their 500GB SLOG SSDs overprovisioned to 16GB. Both the ZIL and a TXG will never be much bigger than your NIC speed × 5 seconds.

This works well for most programs, but may not work as well for the node specifically, especially with the new selection algorithm.
Or at least that is how it works with ext4 (which is what is inside my VM).

Based on my experience with the node and other programs (where this does not really matter):
In the beginning, the node accepts data at 1GB/s and writes it to memory. The OS starts writing it to disk. After a while the OS notices “oops, the write buffer is almost full, I have to write the data to disk NOW and stop whatever is trying to overfill the buffer”, and so it does. The node stops for a short time while the OS frees up the write buffer.
The satellite notices - “this node is too slow, it’s losing all races, stop all traffic to it for 10 minutes”.
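
The size of that guest-side write buffer is what the kernel dirty-page limits control, so in theory the stall can be shortened (at the cost of earlier writeback) by capping them; the values below are only an illustration, not a recommendation:

# inside the VM: cap dirty data so a flush never has gigabytes to work through (example values)
sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64 MiB
sysctl -w vm.dirty_bytes=268435456             # block writers at 256 MiB of dirty data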

Just found the quote:

Third, the ZIL, in and of itself, does not improve performance.

Source: To SLOG or not to SLOG: How to best configure your ZFS Intent Log - TrueNAS - Welcome to the Open Storage Era

My car can only reach the speed of light, never faster :slight_smile:

That should be 100% the same for sync, just that the OS says right from the beginning “oops, there is no write buffer, I have to write the data to disk NOW and stop whatever is trying to overfill the nonexistent buffer”.

BTW I am not denying that it could work for you in your special :leafy_green: edge case.
But that is an ext4 problem and has nothing to do with ZFS or the SLOG.
To me that is just further proof of why you should try to avoid :leafy_green:

You’re quoting that out of context, as the very next words are “The ZIL sits in your existing data pool by default”. It’s only saying that you always have a ZIL, and that it’s normally just on your HDDs. And obviously any article about ZIL performance tuning is about having it on an SSD… and writes to ZIL-on-SSD are intended to be more performant than ZIL-on-HDD.

But yeah, in normal operation ZILs are write-only: you’re only speeding up the logic that protects you from failures. Like all ZFS performance content, the first rule is always “Give ARC more RAM” :slight_smile:
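
For the “give ARC more RAM” part, the current size and hit rate are easy to check; arc_summary ships with the OpenZFS userland tools (at least in recent versions):

# ARC size, target and hit/miss counters
arc_summary | head -n 40
grep -E '^(size|c_max|hits|misses)' /proc/spl/kstat/zfs/arcstats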

(Edit: are we supposed to be over here?)

And instead of the node freezing for some seconds after writing gigabytes of data, it freezes for 1ms until the write of one file is flushed (and ends up in the SLOG of the host). 1ms may not make me lose a race, but a few seconds will.
That’s the difference.

Yes, and if the rates were higher, I could buy another server just for Storj and store the data without any virtualization, using a ZFS dataset etc.
Ideally I should probably buy a bunch of 1U servers and run multiple nodes, each on a separate server with just a few (or even one) drives. Now that would be fast. I would need another rack though.

On the other hand, this situation is not so bad. I now know about the ext4 bug and may even learn something else, some way to optimize this. That may be useful for me elsewhere.

The way the new node selection algorithm works is that multiple failed races in a row result in up to 10 minutes with almost zero ingress.

So, in this case it would be:
Node A writes for 1 second at 1GB/s, then freezes for 9 seconds, then waits up to 10 minutes until the satellite resumes sending data.
Node B writes at a consistent 100MB/s.