ZFS speed and optimizations for VM

Debian 12 VM on qemu/kvm hypervisor (also Debian 12). The virtual disk is a zvol on the HV and formatted as ext4 inside the VM. The pool is made up of 3 RAIDZ2 vdevs, 6 drives each and has a SLOG (mirror of SATA SSDs). When the node VM was Debian 10, there were no load spikes. I can still go back to Debian 10 if I decide to. I upgraded to Debian 12 to see if it would change the performance (different caching etc) and it did, just not in the direction I wanted :slight_smile:
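
For reference, a layout like the one described would look roughly like this in zpool terms; the pool name and device names below are placeholders, not my actual disk IDs:

# 3x 6-wide RAIDZ2 plus a mirrored SLOG (placeholder pool/device names)
zpool create tank \
  raidz2 disk1 disk2 disk3 disk4 disk5 disk6 \
  raidz2 disk7 disk8 disk9 disk10 disk11 disk12 \
  raidz2 disk13 disk14 disk15 disk16 disk17 disk18 \
  log mirror ssd1 ssd2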

I do not want to dedicate that server to running just the node; I also want to run my own stuff on it, since it has enough free capacity for some of my VMs. Also, any trick I learn to make the VM run faster would be useful to me elsewhere, so it’s not so bad.

Cache disabled on the hypervisor? Or writeback or writethrough?
You could fill pages with that setting alone (if you don’t use “disabled”) to understand it correctly.

Ahhh, not good. Not good at all. That is at least 3 :leafy_green: leaves :wink:
What disks are we talking about? RAW?

Ahh, now it is getting even worse. But at least you got the right pool geometry with 6 drives per RAIDZ2 vdev, so you won’t suffer storage inefficiency, only read/write amplification.
I would say that the jump from mirror (50%) to 6-wide RAIDZ2 (66%) is not worth it for block storage, but you do you.

Well, that is doubly bad. First, it is a waste, since you don’t need sync writes, so there is no point in speeding them up with a SLOG. Just a waste of NAND. But you double your waste by mirroring them, even though a SLOG never gets read from unless there is a system halt. So unless you expect a system halt and your SLOG dying at the same time, it makes no sense to mirror them.
Totally not worth it for storj.

cache=“none” on the virtual disk settings

Yes, the zvol is passed to the VM as a block device, so the format is raw.
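
For context, the zvol itself was created along these lines (name, size and volblocksize here are placeholders, not my exact values):

# create a zvol and check its block size; volblocksize matters for RAIDZ write amplification
zfs create -V 500G -o volblocksize=16k tank/vm-node-disk
zfs get volsize,volblocksize tank/vm-node-disk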

Normally I prefer 6-drive RAIDZ2 over 3x mirrors because RAIDZ2 is guaranteed to survive two failures, while 3x mirrors can survive up to 3 failures but can also die after two failures if I’m unlucky (both copies in one mirror fail).
Also, until recently, Storj never had this much traffic. If I get the opportunity to borrow some hard drives I’ll remake the pool as mirrors. I have a spare drive, so I’ll be able to quickly rebuild the pool if one drive fails, hopefully before another drive fails, with no need to wait for a new drive to arrive.

Writes from VMs are sync (O_DIRECT bypasses the write cache), so they make use of the SLOG, and having a SLOG helps with that. The SLOG is a mirror because if there is a crash and the SSD fails, I would lose my data. Using a mirror for SLOG is recommended IIRC. L2ARC does not need redundancy (I’m not using it here, ARC gets a high enough hit rate).
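
To double-check how the zvol treats sync requests on the host, something like this shows the relevant properties (the dataset name is a placeholder):

# sync=standard honors flushes/O_SYNC from the guest, sync=always forces everything
# through the ZIL/SLOG, sync=disabled ignores sync requests entirely
zfs get sync,logbias tank/vm-node-disk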

I have other VMs there as well; the server is not dedicated to Storj.

Great

I think you mean a 2-way mirror, since a 3-way mirror can survive any two disks failing.

I hear that argument a lot when it comes to advantages of RAIDZ over mirrors.
Hot take, with no studies to back it up: a mirror is still safer if you use two different brands.
The reason is that the probability of 3 drives in a RAIDZ2 going bad due to a bad batch, a firmware error or whatnot is higher than two drives from two different batches in a mirror failing at the same time. Of course you could achieve the same in RAIDZ2 by using 3 different batches. The question is whether you get a good deal on 3 different brands.
But I get your point and agree.

It is used for actual sync writes. O_DIRECT alone does not force every write to be sync; that would also require O_DSYNC.

Sure, the question is how high is the chance of that actually happening? TrueNAS uses one drive for SLOG in their machines.

Hmm… I don’t know what is wrong with your setup, but something is way off. Maybe just your benchmark measurements :slight_smile:

Yes, I meant that I could arrange 6 drives as a single RAIDZ2 vdev or as three 2-drive mirror vdevs. My drives are all the same model; when I bought them (3 separate times, 3 different capacities), those drives, while still being server-grade, were cheaper than the alternatives.
Using different brand drives would be better for reliability, or at least getting drives from different batches.

It possibly was the scheduler. On Debian 10, using mq-deadline inside the VM probably helped, but using it on Debian 12 probably caused the load spikes. I also changed the scheduler to none on the host drives (apparently zfs has its own scheduler).
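
For reference, this is roughly how I check and switch schedulers (device names will differ on other systems):

# inside the VM: check and set the scheduler of the virtual disk
cat /sys/block/vda/queue/scheduler
echo mq-deadline > /sys/block/vda/queue/scheduler
# on the host: ZFS has its own I/O scheduler, so none is commonly used on the member disks
echo none > /sys/block/sda/queue/scheduler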

Or maybe not, maybe there is something else. The benchmark program shows 150-190MB/s, but it does not take delays and the selection algorithm into account. The new algorithm makes it difficult to figure out whether my node is working as it should.

Looking at iostat, the write speed that shows up on the virtual disk inside the VM also shows up on the SLOG SSDs.
So, if I understand this correctly, when the node writes data in async mode inside the VM, the data gets stored in RAM for some time and then written to the virtual disk. When this happens, the host probably sees those writes as sync, so it uses SLOG.
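
This is what I watch on the host to see the per-vdev write rates, including the log devices (the pool name is a placeholder):

# -v lists every vdev, including the logs section, refreshed every second
zpool iostat -v tank 1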

To me it looks like you have cache set to directsync (the word cache is a little bit misleading here, since it is not a cache, and I also don’t know if naked KVM calls it something different than Proxmox does).

  • When a node writes data async inside the VM, the data gets stored in TXGs in RAM, the fs reports the write as finished, and ZFS writes it to the pool. No ZIL involved.

  • When a node writes data sync inside the VM, the data gets stored in TXGs in RAM and written to the ZIL on your pool; the fs reports the write as finished, and the data gets written again to disk (CoW). So it is essentially written twice to disk.

  • When a node writes data sync inside the VM and you have a SLOG, you have just moved the ZIL from the pool to the SLOG. The data gets stored in TXGs in RAM and written to the ZIL on your SLOG; the fs reports the write as finished, and the data gets written to your pool.

For the two sync variants, if you have a crash in between, ZFS will reread the write from your ZIL to write it to your pool.

So for async writes, your SLOG (which is a ZIL, just not on the pool but on a separate device) should not be touched!
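
One way to sanity-check that on the host is the ZIL kstats that OpenZFS on Linux exposes; if the commit counters keep climbing while the guest only writes async, something is turning those writes into sync requests (counter names may vary slightly between versions):

# ZIL commit counters; they should stay flat during purely async guest writes
grep zil_commit /proc/spl/kstat/zfs/zil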

I would not worry too much about the performance itself. The bigger problem of :leafy_green: nodes (or IT systems in general) is that everything works fine until one day it doesn’t. That is why I would be very keen on finding out the issue if I were you.

That is what I don’t like about :leafy_green: IT systems. IT has become complex enough, no need to make my life even harder. This goes for everything, even simple stuff like I would never virtualize a Firewall or run Apps on TrueNAS or bother with CustomROMs or Jailbreaks. I am over 30 now, stuff like that I am so over with :wink:

<driver name="qemu" type="raw" cache="none" io="native" discard="unmap"/>

Libvirt documentation says that cache="none" opens the zvol with O_DIRECT. I know that this bypasses the cache. I sometimes use oflag=direct with dd when trying to measure speed, especially on servers with lots of RAM. Without it, the data gets written very fast to RAM and then dd freezes for a while until everything is written to the disk; even Ctrl+C does not work. It’s very annoying when I’m just trying to measure throughput.

It depends. I would not virtualize the main router/firewall, but other ones can work just fine. I work for a cloud service provider and we have a lot of VMs with pfsense or similar acting as routers for other customer VMs.

I dislike complicated things too, but sometimes it is unavoidable. Ideally, I would have separate hardware that I would use just for the node and so on, but I don’t. I’m trying to use the hardware for other things as well.

The way I understand it is this:
Node writes data to the ext4 filesystem as async. The data gets written to RAM, the fs call returns and the OS starts writing the data to the virtual disk. After some time the OS updates the filesystem metadata and syncs it to the disk. This then causes the data to be written as some form of sync in the host.

I think I found the reason for my load spikes when running the node without the sync setting. It’s very similar to this bug: 217965 – ext4(?) regression since 6.5.0 on sata hdd
However, for me, mounting the filesystem with stripe=0 does not really help.
Still, ext4_mb_regular_allocator, ext4_mb_good_group and ext4_get_group_info use a lot of CPU, so it’s really similar to what is described there. The bug looks like it was introduced in kernel version 6 (and then made worse in 6.5), so Debian 10 does not have it.
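
For anyone who wants to check the same thing, this is roughly how I look at the stripe width ext4 recorded and how the workaround from the bug report is applied (device and mountpoint are placeholders):

# show the RAID stride / stripe width ext4 picked up at mkfs time
tune2fs -l /dev/vda1 | grep -iE 'stride|stripe'
# override it at mount time, as suggested in the bug report
mount -o remount,stripe=0 /mnt/storagenode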

Since I have a backup of the system partition, I can just revert to Debian 10 to avoid this.

EDIT: or maybe not, I have to investigate this further.

The way it looks with dd, adding oflag=direct either bypasses the write cache or makes the system flush it more often.

The alternative was ZFS on top of ZFS. I thought about doing that and then decided it would be stupid. Then again, maybe it would have been better.

dd is another layer of complexity, because you could set something wrong or the program could misbehave. I recently had this problem with iperf: I could not achieve 25 Gbit/s and thought it was a network error. Later I found out that the problem was iperf being single-threaded.

ZFS on top of ZFS is definitely worse. But that is not the only alternative. Datasets over NFS works great.

Again, keep it as simple as possible. First run the benchmark directly on ZFS. Then you will see that sync can never, ever, ever be faster than async. It is impossible. You will also have established a baseline and can see how much performance you lose with KVM. If it is too much, you know something is off.
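
A rough sketch of such a baseline with fio, run directly against a test dataset on the pool before KVM is involved at all (path and sizes are placeholders):

# buffered/async write baseline
fio --name=async --directory=/tank/fiotest --rw=write --bs=1M --size=4G --ioengine=psync
# same test, but with every write fsync'd, so it goes through the ZIL/SLOG
fio --name=sync --directory=/tank/fiotest --rw=write --bs=1M --size=4G --ioengine=psync --fsync=1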

Storj said that NFS was not recommended because of problems with SQLite. Also, using NFS means one of two things: either the VM and the data are on separate servers, or I am connecting a virtual disk from the host to the VM over the network (oh, and the VM and host are on separate VLANs). So that would be even worse than ZFS on top of ZFS.

It depends on how the program behaves. Really. For example, if I copy files, use dd or something else, async will be faster, even if it means that the file copy program will freeze for a while until the dirty blocks are written to the drives.
A Storj node, however, wants low latency. If the IO system freezes for a bit while trying to flush all the dirty blocks (in my case likely due to a bug in ext4), the satellite reacts by dropping the ingress, so a freeze of 1 minute or less results in 9 additional minutes without ingress.

The Storj benchmark program shows over 100MB/s, but it does not take latency into account or how the satellite would behave in reality.

As for the freeze, try this with dd.
dd if=/big/file of=/test/path bs=1M status=progress
dd if=/big/file of=/test/path bs=1M status=progress oflag=direct
For best results, the /big/file should be larger than the amount of RAM. If the filesystem you use does not use compression or ignore zeros, you can use /dev/zero as the source.

What you should see: without oflag=direct dd writes fast (indicated speed higher than the destination disk can do), then fills the write cache up and freezes for a while, then resumes writing. If you try to Ctrl+C out of it, it will freeze for a while before exiting.
With oflag=direct dd writes slower (at the speed of the destination disk), but there are no freezes. If you try to Ctrl+C out of it, it will exit immediately.
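
While the first dd runs, the buildup and flush of the write cache can be watched from another terminal:

# dirty pages pile up quickly, then Writeback spikes while dd stalls
watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'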

Sure. That is why I would leave the DB on the local disk but put the data on a dataset.

Not really. You can also access a local NFS share. Or even better, if it is on the same machine, LXC can directly access datasets and avoid block storage entirely.

No! It really can’t. That is like saying my car can drive faster than the speed of light. AT BEST I can achieve speed of light with my car NEVER EVER faster!

That “freeze” time is the time you get with sync to begin with.

Again, my car can achieve the speed of light, never faster.

It was not possible at first, later Storj made it an option.

So let’s do a thought experiment. I start dd (or the node or any other program) and make it write in async mode. It writes as fast as it can (close to RAM speed), while the OS writes the data to the disk at some slower speed (usually the disk is slower than RAM).
Of course, since the program continues writing to RAM faster than the OS can write to disk, at some point RAM (or whatever portion of RAM is used to store dirty data) fills up. What happens then? The program cannot write anymore, since there is no space in RAM to put the data. So at that point the OS stops the program from writing until it can flush enough of the dirty data to disk (which may take a while) to free up some space in the buffer. The program freezes until that happens.
It does not really matter for something like rsync or dd. If it freezes, then it freezes, waits for a while and continues copying at whatever speed.
The problem for the node is that the satellite notices that the node is losing pretty much all races for the last few seconds and stops giving it new data. Only after 10 minutes or so, the satellite starts giving new data to the node again. So, even if the freeze lasted a few seconds, the drop in ingress lasts 10 minutes.

This can easily be seen with dd, especially if you use it to copy a file from very fast storage (or /dev/zero) to slow storage (HDD) and the file is bigger than the amount of RAM in the system.

Fine, let’s say we just look at ZFS to make stuff easier.

Not exactly. Async writes will go into the TXG first, then get written from the TXG to the pool.
zfs_txg_timeout is 5 seconds by default, so the ZIO scheduler aims for that.
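
The current value can be checked (and changed at runtime) via the module parameter; 5 is the default:

cat /sys/module/zfs/parameters/zfs_txg_timeout
# e.g. shorten the TXG window to 2 seconds for smaller, more frequent flushes
echo 2 > /sys/module/zfs/parameters/zfs_txg_timeout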

Except for the “no space in RAM” part, which I address later, pretty much yes.

And how do you think that would be different with sync?

With sync, the only thing that changes (besides the double-write ZIL, but we will leave that out for simplicity) is that you can’t profit from that write cache and get the worse speed of a full cache from the very beginning. If you have a SLOG, you can get (again, for 5 seconds) the speed of the SLOG.
Then you are comparing SLOG speed with RAM. That is why, at the very best, if your SLOG can handle the incoming sync writes, you get parity with async. Parity, never faster! This is stated multiple times in the ZFS docs.

Let’s do this with real numbers. Let us assume we have a constant 1GB/s of incoming data.
Your RAM does 40GB/s. Your drives only 100MB/s. You add a SLOG with 500MB/s.

For async: the TXG will take that 1GB/s with no problem in the beginning. But soon the cache will exceed the goal of only holding the last 5 seconds, since the HDDs are so slow.
It will then slow down to the 100MB/s the drives are capable of. But remember, we still have 5GB in cache thanks to the TXG that we would not have otherwise.

For sync: writes go to the TXG, but will only be acknowledged after the data really is on the pool, so the speed is 100MB/s from the beginning.

For sync with SLOG: writes go to the TXG, but will only be acknowledged after the data really is on the ZIL (SLOG), so the speed is 500MB/s for the first 5 seconds, then 100MB/s.

Of course this would be the absolute worst case scenario for TXG. Even better would be:

  1. Some kind of wave breaker.
    Imagine incoming data at 1GB/s for one second, with a window of exactly one second to accept it, and then nothing for 10 seconds.
    Async would get 1GB every 10 seconds, sync only 100MB. Sync with SLOG would get 500MB.
  2. Simply faster.
    Imagine incoming data at 1GB/s for one second, then nothing for 10 seconds.
    Async is faster, since the ack goes back when the data is in the TXG, while for sync the ack comes only after we copied the data from the TXG to the ZIL. With a SLOG, you have the ZIL on the SLOG, so it is slightly faster but still slower than RAM.

RAM is hopefully not the limiting factor here. If you have a 1Gbit NIC and ZIO really does a bad job so that it has to hold 10 seconds of data, that would only be 1.25GB of RAM. If you have less than 1.25GB (remember, ZFS can just shrink the ARC), you have other problems :smile:

There is similar logic behind why TrueNAS sells their 500GB SLOG SSDs overprovisioned to 16GB. Both the ZIL and a TXG will never be much bigger than your NIC speed × 5 seconds.

This works well for most programs, but may not work as well for the node specifically, especially with the new selection algorithm.
Or at least that is how it works with ext4 (which is what is inside my VM).

Based on my experience with the node and other programs (where this does not really matter):
In the beginning, the node accepts data at 1GB/s and writes it to memory. The OS starts writing it to disk. After a while the OS notices “oops, the write buffer is almost full, I have to write the data to disk NOW and stop whatever is trying to overfill the buffer”, and so it does. The node stops for a short time while the OS frees up the write buffer.
The satellite notices - “this node is too slow, it’s losing all races, stop all traffic to it for 10 minutes”.
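
The size of that guest-side write buffer is what the kernel dirty-page limits control, so in theory the stall can be shortened (at the cost of earlier writeback) by capping them; the values below are only an illustration, not a recommendation:

# inside the VM: cap dirty data so a flush never has gigabytes to work through (example values)
sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64 MiB
sysctl -w vm.dirty_bytes=268435456             # block writers at 256 MiB of dirty data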

Just found the quote:

Third, the ZIL, in and of itself, does not improve performance.

Source: To SLOG or not to SLOG: How to best configure your ZFS Intent Log - TrueNAS - Welcome to the Open Storage Era

My car can only reach the speed of light, never faster :slight_smile:

That should be 100% the same for sync, just that the OS says right from the beginning “oops, there is no write buffer, I have to write the data to disk NOW and stop whatever is trying to overfill the nonexistent buffer”.

BTW I am not denying that it could work for you in your special :leafy_green: edge case.
But that is an ext4 problem and has nothing to do with ZFS or the SLOG.
To me that is just further proof of why you should try to avoid :leafy_green:

You’re quoting that out of context, as the very next words are “The ZIL sits in your existing data pool by default”. It’s only saying that you always have a ZIL, and that it’s normally just on your HDDs. And obviously any article about ZIL performance tuning is about having it on an SSD… and writes to ZIL-on-SSD are intended to be more performant than ZIL-on-HDD.

But yeah, in normal operation ZILs are write-only: you’re only speeding up the logic that protects you from failures. Like all ZFS performance content, the first rule is always “Give ARC more RAM” :slight_smile:
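
For the “give ARC more RAM” part, the current size and hit rate are easy to check; arc_summary ships with the OpenZFS userland tools (at least in recent versions):

# ARC size, target and hit/miss counters
arc_summary | head -n 40
grep -E '^(size|c_max|hits|misses)' /proc/spl/kstat/zfs/arcstats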

(Edit: are we supposed to be over here?)

And instead of the node freezing for some seconds after writing gigabytes of data, it freezes for 1ms until the write of one file is flushed (and ends up in the SLOG of the host). 1ms may not make me lose a race, but a few seconds will.
That’s the difference.

Yes, and if the rates were higher, I could buy another server just for Storj and store the data without any virtualization, using a ZFS dataset etc.
Ideally I should probably buy a bunch of 1U servers and run multiple nodes, each on a separate server with just a few (or even one) drives. Now that would be fast. I would need another rack though.

On the other hand, this situation is not so bad. I now know about the ext4 bug and may even learn something else, some way to optimize this. That may be useful for me elsewhere.

The way the new node selection algorithm works is that multiple failed races in a row result in up to 10 minutes with almost zero ingress.

So, in this case it would be:
Node A writes for 1 second at 1GB/s, then freezes for 9 seconds, then waits up to 10 minutes until the satellite resumes sending data.
Node B writes at a consistent 100MB/s.