I would like to migrate my nodes (ext4) into a ZFS pool with caching, because I think it’s easier to enlarge the pool with more HDDs than to clone each disk onto a bigger one. For example, I have 3 nodes on 3TB HDDs; once all drives are full I would have to buy a larger drive and clone onto it, instead of just throwing in another spare 3TB drive I already have and expanding the pool. I also have a 1TB and a 4TB SSD lying around which I could use as an L2ARC cache. With RaidZ1 I would also get some failover, so if one drive dies the node isn’t gone and doesn’t have to be redone. How much latency would I add, and how many races would I lose, compared to ext4?
I have a Linux server running both ext4 and ZFS nodes. All nodes are single-disk; the ZFS disks have a small SSD partition assigned as L2ARC for metadata only. Both are performing well and race wins are around 99 percent.
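The ZFS side of that is set up roughly like this (pool and device names are placeholders for my actual ones):

```
# Attach a small SSD partition to the pool as L2ARC
zpool add tank cache /dev/disk/by-id/ata-SOME-SSD-part4

# Only cache metadata on the L2ARC, not file data
zfs set secondarycache=metadata tank
```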
I do not recommend any setup with redundancy (raid). Storj doesn’t need it and it would be just a waste of space in my opinion. If a disk is full just start another node with the next disk.
Basically I echo alpharabbit. Storj is fine with no redundancy. And ZFS isn’t slow, because Storj isn’t demanding. When it’s running normally it’s basically idling and performing tiny sporadic reads+writes. It’s only the periodic housekeeping tasks (like the startup used-space-filewalker) where it uses all your HDD IOPS (which ZFS SSD caching tiers soak up beautifully).
Remember adding HDDs… means adding power. I’d rather fill a 10TB and then migrate to a 20TB when it fills. But I’m not your mom
Edit: if you just want online resizing, LVM will give you that with ext4 as well.
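A minimal sketch of that, assuming a VG called vg_storj and an LV called node1 (names are placeholders):

```
# Add a spare disk to the volume group backing the node
pvcreate /dev/sdd
vgextend vg_storj /dev/sdd

# Grow the LV into the new space and resize ext4 online in one step
lvextend -r -l +100%FREE /dev/vg_storj/node1
```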
It sounds like you want to regularly migrate your node’s data from one drive to another (bigger) one?
Migrating using rsync on ext4 can take weeks, while migrations using zfs send/receive are much, much faster. I just did a comparison and transferred a 4 TB node (inside its own dataset) out of my RAIDZ1 (9x18TB disks + 2TB L2ARC) pool to a single 18TB ZFS drive.
With rsync it took around 2 weeks and with ZFS send/receive the same data was transferred in roughly one day.
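For anyone who hasn’t done it before, the send/receive part looks roughly like this (pool, dataset and snapshot names are placeholders):

```
# First pass while the node is still running
zfs snapshot bigpool/node1@migrate1
zfs send bigpool/node1@migrate1 | zfs receive newdisk/node1

# Stop the node, then send only what changed since the first snapshot
# (-F rolls the target back in case it was touched in between)
zfs snapshot bigpool/node1@migrate2
zfs send -i @migrate1 bigpool/node1@migrate2 | zfs receive -F newdisk/node1
```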
More or less. I have a PC/server case with room for 12 HDDs and several 3TB drives lying around. Instead of buying new drives to enlarge each node, it would be easier to put in a drive I already have (until the case is full) and add it to the pool to enlarge it dynamically, since the nodes are growing slowly.
You could also just start one node on each of your drives, so you would not have to rebuild the pool and risk the data whenever your nodes outgrow your drives.
If they used LVM, it would take even less time than with ZFS, because it doesn’t require several consecutive send/receive runs - it’s a one-time action, handled 100% automatically. At the end you just remove the old PV from the volume group. You don’t even need to change the node config at the end - everything is transparent and smooth.
Oh, no. You will need to modify either the docker run parameters or the config, because you would need to increase the allocation.
It doesn’t matter if you enabled the whole-disk allocation feature.
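For the config-file route that’s a one-line change plus a restart (the path and size here are just examples):

```
# Raise the allocated space in the node's config, then restart it
sed -i 's/^storage.allocated-disk-space:.*/storage.allocated-disk-space: 6.00 TB/' /mnt/node1/config.yaml
docker restart storagenode
```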
You can just add drives/vdevs to a ZFS pool as well, no send-receive, if that’s what you want. The data wouldn’t initially be balanced across all the space unless you did a send-receive… however… the same is true with LVM.
And as for removing older/smaller LVM PVs… you’d need to pvmove extents off them first. Which takes time. The same time ZFS would take doing the same thing with a ‘zpool remove’ command.
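Side by side, the two flows look roughly like this for a plain striped pool and a plain VG (device, pool and VG names are placeholders):

```
# ZFS: add a single-disk vdev, later evacuate and drop it again
zpool add tank /dev/sdd
zpool remove tank /dev/sdd    # copies its data to the remaining vdevs

# LVM: add a PV, later move its extents off and drop it
vgextend vg_storj /dev/sdd
pvmove /dev/sdd               # relocates all extents to other PVs
vgreduce vg_storj /dev/sdd
```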
No, it will rebalance, that’s the difference. However, that would be true only for RAID setups. For the simple temporary mirror (which is how it’s implemented under the hood) there is of course no rebalance, except that the PV’s extents end up contiguous, so it’s a kind of defragmentation…
But I would agree, it works almost the same if you add the disk to a STRIPE pool and then remove the original one. For a mirrored pool or raidz1/2/3… it will not work the same way; there is some friction (in both systems, though of different kinds).
The huge difference between the redundant arrays: ZFS will recover data corrupted by bitrot, LVM will not by default. I tested this. However, in my test env LVM was several times faster in the same setup - of course with default settings, no tricks like L2ARC or a special metadata device, and only 2GB of RAM.
Under normal, light traffic the speed difference doesn’t matter. Under heavy test traffic, an L2ARC is greatly beneficial and nearly necessary (like other metadata caching solutions).
It looks like you have 3 nodes, one on each independent disk. That’s good.
If you want to expand, create another node. Don’t use any redundancy (it will pay you less money), don’t do any RAID0 (it will increase odds of failure). And the best performance is with disks operating independently.
Also bear in mind the max theoretical size of a node is a little over 24TB, and the max practical size is smaller because it’s hard to fill such a large disk with typical traffic.
Now the idea of getting a larger drive and doing a zfs send? Yes that can be much faster than doing a rsync.
I had ZFS with an SSD cache and special metadata, and it was slow for me.
I migrated everything to LVM with ext4, with a cache drive on it, and it feels way faster.
On the pvmove: I recommend doing pvmove in chunks instead of in one go, like pvmove /dev/sda:0-10000 and so on. That way, if you have a reboot, you don’t have to start all over again - only the last chunk that did not finish.
I tested it; it works fine even after kernel panics and continues from where it stopped last time.
See, it’s easy to check in a test env:
You may terminate the VM (it’s better to use a VM for the test) in the middle of pvmove. Then boot and run it again; it will continue from the last moved PV chunk.
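Something like this, for example (loop devices stand in for real disks; names and sizes are placeholders):

```
# Build a throwaway VG from two image-backed loop devices
truncate -s 1G pv1.img pv2.img
LOOP1=$(losetup -f --show pv1.img)
LOOP2=$(losetup -f --show pv2.img)
pvcreate "$LOOP1" "$LOOP2"
vgcreate vgtest "$LOOP1" "$LOOP2"
lvcreate -n lvtest -l 80%FREE vgtest
mkfs.ext4 /dev/vgtest/lvtest

# Start moving extents off the first PV, then hard-kill the VM mid-move
pvmove "$LOOP1" "$LOOP2"

# After boot (and re-attaching the loop devices), a bare pvmove
# resumes the interrupted move from its last checkpoint
pvmove
```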
In my experience on Ubuntu, it would always restart from around 10% or lower. This happened when I was not on a UPS and the power on that outlet would randomly go out for 1-2 seconds.
This happened multiple times, so I just wrote a simple seq command and piped it over to xargs to move 10000 blocks at a time.
Haven’t had the need to pvmove for a while, since growth has slowed down and the trash problem was fixed in one of the updates.
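The loop was something along these lines (device names and the extent count are placeholders; the real total PE count comes from pvdisplay):

```
# Move extents off /dev/sda to /dev/sdb in 10000-extent chunks, so a
# reboot only costs the chunk that was in flight (60000 extents here
# is an example; check the actual count with pvdisplay first)
seq 0 10000 59999 | xargs -I{} sh -c 'pvmove /dev/sda:{}-$(({} + 9999)) /dev/sdb'
```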
Hm, I see. I didn’t have an experience like that at the time, so maybe something has changed. I saw that it continued the move after a reboot, but not from 10%; otherwise I would never have been able to finish it - it was restarted several times during the process.
Just wait till the first drive (PV) of your LVM array fails. Probably many LVs will be corrupted by then. Besides, I thought we were on the path of using one disk per node?
I myself use ZFS with the special metadata vdev mirrored across three SSDs (which are partitioned using LVM, in order to use them for many nodes). No L2ARC, just the in-memory ARC, since there’s no benefit to it with this setup.
In practice I’ve never had so few glitches and such a stable, low workload as since I switched to this, so I don’t want to revert back any time soon…
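For reference, adding a special vdev like that looks roughly like this (pool and partition names are placeholders):

```
# Three-way mirrored special vdev: metadata written from now on
# lands on the SSDs instead of the HDDs
zpool add tank special mirror /dev/sdx1 /dev/sdy1 /dev/sdz1

# Optionally also keep very small blocks/files on the special vdev
zfs set special_small_blocks=16K tank
```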
Yes, I know that LVM pools are much less resilient. But you know what? They are faster and more convenient, and require far fewer resources before they stop feeling like they’re crawling. I don’t want to say that ZFS is bad - rather, it’s the best FS so far - but it requires more resources (which the storagenode doesn’t need).
Exactly. With basic requirements, LVM+ext4 wants much less RAM than ZFS in the same conditions to be on par in IOPS. Especially for small files, you know.
Of course, it could make a difference if you run more than one node. I didn’t test this. But my gut tells me that LVM+ext4 will still be faster in the same conditions (simple pool, no RAID, no metadata device). Please prove me wrong. ZFS’s ability to recover broken data doesn’t apply here either, because it’s a simple pool - no mirror, no RAIDZ.
Yes, this! Even a lot of memory for ARC doesn’t really help: with Storj’s random IO spread over millions of files, you’re unlikely to have often-used or recently-used data in memory anyway. So low memory, no L2ARC, no ZIL-on-SSD is fine… if you have that SSD special metadata device. All the housekeeping tasks become instant.
Remember that you can split and merge LVM arrays at will. You can do one LVM array per drive with a single physical volume each, and just for the duration of the copy, merge two of them.
Besides, LVM metadata is stored in a versioned way on each PV, so on failure of a physical volume, every logical volume that was not on the failed drive should still work. I once did recovery of an LVM array where one of the SSDs in the array went read-only; I barely noticed there was a failure.
But in that case, you can leave out LVM altogether for the sake of simplicity, and just dd the moment you want to move a node to another drive, then expand the filesystem afterwards.
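A rough sketch of that dd route, assuming an ext4 filesystem directly on the whole disk, with sdX/sdY as placeholders for the old and new drive (triple-check device names before running anything like this):

```
# Stop the node first, then clone the old disk onto the bigger one
dd if=/dev/sdX of=/dev/sdY bs=64M status=progress conv=fsync

# Grow ext4 into the extra space on the new drive
e2fsck -f /dev/sdY
resize2fs /dev/sdY
```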