Doh… looks like a hardware failure coming up!

Stuff like stripe width in ZFS isn’t fixed; record sizes are dynamic, and the only thing one does is set a maximum.
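
For example, a quick sketch (the dataset name is just a placeholder): setting recordsize only caps how big a record can get, it doesn’t force every record to that size.

    zfs set recordsize=1M tank/media    # cap records at 1 MiB for this dataset
    zfs get recordsize tank/media       # check the current maximum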

In regard to rebalancing, ZFS will rebalance data between the different RAIDZs in the same pool, and it will also dynamically sort I/O so that smaller writes go to the more full RAIDZs inside a pool.
Sure, ZFS is a bit more difficult to work with because you expand by adding more RAIDZs or mirrors to the pool.
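
Expanding looks roughly like this (pool and disk names are made up):

    zpool add tank raidz2 sdh sdi sdj sdk sdl sdm    # add another RAIDZ2 vdev to the pool
    zpool add tank mirror sdn sdo                    # or add a mirror vdev instead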

ZFS is very advanced and does a lot of stuff to make things run better, but it’s simply not designed to work optimally in small setups; in larger setups it does seem to have a lot of advantages.

Been using ZFS for 22 months now and I’m still learning how to best use it.
Comparing ZFS to LVM is like comparing a tank to a racing car; it barely makes sense…
Sure, RAID helps make stuff more secure, but inject some bad data into the RAID and it can screw it all up.
Only checksums can guard against this, so that the system can know which data is bad and which is good… else it’s just guessing based on SMART data and the like.

Sure, I might be a ZFS zealot because I consider it superior to everything else; I consider data storage to be about data integrity, and ZFS is just the undisputed champ at data integrity.

It doesn’t (unless this is a new feature in the latest version). If I have a full pool and add a new vdev, the new vdev will stay empty unless I write more data to the pool. ZFS will not move half of the data from the full vdev to the new one.
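
You can see this per-vdev imbalance yourself; something like the following (pool name is just an example) reports how much is allocated on each vdev:

    zpool list -v tank    # SIZE / ALLOC / FREE are shown per vdev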

It doesn’t work like you think it does. Over time, with normal usage (deletes and the pool continuing to fill up), the overall data gets rebalanced based on I/O activity, record write sizes and the capacity of the different vdevs, thus giving you near-optimal usage of the hardware involved… sure, it might take a while, but it works and is actually a pretty clever solution.

I dunno when it was introduced; I would assume it’s an old feature, as data balancing over time is pretty critical for good operation of the hardware… not much point in having RAIDZs sitting idle.

Yes, with writes and deletes, the data will be “rebalanced”. However, zfs does not do that for static data (imagine a pool that is written to but files are never deleted).
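
The usual workaround I’ve seen for static data is to rewrite it so ZFS allocates it anew across all vdevs; a crude sketch (paths are placeholders, and it assumes enough free space and no snapshots pinning the old blocks):

    cp -a /tank/data /tank/data.rebalanced    # copying forces new allocations spread over all vdevs
    rm -rf /tank/data
    mv /tank/data.rebalanced /tank/data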

Some hardware RAID controllers can do this when expanding the array. It usually takes a long time, but after that, the data is distributed to all drives, even if files were not accessed during that time.

LVM can probably do that as well.

The difference is that for data that is not written frequently (or at all), read performance does not change when adding vdevs to a zfs pool, but it can go up when expanding a hardware RAID1 array into RAID10.

But if one wanted read performance, wouldn’t one add L2ARC or memory? But yeah, I know that rebalancing is missing; I’m just not sure how often it’s really useful in real-world environments.
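
Something like this is what I mean by adding L2ARC (pool and device names are just examples):

    zpool add tank cache nvme0n1    # attach an NVMe device as L2ARC read cache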

Also, can you even do that… since records are not striped across RAIDZs but are written to or read from a single RAIDZ or mirror vdev.
The data simply would never exist on multiple vdevs like that; if one wanted load balancing or such, it might be easier to do at the network level.

I like how my ZFS moved around the I/O and the records written depending on which of my RAIDZs were performing the best… had a disk act up a while back, and ZFS simply moved the worst of the writes to a RAIDZ1 with less free capacity until I got the problem solved.

But since a record only exists on one RAIDZ, I don’t think balancing is possible.
On the flip side, ZFS can have much more I/O because it doesn’t stripe across all the disks.

Not only does ZFS not do the rebalance, you also cannot change the stripe count, which is the only way to increase performance when you add a disk to a classic array with parity. Other competitors have this ability; at least LVM allows you to convert your array to use 3 stripes instead of 2 when you add a new disk and improve performance that way.
However, it comes at the cost of degraded performance during the conversion.
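
If I read the LVM docs right, the reshape goes roughly like this (VG/LV names are placeholders, and it needs an lvm2 new enough to support raid reshaping):

    vgextend vg0 /dev/sdd              # add the new disk to the volume group
    lvconvert --stripes 3 vg0/datalv   # reshape the raid5 LV from 2 data stripes to 3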

Since ZFS was not designed for home usage, where the degraded performance during conversion would not be a show stopper, it doesn’t have this feature, and that is normal and expected.

With RAID, adding more disks only gives you more bandwidth; the IOPS remain the same, and it’s generally always the IOPS that become the limitation. But yeah, you cannot add disks to a RAIDZ because the “stripe size” is dynamic.

If one wants to increase performance while adding fewer drives, the recommendation is to use mirrors, due to the inherent I/O limitation of RAID: since all disks in a RAID have to work in sync, it’s not possible for a RAID to go beyond the IOPS of a single disk, even if it can have much higher bandwidth. But this still only has limited application.
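
Rough sketch of what the mirror layout means in practice (pool and disk names made up): every mirror vdev you add contributes roughly one more disk’s worth of IOPS.

    zpool create tank mirror sda sdb mirror sdc sdd    # two mirror vdevs, roughly 2x the IOPS of one vdev
    zpool add tank mirror sde sdf                      # each extra mirror vdev adds more IOPS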

Yes, it would be nice if ZFS had this fundamental feature of normal RAIDs, but their ability to do this also locks their stripe sizes, meaning there is a limit on the minimum file size,
often something like 64 KB or more, while ZFS can go all the way down to the sector size multiplied by the number of disks…

Something like Ceph will be able to increase I/O performance, but in practice it has proven difficult, at least according to the lecture I saw from CERN about how they set up their Ceph.
But with a replication-based setup… not that Ceph has to use replication, it can use erasure coding… but in a replication system one gets the advantage of more I/O when adding more disks, and one can specify which files one wants to be more performant and which ones one is less worried about losing…

Replication vs RAID are two very different systems… RAID is not made for performance; it’s made for economical, redundant data storage, while replication-based setups can in theory have near-infinite performance. Somebody wanting a high-performance setup rather than an economical one would never go RAID to begin with.

Note that LVM mirrored raid (unless you use the outdated --type=mirror stuff) uses default mdraid settings. mdraid, if set up manually, can be tuned, e.g. by placing its write intent bitmap elsewhere or making it less granular. Otherwise, performance is less than stellar.

I’ve made some tests on 2×NVMe drives, and the defaults reduced write throughput even tenfold in some scenarios, while making the bitmap less granular made it close-ish to raw disk performance.
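
For reference, the knob I mean is roughly this (device names are examples, and the chunk size is something you’d want to test against your own workload):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal --bitmap-chunk=128M /dev/nvme0n1p1 /dev/nvme1n1p1
    # on an existing array the bitmap can be dropped and re-added with a coarser chunk:
    mdadm --grow /dev/md0 --bitmap=none
    mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=128M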

LVM raid actually has support for integrating dm-integrity. Try lvcreate’s --raidintegritymode switch. Obviously, with performance impact.
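
Roughly like this, if I remember the syntax right (VG/LV names and size are placeholders):

    lvcreate --type raid1 -m1 -L 100G -n securelv \
             --raidintegrity y --raidintegritymode journal vg0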

As an alternative if you also encrypt data, LUKS2 can do integrity as well. If you encrypt LVM’s physical volumes, you may no longer need filesystem-level or LVM-level integrity checks.
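
Something along these lines (the device is a placeholder, and obviously this wipes it):

    cryptsetup luksFormat --type luks2 --integrity hmac-sha256 /dev/sdX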

my LVM has version

 LVM version:     2.03.07(2) (2019-11-30)

And there is no raidintegritymode
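
A quick way to check whether a given build has it is just grepping the help output:

    lvcreate --help 2>&1 | grep -i raidintegrity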

Mine has it, LVM version: 2.03.11(2) (2021-01-08). Debian Bullseye.