Why is no one discussing the LVM RAID + XFS bundle?

hatred · March 2, 2023, 9:47am

From my point of view, LVM RAID + XFS should be best suited for tasks like storagenode.

It is possible to combine disks of different sizes into LV RAID
It is possible to move volumes allocated to storagenode without stopping
Good fault tolerance - it is determined only for the set of physical disks that the logical volume occupies. That is, if you have 40 disks and 10 nodes, you can withstand the loss of 4 disks in LV RAID 5
SSD Caching

From all my tests, XFS is so far the best suited for nodes with a large number of files. Plus, this file system has optimizations for RAID.

I’m currently testing LVM + XFS in a virtual machine and I’m thinking of launching it into production to replace the hardware RAID5 of 8 HDD.

But it surprises me that there is no information on the operation of this bundle on a large number of small files. What to expect from the features?

BrightSilence · March 2, 2023, 10:09am

LVM, RAID and XFS have all been discussed on this forum. I recommend searching around for some user experiences with each. I’m going to stay away from the RAID discussion in general, because that’s been discussed to death, you can find all opinions on this easily. It’s a divisive issue, I’ll leave it at that.

LVM has uses outside of RAID as well though. It’ll definitely make migrations a lot easier and eliminates downtime during migrations, even if you use it on a single disk.

As I understand it XFS is slower on metadata operations than ext4, which wouldn’t be ideal for running a storagenode. Perhaps that can be tuned, I have no personal experience with XFS.

One note on performance with RAID setups. The bottleneck for storagenodes is usually IOPS. But with RAID, those will hit all disks instead of a single one. If you instead run separate nodes on separate disks, each operation only hits a single HDD. This will end up being a lot faster for storagenodes.
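
A rough sketch of that argument in Python. It takes the worst-case reading of the claim (every random operation touches every member disk) and assumes about 150 random IOPS per HDD; both are illustrative assumptions, not measurements, and real arrays can serve small reads from a single member.

```python
# Toy model: random IOPS for N HDDs used as one striped array versus
# N independent single-disk nodes. All numbers here are assumptions.

HDD_IOPS = 150  # assumed random IOPS of a single HDD


def array_iops(disks: int, per_disk_iops: int = HDD_IOPS) -> int:
    """Worst case: every operation touches every disk, so the whole
    array delivers roughly the IOPS of one drive."""
    if disks < 1:
        raise ValueError("need at least one disk")
    return per_disk_iops


def separate_nodes_iops(disks: int, per_disk_iops: int = HDD_IOPS) -> int:
    """Each operation hits exactly one disk, so the per-disk IOPS add up."""
    if disks < 1:
        raise ValueError("need at least one disk")
    return disks * per_disk_iops


for n in (1, 4, 8):
    print(f"{n} disks: array ~{array_iops(n)} IOPS, "
          f"separate nodes ~{separate_nodes_iops(n)} IOPS")
```

Under these assumptions the gap grows linearly with the number of disks, which is the reason for preferring one node per disk.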

SGC · March 2, 2023, 12:33pm

i know some have benched XFS as being one of the best performing file systems for storagenodes…

however afaik XFS is a very old filesystem, tho reading up on it a bit over wiki, it does seem XFS has been and still is being continually updated to remain a modern filesystem.

all things are usually in balance… so i doubt the extra speed is free, i would rather expect it to be cutting corners some places where EXT4 isn’t.
that being said, RHEL using XFS as their default filesystem, is something that would make me tempted to use it.

but i’m using ZFS and have no plans on switching away from that, its been pretty good, even tho it is a bit of a beast.

do keep in mind that each raid will have the same iops as a single drive, because all writes and reads are striped across the disks in the raid, so they work in harmony, or as one… which limits their iops even tho the bandwidth goes up.

not sure how that makes any sense, raid 5 would mean 1 disk in fault tolerance, so to be able to lose 4 disks you would have 4 x 10disk raid5 in a stripe most likely.
which is usually called raid50

the big issue with running these large raid setups is the time it takes to replace your disks, for raid5 i wouldn’t recommend using more than 5 to 6 disks per raid5 and 10 to 12 for raid6
if you go past those ranges, resilvering / replacement of bad disks will take multiple months.

and really thats one of the most important things to consider when designing a raid setup.
you want to ensure that replacements of disks, doesn’t create long time frames with low or no redundancy.

the replacement / resilver time is higher the more disks you add to the raid, since all disks need to be read and the data processed.

like say comparing a 5 drive raid5 vs a 10 drive raid5.
for the 5 drive resilvering would mean reading the remaining 4 drives while in a 10 drive setup it would require reading 9.

on top of that there is the whole reason raid5 was more or less abandoned for raid6, which is afaik only solved by filesystems that use checksums.

the problem is this, if you have a disk giving out bad data in your raid5 you are down to no redundancy… but how do you know which disk is giving you the bad data…

in most cases the only way is to use disk SMART to see which disk is reporting errors, but that might not always work, or the disk giving bad data might have a good SMART status while another disk that is providing correct data can have a bad SMART status.
then the raid will overwrite the good data with the bad data, because without checksum’s it has no way to verify, what the correct data is.
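
A minimal Python sketch of why single parity cannot locate silent corruption. It is a toy XOR model with made-up 4-byte blocks, not how mdraid or LVM lays out data: the stripe check notices the mismatch, but blaming any one disk and rebuilding it from the others yields a stripe that looks consistent.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(stripe):
    """A single-parity stripe is consistent when all blocks XOR to zero."""
    return not any(xor_blocks(stripe))


def rebuild(stripe, index):
    """Recompute block `index` from all the other blocks in the stripe."""
    others = [block for i, block in enumerate(stripe) if i != index]
    repaired = list(stripe)
    repaired[index] = xor_blocks(others)
    return repaired


data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]       # three data blocks plus parity

corrupted = list(stripe)
corrupted[1] = b"BxBB"                   # disk 1 silently returns bad data

# The mismatch is detected...
detected = not stripe_is_consistent(corrupted)

# ...but every "blame disk i, rebuild it" attempt gives a consistent stripe,
# so parity alone cannot tell which attempt restored the real data.
candidates = [rebuild(corrupted, i) for i in range(len(corrupted))]
all_look_fine = all(stripe_is_consistent(c) for c in candidates)
truly_correct = [i for i, c in enumerate(candidates) if c == stripe]

print(detected, all_look_fine, truly_correct)
```

Only one of the four candidate repairs matches the original data, and nothing in the stripe itself says which one. A per-block checksum, or a second independent parity, is what breaks the tie.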

with raid6 this problem is also avoided, not by using checksums, or in a sense it is checksums, but we call it something else… the parity data of the extra disk will leave the raid with the ability to see which disk is giving out bad data, because it won’t compute with the parity data.

this is why raid5 isn’t recommended for anything today, unless in rare cases like ZFS when it has additional checksum which allows it to detect errors and thus identify which disk is giving it bad data, even without needing raid6.

long story short.
raid and especially large raid setups are very complex and are usually designed with special purposes in mind, raid5 for most setups should be avoided because raid5 is fundamentally flawed.

unless LVM raid5 fixes these issues, it shouldn’t be used and one should use raid6 instead.

however, since raid6 is usually built with larger numbers of drives to reduce the redundancy ratio, the max IOPS of the raid goes down, and Storj storage nodes are pretty IOPS heavy.

ofc that can be offset with cache which might take something like 50-80% of the write IOPS
the fact of the matter is that raid and storj don’t fit well together.

raid reduces IOPS and Storj needs many IOPS
ofc the reliability of raid is good, but it can also cause more problems than it solves if used incorrectly.

which is why i assume you are trying to move away from your 8 HDD raid5, that must have terrible performance.

but i digress…
oh and do try to keep the disks in a raid of the same model, else they tend to wear out quicker or cause problems, also don’t mix sata and sas in a raid

Pentium100 · March 2, 2023, 1:46pm

And then they all fail within a couple of days of each other, because they were manufactured identical, then were used identical so they fail identical.
At least for zfs I now prefer to use different drives in the same vdev. Even if this is slightly slower, hopefully the drives do not fail at the same time.

SGC · March 2, 2023, 3:09pm

so far i’m up to 3 dissimilar disks that have had to be replaced, sure it doesn’t always seem to be the case, but their failure rate sure seems to go up immensely.
aside from that there is also the performance considerations.

also doesn’t really seem that disks fail catastrophically, they usually just start causing errors.
but then again only like 5 years into using raid and so far have only run a total of about 50 drives.

i do push them pretty hard tho… lol
raid’s with low workloads often work fine with the drives that stop being useful for 24/7 high workload operation.

but i will agree that conceptually you should be right, and i guess it’s also a possible long term failure mode, i learned the thing about dissimilar disks when i was doing a deep dive into raid when figuring out what to do myself.
and now 5 years later i’ve seen it happen more than once…

if you really want your data to be safe you would be running mirrors

Pentium100 · March 2, 2023, 5:11pm

I have a raidz2 of 6 drives. It is guaranteed to survive the failure of two drives. A 6 drive raid10 (or 3 mirror vdevs) is only guaranteed to survive the failure of one drive. A raidz2 of 4 drives has the same space, but higher resiliency than a couple of mirror vdevs.
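
The "guaranteed to survive" numbers can be checked by brute force. The Python sketch below is a simplified model: it assumes a raidz2 vdev survives any two failed drives, and a pool of two-way mirrors survives as long as no mirror loses both of its drives. The function names and drive numbering are only for illustration.

```python
from itertools import combinations


def raidz2_survives(failed):
    """Double parity: any two drives may fail."""
    return len(failed) <= 2


def mirrors_survive(failed, pairs):
    """Two-way mirrors: fatal only if both drives of one pair fail."""
    return not any(a in failed and b in failed for a, b in pairs)


def guaranteed_failures(survives, drives):
    """Largest k such that EVERY combination of k failed drives survives."""
    k = 0
    while k < drives and all(
        survives(set(combo)) for combo in combinations(range(drives), k + 1)
    ):
        k += 1
    return k


six_pairs = ((0, 1), (2, 3), (4, 5))
four_pairs = ((0, 1), (2, 3))

print("6-drive raidz2:", guaranteed_failures(raidz2_survives, 6))
print("3 mirror vdevs:",
      guaranteed_failures(lambda f: mirrors_survive(f, six_pairs), 6))
print("4-drive raidz2:", guaranteed_failures(raidz2_survives, 4))
print("2 mirror vdevs:",
      guaranteed_failures(lambda f: mirrors_survive(f, four_pairs), 4))
```

Mirrors can survive more failures when they happen to land in different pairs, but the guarantee only covers a single drive.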

There was a situation when a few drives (same model, bought at the same time, used in the same pool) failed in quick succession, so now I try to use different drives if their prices are similar etc. Another situation was when some Samsung SSDs started producing errors (apparently they have trouble with TRIM and NCQ combination) and were kicked out of the pool - thankfully the vdev was a 4 drive raidz2 and two remaining drives were different.

EasyRhino · March 2, 2023, 7:24pm

just remember the conventional recommendation is to spin up another node when you add a new disk. this way you wouldn’t need redundancy, if one disk dies, that node is just suspended and goes off operation, while other nodes are still running.

i think LVM could be useful if you think you might want to migrate a node from disk to disk though. Unfortunately LVM is still dark magic to me.

Alexey · March 4, 2023, 4:53am

It should be in the mirror setup, not stripe, 0 in RAID50 means RAID0, i.e. zero fault tolerance.
Perhaps @hatred meant 4 separate RAID5 arrays, 10 disks each, but in that case they can run only 4 nodes (it makes no sense to run multiple nodes on one disk/array: they will work as one node anyway, but losing the main advantages, which are spreading the load and reducing losses if a disk is going to die).

hatred · March 4, 2023, 6:00am

Nope.

Here you may find 12 Disk RAID5 survives 6 disk failures.

I found that LVM can do the same. This seriously changes the approach to scaling arrays, allowing you to reduce the number of spare disks in the array.
For example, now I have 3 RAID 5 arrays, and 3 disks are simply wasted. This technology will allow you to use 1 array with disks of different sizes. Of course, unlike ZFS, there is a less friendly set of commands, and in general there is very little information on this technology on the network. But I ran tests - I deleted 3 disks from 8 disks that had RAID 5 LV on them. The data was not lost, but of course there was free space for rebuilding LV and I deleted them sequentially, waiting for the rebuilding to end.

Alexey · March 4, 2023, 6:39am

This is not a traditional RAID5, it’s something different.
Could you please describe how you want to do this with LVM only, without their technology?

hatred · March 4, 2023, 6:49am

LVM allows you to create logical volumes based on extents, not disks or partitions. For every LV you may set the RAID level independently. So you may have RAID 5, RAID 6, a mirror and a single volume at the same time on the same set of HDDs.

When HDD (Physical volume) removed from volume group, LV RAID rebuilds using free space at VG. So, if you have 8x8 Tb LV, and one HDD dies, you may rebuild it on 7x1Tb + 1x2Tb and you will even have 1Tb of free space for non-raid LV.

I think Huawei uses modified LVM, nothing new. HPE 3PAR and some Infortrend SANs do the same.

It is much more complicated than ZFS or hardware RAID, but it is worth it.

Pentium100 · March 4, 2023, 11:43am

The number of disks that are “wasted” with RAID5, RAID6 or whatever is the number of disks that can fail and the array still survive. The reason is simple - you have your data and some redundant disks. Those “redundant” disks can fail and your data will be OK, but there is no way to fit, say, 4TB of data on 3x1TB drives.

With LVM, you can do this at a bit finer granularity than with traditional RAID or ZFS. The rule still applies though, so if you have 8x1TB drives and want the array to survive when 3 of them fail, you cannot have more than 5TB of data there. So, it is pretty much the same as raidz3 in that regard.
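
That rule can be written as a small upper-bound calculation. The Python sketch below assumes the worst case is losing the largest drives and only gives a ceiling: whether a given LVM layout actually reaches it is a separate question. The function name and the mixed-size example are hypothetical.

```python
def max_protected_capacity_tb(drive_sizes_tb, failures_to_survive):
    """Upper bound on data that can survive the loss of any
    `failures_to_survive` drives: whatever still fits on the remaining
    drives after the largest ones are gone."""
    if not 0 <= failures_to_survive <= len(drive_sizes_tb):
        raise ValueError("failures_to_survive must be between 0 and the drive count")
    largest_first = sorted(drive_sizes_tb, reverse=True)
    return sum(largest_first[failures_to_survive:])


# The example from the post: 8 x 1 TB, survive 3 failed drives.
print(max_protected_capacity_tb([1] * 8, 3))

# A hypothetical mixed-size pool, surviving one failed drive.
print(max_protected_capacity_tb([4, 4, 2, 2, 1], 1))
```

With equal drives this is simply (number of drives minus tolerated failures) times the drive size, the same as raidz3 for three failures.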

Being able to use different size drives could be useful though.

hatred · March 4, 2023, 2:50pm

But I still wouldn’t compare ZFS and LVM. The first one is, after all, a file system. I can keep STORJ nodes, torrents, and a media archive within it: delete some movies and there is more space left for STORJ. In the case of LVM, you need to allocate separate volumes for STORJ, movies and the media library. And you can’t free up space so easily, even if you use LVM-Thin. So for now, LVM looks interesting only for constantly growing volumes used purely for STORJ. SSD caching there is also applied per volume, so if you have 1 SSD and a lot of nodes, this can lead to a problem.

BrightSilence · March 4, 2023, 3:59pm

So essentially this is just a way to have a hot spare in the active array. You can still use the same amount of data and lose the same amount to parity and if you want to have the “hot spare” functionality work you need to keep enough space free to rebuild, just like with a hot spare. It doesn’t really actually help with reliability as it still has to survive the rebuild, otherwise you’re screwed anyway. It’s a lot more flexible though, I’ll give you that. I like the tech, though I wonder about the performance implications as well. It could be really nice on larger arrays to have something like a 6+2 parity setup spread over 24 HDD’s. Though I’m not entirely sure LVM would even do that, but please correct me if I’m wrong.

The interesting thing is that this is actually not unlike how Storj works on a larger scale, if you consider nodes as HDDs and each segment as its own RAID array (technically RS encoded data). And it’s actually also very similar to how Backblaze (and I’m sure many others) spreads their data over several storage pods.

Toyoo · March 4, 2023, 7:08pm

Be careful with redundant RAID levels implementation by LVM. There are two points to consider here.

While it uses mdraid under the hood, you cannot tune it. And the defaults are rather unfitting to Storj—the defaults create a granular write intent map, which speeds up recovery, but slows down every single write. You can reduce the impact to a degree by setting --regionsize 64 (or a similar large value).
This implementation does not safeguard against silent errors, i.e. stored data that is not correct, but the HDD did not detect errors. Many consumer drives do not have good error detection. You then need an additional feature to detect these errors, like dm-integrity integrated into LVM. This, again, comes at a cost by slowing down all writes.

The above does not even consider larger stripe sizes of parity RAID levels, which should be taken into account at the file system level—and if you don’t, well, again, slower writes.

With these in mind you may actually get better results with zfs, or even btrfs, if you fail to do your homework.

hatred · March 4, 2023, 7:32pm

Thank you. Do you mean “64M” or 64K? Default is 512K

Toyoo · March 4, 2023, 7:40pm

IIRC the default unit if not given is mebibyte, so that would be 64 mebibytes.

hatred · March 4, 2023, 7:51pm

From what I see on Storj nodes, writing is not a problem at all. Both ZFS and XFS use delayed writes, and given the speed of the Internet, the array rarely writes more than 4 MB/s.

nyancodex · March 4, 2023, 7:52pm

cost? lol. the hell.

Toyoo · March 4, 2023, 8:15pm

The bottleneck is not in sequential writes, but random writes due to file system updates. I’ve written about the write path on ext4 here. I don’t have exact knowledge on XFS and ZFS, though I recall reading that XFS and default setup of ZFS would be worse than ext4 in this regard, and a careful setup of ZFS (one with an SSD cache) may be better.

Anyway, in a theoretical scenario of one 2.3 MB upload per second your array will indeed show only around 3-4 MB/s of traffic, but it will eat up 10-20% of the available IOPS on a single-drive ext4 setup. In a theoretical scenario (though I think we may see it during some peaks of traffic) of five 500 kB uploads per second your drive will also show only 3-4 MB/s of traffic, but you will start getting latency problems due to insufficient IOPS.
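
The two scenarios can be put side by side in a few lines of Python. This is a sketch under one stated assumption: the IOPS cost of an upload is roughly independent of its size, so the 10-20% figure quoted above scales linearly with the number of uploads per second. The function and constant names are illustrative.

```python
# Share of a single drive's IOPS consumed by ONE upload per second,
# taken from the 10-20% range quoted in the post.
IOPS_SHARE_PER_UPLOAD = (0.10, 0.20)


def load(uploads_per_second, upload_size_mb):
    """Return (payload MB/s, low IOPS share, high IOPS share).

    Assumes the IOPS cost per upload does not depend on its size."""
    throughput = uploads_per_second * upload_size_mb
    low, high = (uploads_per_second * share for share in IOPS_SHARE_PER_UPLOAD)
    return throughput, low, high


for name, rate, size in (("one 2.3 MB upload/s", 1, 2.3),
                         ("five 500 kB uploads/s", 5, 0.5)):
    mb_s, low, high = load(rate, size)
    print(f"{name}: {mb_s:.1f} MB/s payload, "
          f"{low:.0%}-{high:.0%} of the drive's IOPS")
```

The payload bandwidth is nearly identical in both cases, while the estimated IOPS share grows fivefold, which is where the latency problems would come from.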