Performance: Optimal RAID Stripe Size And Read Ahead For HDDs In RAID 50?

I'm not a fan of RAID5-type setups without checksums on the files, because a RAID5 cannot determine the location of an error and thus has to guess which data is correct.

With RAID6 this is less of a problem because the drives can, in essential terms, vote: if one disk returns incorrect data, the two other locations of that data show which copy is corrupt.
Still not as secure as having checksums, though it does have a higher fault tolerance.
However, with RAID6 the required array size goes up… a RAID5 setup can work with 3 drives and only lose 1/3 of its capacity to redundancy, while for RAID6 to match that ratio you need 6 drives, of which 2 are redundant.
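
A quick way to check that ratio claim (a small illustrative calculation, nothing from the original posts):

```python
# Usable-capacity fraction for the two layouts mentioned above (illustration only).

def usable_fraction(total_drives: int, parity_drives: int) -> float:
    """Fraction of raw capacity left after setting aside parity drives."""
    return (total_drives - parity_drives) / total_drives

print(f"RAID5, 3 drives: {usable_fraction(3, 1):.0%} usable")  # 67% -> 1/3 lost
print(f"RAID6, 6 drives: {usable_fraction(6, 2):.0%} usable")  # 67% -> same 1/3 lost
```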

Because the max IOPS for a RAID array is basically one disk's worth, even if the higher bandwidth can help reach a higher average IOPS and throughput, the array still gets far fewer max IOPS relative to how much hardware goes into it.
The max IOPS of a RAID array is about the same as a single disk because all the disks work in unison: one write is a stripe across all the drives.
RAID isn't great for storagenodes, but it does limit errors and would in most cases extend the potential life of a storagenode, at least currently…
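
As a very rough sketch of that "one disk's worth of IOPS" argument (the per-drive IOPS figure and the full-stripe-write assumption are mine, purely for illustration):

```python
# Crude random-IOPS model for a 12-drive array of 7.2K HDDs (assumed ~150 IOPS each).
HDD_IOPS = 150
DRIVES = 12

# If every small write becomes a full-stripe write, all drives move together,
# so the array's random-write ceiling stays close to a single drive's:
parity_write_iops = HDD_IOPS

# Reads can be spread across the data drives, so read IOPS scale much better:
raid6_read_iops = HDD_IOPS * (DRIVES - 2)    # best case, data drives only
raid10_write_iops = HDD_IOPS * DRIVES // 2   # each write lands on one mirror pair

print(parity_write_iops, raid6_read_iops, raid10_write_iops)  # 150 1500 900
```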

Both software and hardware RAID have their own unique advantages.
Hardware is much more streamlined and easier to manage, but software ensures that the RAID will survive the death of the hardware RAID device and gives you the option of adding new features as the software develops.

I really like ZFS as a RAID solution; it's an amazing piece of software, and I would argue that aside from the ease of use and speed of hardware RAID, ZFS is vastly superior to most if not all current hardware RAIDs… though I'm sure they will copy it eventually, if they aren't already.

I was really wondering whether to use RAID 60, as it has greater fault tolerance than RAID 50 at the cost of a bit of read performance and capacity. If more than 1 disk fails in each sub-array… oops! Neither gets any write performance benefit compared to RAID 10, though. I am not fond of software RAID. I would not risk running the OS on an SD card simply to be able to form a software RAID, or sacrifice 2 disk slots for SSDs to install the OS on, leaving less capacity for storage. It depends on what hardware you use. I use HP ProLiant 380p Gen 8 servers with 12 LFF HDDs of 4 TB and their integrated P420i RAID card with 2GB FBWC; they also have the P822 2GB FBWC with HBA mode enabled, so HP D2600 MSAs with 12 LFF HDDs of 4 TB each get attached to them in dual domain on both I/O modules.
I also have some old HP ProLiant 380 Gen6 servers and am wondering whether to get some P411 cards with 1GB FBWC and add more such MSAs to them, too. The way I see it, sacrificing 2 IPs and 2 network ports per MSA attached to a server is worth it. Let's see what else is there in the home lab… :laughing:

Pretty sure I would just throw them all (or most) into one large RAID6 then.
It all comes down to that capacity loss… it's just so high when doing smaller arrays.
12 drives is at the upper end of what is a good idea, but with 4TB HDDs it wouldn't bother me one bit…

But I sure wouldn't like to do that with, say, 18TB drives.

Mirrors are pretty nice; they have about the same write IOPS as one RAID array and double the read IOPS…

But yeah, I like to reduce the capacity loss with RAID arrays… I run 5 x RAIDZ1, which is basically five RAID5s with 6 drives each, but with checksums in ZFS I don't have to worry about data corruption or similar issues…
Of course it still gives me bad nerves when a disk goes down for whatever reason… and it happens more often than I care to admit lol

Of course, you could also start out smaller and then expand as needed; only the really old nodes require RAID for life extension / security… I would hate to lose my node now that it's 2 years old.

Exactly. So maybe I would go for the RAID 60.


Should be enough for 2 nodes - 16 TB each.


This is more like it - 2 nodes of 20 TB each, but the fault tolerance is lower.

Actually, I could not find larger 6Gbps SAS-2 drives. :smiley: That is what the RAID controller supports…

That's odd; usually once you are above the 2TB limit you can generally keep going up. Of course, there might be some limits because the hardware in the RAID controller might have limited memory or indexes or whatever…

But with RAID 60, rather than just a regular RAID6, you have 4 redundant disks rather than 2…
Then you are up to spending 1/3 of your capacity on redundancy, and you only get twice the IOPS in return…
While with a regular RAID6 across 12 HDDs you spend just 1/6 of your disks on redundancy.

Might not seem like much… but think about it this way: with 12 drives and 4 used for redundancy you have 8 drives' worth of capacity… so reducing the redundant drives by 2 gives you 10 usable drives, a 25% increase in capacity (see the sketch below).
Sure, it costs you the doubled IOPS… but if you really want IOPS, then something like RAID 10 would be superior.
But again, with SSD prices these days, doubling the cost of your storage quickly gets you into or near SSD territory… so really, sacrificing capacity for IOPS… HDDs are for capacity; that's what they are good and affordable at… one really needs to make use of that, imo.
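
The 25% figure above checks out; here is the arithmetic, with 4 TB drives assumed just for concrete numbers:

```python
# 12-drive comparison from the post above (4 TB drives assumed for illustration).
DRIVE_TB = 4

raid60_usable = (12 - 4) * DRIVE_TB   # two 6-drive RAID6 sets, 2 parity drives each
raid6_usable  = (12 - 2) * DRIVE_TB   # one 12-drive RAID6, 2 parity drives total

gain = (raid6_usable - raid60_usable) / raid60_usable
print(raid60_usable, raid6_usable, f"{gain:.0%}")   # 32 40 25%
```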

Maybe you did not understand. I was referring to not being able to find larger-capacity disks at a good price. I get these for ~$45 each - nowhere near SSD prices. Higher-capacity HDDs (over 4TB), if available at all, have insane prices.

This is why the thread was about RAID 50 performance. That is what I intend to use on new nodes. I am already using it on some nodes and performance seems OK. I even have an 8TB node with RAID 5 on a 4 LFF 1U server, which is also used for other stuff. Performance there is great, too. It is actually one of my most productive nodes.
RAID 60 is more secure, but with less capacity and, I believe, pointless redundancy. It is very rare for more than 1 disk from a RAID 50 array to fail. And usually when they begin failing, you know in advance - the servers' management ports warn when drives are about to fail, so they get replaced in time before they are totally screwed. Some screwed-up drives can even be used further in home-lab environments for testing. A low-level format and Dmitry Postrigan's MHDD can do wonders for such drives.

With RAID5 it's not really a disk failure that worries me… it's bit rot or just plain random errors for whatever reason… nobody should be using RAID5 today… it's simply unsafe.
It's mirrors or RAID6.

If you inject bad data into a disk of a RAID5 array, the array will most often die… not so for RAID6.


If one does not know how to protect a server, then maybe - yes.

I kind of agree with this now that I think about it again. In this case it is more than 4 disks (12 to be exact) and it is better to use RAID 6 over RAID 50. I would not use RAID 5 for sure… RAID 5 is OK with 4 disks, but with more than 4 disks, RAID 6 (preferred) and RAID 50 are better.

RAID 50 - tolerates at least 1 drive failure; one disk from each RAID 5 set can fail without data loss.
RAID 6 - any 2 drives can fail without data loss, no matter which 2 they are.

Both have the same capacity and speed benefits - sacrificing 2 disks for parity, roughly 10x read speed, and no write speed gain - except that with RAID 6 it does not matter which 2 drives fail and the system keeps running without hangs or data loss, which is key when deciding what RAID to set up. I would not go for RAID 60, though…
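
To put the "no matter which 2 drives" point in numbers, here is a small combinatorics sketch; the 12-drive layout with two 6-drive RAID 5 groups is an assumption taken from this thread:

```python
from itertools import combinations

DRIVES = 12
GROUP = 6   # RAID 50 modelled as two 6-drive RAID5 groups (assumption for illustration)

total = survives_raid50 = 0
for a, b in combinations(range(DRIVES), 2):
    total += 1
    if a // GROUP != b // GROUP:   # RAID50 only survives if the failures hit different groups
        survives_raid50 += 1

print(f"RAID6  survives {total}/{total} double failures")            # any 2 drives may fail
print(f"RAID50 survives {survives_raid50}/{total} double failures")  # 36/66, roughly 55%
```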

RAID5 is today considered unsafe because if bad data gets onto a disk of the array, the array has a 50/50 chance of dying…

That is why nobody recommends using RAID5 or RAID50; I just call it RAID5 because a RAID50 is just a stripe of two RAID5s. So yes, it will give better read and write IOPS than a pure RAID6 with the same hardware.

Disk failures aren't really a big worry when running RAID, because if a disk just fails, the array easily survives; it's errors in the existing data, for whatever reason, that become the problem, as shown in the video I linked a few posts ago.

Wendel (the guy in the video) shows how a RAID5 will die from good data being replaced with bad data on a single drive; even though the data still exists and can be recovered, the RAID5 array simply isn't aware enough to see that.

RAID5 is flawed and should never be used for extended periods.
I'm sure there are newer versions that have solved that problem using checksums or whatever… but without that, the odds of the data surviving long term might be better on a single disk… sure, there will be an error here and there from time to time… at a rate of roughly a bit per TB written.
But the data survives with the error, and only upon a complete disk failure is the data lost…
With a RAID5, the data survives a disk failure (something that is fairly rare),

but it can die from something as simple as a bad cable connection or whatever else can cause data errors (something that is extremely common).

Thus the chance of a RAID5 surviving long term is actually lower, because it depends on more disks and is more vulnerable to the most common types of errors, since it has more disks, cables, and whatever else to cause them.

And trust me, you can just look at an HDD the wrong way and it will throw errors; they are the modern magnetic version of vinyl records.

I just put 6 new disks into my ZFS pool at the beginning of December, and I am still trying to figure out why one of the new disks keeps throwing errors; currently hoping that moving it to a different bay fixes it…

I've seen so many errors on my ZFS pool, stuff that my hardware RAID never really told me about in the past; without checksums my RAID would have been dead over a year ago now,
if not sooner, given that it is basically a RAID5 with checksums.

Sure, RAID6 doesn't give you checksums, but it gives you double parity, which is almost as good: if one data point out of 3 is wrong (1 storage + 2 redundant disks), there are still 2 right against 1 wrong, while with RAID5, if one data point out of 2 is wrong (1 storage + 1 redundant disk), it's 1 against 1 with no way to tell which is wrong…
So, as Wendel explains in the video I linked, it will consult SMART data and simply guess based on that… do you really want your data stored in a way where it basically flips a coin to figure out which data is right and which is corrupt?

Sure, it works pretty okay if you always replace broken drives immediately; if a drive gives you bad SMART data, that's not a bad indicator. But when using older drives, where SMART can be pretty messy, it becomes exceedingly dangerous for your data to live on a RAID5.

You can try to rationalize it however you like, but it's commonly known that RAID5 is unsafe, at least the older types; I'm sure there are good modern variants like RAIDZ1 (the ZFS version of RAID5), thanks to checksums and whatever else.

No. It won’t.


And how do you expect this to happen? I would not do it on purpose. My networks and servers are quite well protected. You can't get in. I even hired companies to try. They failed. Badly… The only way to hack me for now is… OK, 2 ways… If you are a ghost or some insane mage and possess the hardware (not joking), or if you have a QPU (Quantum Processing Unit), which is not widely available and is not what is being displayed to the public at all… Though, things change, hardware becomes irrelevant and so on; for now I am OK. We could discuss more of this on Discord or in a separate thread - like how people secure their networks and nodes. Would be great.

The bottom of the rack is full of cold spares just in case, so I guess that is covered.

You got me there. See the examples above? I would rather go with RAID 6 now in that case. I cannot say the same about machines with 25 SFF disk slots… There, a RAID 60 made of five 5-disk parity sets would make more sense than RAID 6 or RAID 50, or even RAID 10 + spare.

So back to the question… A block size of 64 KB is sufficient for most RAID configurations. Approximately how large is the average STORJ node write, after all? We could then easily figure out the best possible stripe size. The block size could be optimal when set to that value divided by the number of non-parity disks in the RAID array, so that each write is parallelized across all available non-parity disks. STORJ staff, please??
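
As a sketch of that arithmetic (the 64 KB average write and the 12-drive RAID 50 layout are assumptions taken from this thread, not figures from STORJ):

```python
# Per-disk strip size needed so that one average write fills a full stripe.
# All numbers here are assumptions for illustration, not STORJ-provided figures.

avg_write_kib = 64          # assumed average node write size
data_disks_per_group = 5    # a 6-drive RAID5 group in a 12-drive RAID 50 has 5 data disks

ideal_strip_kib = avg_write_kib / data_disks_per_group
print(f"ideal strip size: ~{ideal_strip_kib:.1f} KiB per disk")   # ~12.8 KiB

# Controllers typically only offer power-of-two strip sizes, so you would round
# to 16 KiB, giving a 5 * 16 = 80 KiB full stripe for this hypothetical layout.
```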