ZFS discussions

if you are that worried about it, why not just run raid 10
then it’s unlikely to fail in like a million array years, and you only add 33% cost going from a raid 6 with 6 drives to a raid 10 with 8 drives, while keeping the same capacity…
you can add the drives in pairs, which gives you the ability to use many different types of drives, and you can scale and resilver with no trouble at all…
and because it’s mirrors it’s basically backup… even tho it is best to have one offsite also xD in case of lightning/fire/natural disasters… and whatnot…

anyways, just a thought… lol now i’m sitting here convincing myself to do a striped mirror…
also the read and io on raid 10 is insane… sure raid 6 gets more sequential write speed… but really… with like 4 drives you basically get double single-disk write speed and 4x read, and the same multipliers on the IO i believe.
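
here’s the napkin math in python just to make the assumptions explicit (drive size and per-drive speed are made-up placeholders, not measurements):

```python
# rough comparison of the 6-drive raid6 vs 8-drive raid10 idea above
# (drive size and per-drive speed are assumed placeholders, not measurements)
DRIVE_TB = 6
DRIVE_MBS = 150

raid6_drives, raid10_drives = 6, 8
raid6_usable = (raid6_drives - 2) * DRIVE_TB        # two drives' worth of parity
raid10_usable = raid10_drives // 2 * DRIVE_TB       # half the drives are mirrors

extra_cost = raid10_drives / raid6_drives - 1       # 8/6 - 1 ~ 33% more drives
raid10_write = raid10_drives // 2 * DRIVE_MBS       # stripe across 4 mirror pairs
raid10_read = raid10_drives * DRIVE_MBS             # every drive can serve reads

print(f"usable: raid6 {raid6_usable} TB vs raid10 {raid10_usable} TB")
print(f"extra drives for raid10: {extra_cost:.0%}")
print(f"raid10 theoretical: {raid10_write} MB/s write, {raid10_read} MB/s read")
```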

RAID10 can be less reliable than RAID6.
4 drive raidz2 is more reliable, but slower than 4 drive RAID10 (both have the same capacity).
6 drive raidz3 is more reliable, but slower than 6 drive RAID10.

raidz-x is guaranteed to survive x failed drives, while a y-drive raid10 is only guaranteed to survive 1 failed drive and has a 1/(y-1) chance of failure after a second drive fails.

6 drive raid10 has 20% chance of failure when a second drive fails. 8 drive raid10 has 14%. 6 drive raidz2 has 0% chance of failure when a second drive fails.
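
The 1/(y-1) figure is just counting: after one drive in a mirror pair dies, exactly one of the remaining y-1 drives is its partner. A quick sketch (Python, purely illustrative):

```python
# chance that a *second* random drive failure kills a y-drive raid10:
# the pool only dies if the second failure hits the mirror partner of the
# drive that already failed, i.e. 1 out of the remaining y-1 drives.
def raid10_second_failure_fatal(y):
    return 1 / (y - 1)

for y in (4, 6, 8):
    print(f"{y}-drive raid10: {raid10_second_failure_fatal(y):.0%} fatal")
# a raidz2 (or raidz3) vdev survives any second (or third) failure: 0% by design
```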

RAID is never a backup.

That’s nice, but backing up the node is not possible.

i think you forgot to take resilvering time into consideration…
granted, with low drive-count arrays like we both are using it’s not a huge consideration.

A 13 drive raidz2 (or rather, the server has two raidz2 vdevs - 10 drive and 13 drive) has a rather fast resilvering time as well. I do not think a raid10 pool would be any faster. Maybe. I have not tested this.
The more drives you use in a raidz type vdev the less reliable it gets, because the chance of two failed drives in a 100 drive vdev is much greater than the chance of two failed drives in a 6 drive vdev. At the same time the probability of raid10 failure when a second drive fails goes down.
Of course in a raid10 vdev after one drive failure the chance that the “wrong” drive will fail is a bit higher than the chance that some other drive would fail, especially in a read-intensive setup, because the remaining drive gets more reads.
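
A rough way to put numbers on that, assuming independent failures, a 1-2% annual failure rate and a roughly one-day resilver window (all assumptions, not measurements):

```python
from math import comb

# P(at least 2 of n drives fail inside one resilver window), assuming
# independent failures; AFR and window length are assumptions, not data.
def p_two_or_more(n, p_window):
    return sum(comb(n, k) * p_window**k * (1 - p_window)**(n - k)
               for k in range(2, n + 1))

afr = 0.02            # assumed 2% annual failure rate per drive
window = 1 / 365      # assumed ~1 day resilver window
p = afr * window

for n in (6, 13, 100):
    print(f"{n}-drive vdev: P(2+ overlapping failures) ~ {p_two_or_more(n, p):.1e}")
```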

i would like to move towards getting 2 redundant drives also, and then maybe have a global hot spare supporting a few vdevs.
but to keep costs down i decided to just do raid 5… with 5 drives i don’t really feel like i have another choice, but if i had like a 10 drive vdev or more then i would have moved towards 2 drives of redundancy.
1-2% annual failure rate per drive, so 5-10% yearly odds of a failed drive in my vdev, and then it takes like a day to resilver at current capacity.
so 5-10% chance and then 1/365th of 5-10% odds of failure while resilvering… ofc that can be higher if the drives are more stressed, but my array is vastly faster than my network, so in theory i will always be able to resilver with moderate use and in good time… i would hate to imagine how it would look if i had 90% utilization of the array all the time… that would leave 10% to the resilvering, so 1 day would be like 10 days + much more strain on the drives…
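
roughly the same napkin math in python, just so the assumptions (1% AFR, 1 day resilver) are spelled out:

```python
# napkin math for a 5-drive raidz1: annual chance of losing the vdev
# ~ P(a drive fails) * P(one of the 4 survivors fails during the resilver)
# AFR and resilver time are assumptions, not measurements
afr = 0.01            # assumed 1% annual failure rate per drive (try 0.02 too)
drives = 5
resilver_days = 1

p_first = drives * afr                                  # ~5% per year at 1% AFR
p_second = (drives - 1) * afr * resilver_days / 365     # survivor dies mid-resilver
print(f"P(a drive fails this year)         ~ {p_first:.1%}")
print(f"P(2nd failure during the resilver) ~ {p_second:.4%}")
print(f"P(losing the vdev this year)       ~ {p_first * p_second:.5%}")
```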
so yeah i feel pretty safe with raid 5, but i’ve been used to just working with single drives in windows for many many years… but that does give a slight degree of bitrot over decades of data storage.
which was really what i wanted to combat by moving to zfs… and then why not get some redundancy also, because i have had disk failures in the past… besides it’s barely a production machine… storj doesn’t pull that much… tho i could use a lot more IO it seems…

i guess there isn’t any perfect solution… just slightly different versions of crappy storage lol

raid5 with large drives is not recommended. The reason is that all drives have some chance of an unrecoverable read error, even though the drive is otherwise perfectly functioning. If that happens during a rebuild of a large raid5 array (and it has a good chance of happening) the file that was stored in that sector will be corrupted.

i wouldn’t call my array large… i literally have 5 drives, it doesn’t get much smaller than that really
with 5 drives in raid 5 i might run it for 1000 years without failing, so long as i replace a failed drive immediately and don’t allow it to spend more than a day on a resilver

maybe a big raidz3 is the way to go… but i don’t want to think about resilvering on those

50 drives, that’s 50-100% odds of a drive failing per year… and then a full read of the 49 remaining drives to rebuild from the parity, so basically a full read of the entire array.
let’s say 70MB/s per drive for a day in the case of my 6tb drives, which is decent performance if we imagine load on the array also… so 70MB/s x 49 drives is roughly 3430MB/s of reads on the array for 24 hours to resilver a drive…

and that’s not taking into consideration that there might also be some IO limitations, internal bandwidth or such in the system.

well both my memory and cpus can keep up with that… ofc now we get to the bottlenecks… the hbas and pcie v2.1 x8 would be 4GB/s, but i can split it over two HBAs so not impossible… but getting closer to the actual limits of what that old box can take… and i would have to connect 36 of the drives over 4-lane (4i) SAS links at 6gbit per lane, which according to lsi gives a typical speed of 2200MB/s
first bottleneck, and really it’s not too bad… running 48 drives with my current system wouldn’t be too problematic… wow

and it should resilver at like 66% of regular speeds, if the io doesn’t fuck it up…
ofc that’s considering an idle array… at 90% workload we get into the 20-30 day range if we assume these numbers are a bit high… which makes it roughly 1/12th of the 50-100% yearly odds for another drive to fail during the resilver… but with raidz3 one should be okay… if one can process the math then… which might be tough
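
rough bandwidth sketch for that hypothetical 50 drive raidz3, using the assumed numbers from above (nothing here is benchmarked):

```python
# very rough resilver estimate for a hypothetical 50-drive raidz3
# (per-drive speed, drive size and bus limits are assumptions from this post)
drive_tb = 6
per_drive_mbs = 70          # assumed sustained read per drive under load
surviving = 49              # drives that get read to rebuild the failed one

aggregate_read_mbs = surviving * per_drive_mbs      # ~3430 MB/s of reads
rebuild_hours = drive_tb * 1e6 / per_drive_mbs / 3600

sas_4lane_mbs = 2200        # lsi's "typical" figure for a 4-lane 6Gb SAS link
pcie2_x8_mbs = 4000         # rough PCIe 2.0 x8 ceiling

print(f"aggregate read during resilver: ~{aggregate_read_mbs} MB/s")
print(f"rewriting one {drive_tb} TB drive at {per_drive_mbs} MB/s: ~{rebuild_hours:.0f} h")
print(f"fits under PCIe 2.0 x8 (~{pcie2_x8_mbs} MB/s) if split over two HBAs,")
print(f"but a single ~{sas_4lane_mbs} MB/s SAS link would be the bottleneck")
```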

IIRC it is not recommended to use RAID5 with drives bigger than 1TB or so.

i learn stuff the hard way xD…

there are multiple parts of this… maybe i’ll just stay with 5 drive vdevs and just add more…
i kinda like how fast it is… and even with 12tb drives i could resilver in less than 2 days, and 2/365 of 5-10% is like 0.03 to 0.055% odds of failure in an avg year… per raidz1 vdev

from my view, other problems like power outages, internet loss, lightning and whatnot become more of a danger until one reaches a good number of vdevs
and i get all that sweet performance xD

Not exactly.
For example, in the datasheet for the UltraStar HC310 it says that the unrecoverable bit error rate is 10^-15. So, on average once every 125TB read there will be one unrecoverable error on a working drive.
For a WD Red it’s 10^-14, so on average once every 12.5TB read.
This is because nothing is perfect, some stray magnetic field may corrupt the sector etc.

So when you have to rebuild a 100TB RAID5 array, how likely is it that at least one drive will encounter such error at least once?
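
A rough estimate, assuming the datasheet error rates and independent errors:

```python
from math import exp

# P(at least one unrecoverable read error while reading tb_read terabytes),
# using a Poisson approximation: expected errors = bits_read * bit_error_rate
def p_ure(tb_read, bit_error_rate):
    bits_read = tb_read * 1e12 * 8
    return 1 - exp(-bits_read * bit_error_rate)

for rate, label in ((1e-14, "10^-14 class drive"), (1e-15, "10^-15 class drive")):
    print(f"{label}: P(URE during a 100 TB rebuild) ~ {p_ure(100, rate):.0%}")
```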

well zfs can deal with a bit error… so that shouldn’t be a problem… i know the math might not work out… but i’m going to give it a try… see what happens…
if i don’t use all of the vdevs on the same pool, but split them up over multiple storj nodes when i get a few, then i should be able to gauge how bad it actually is, and i kinda assume the people that invented raid 5 as a concept knew what they were doing… xD assuming that it’s called raid 5 because it’s basically 100% stable with 5 disks if the drive is replaced immediately and i suppose using proper hardware…
and i suppose the size of the drives comes into play… like say 12tb drives would be 4 days to resilver… also you assume the degraded array doesn’t have a 5th drive… in most cases a drive will throw errors and give time to be replaced long before it goes down… so a bit error on the rest could be fixed by the parity from the degraded drive while resilvering…
and if not then zfs can take it, and then on top of that… most of my stuff presently is basically 40% home media center and 60% storagenode, which also can handle a lost file…

so it’s not 1 error that worries me too much… it’s a total collapse of the array, also zfs doesn’t fail like regular raid does… in a regular raid 5 if you inject false data it will break down completely and 50% of the stored data is unable to be recovered…

zfs however keeps such issues more localized, so granted you may get some corruption, but that happens on regular hard drives every day… and the OS figures a way around it… or it’s streaming data, in which case 1 bit or 1 byte or 1kb… is rarely enough for people to notice.

but yeah there are many many things to take into account… and one shouldn’t use raid without at least understanding, to a high degree, the risk one might be putting on one’s data… like say a regular raid 5 is actually less secure than data on a single disk… because bitrot can eat the damn array… if one doesn’t keep a proper eye on it…
but it can handle a disk failure, if it fails without causing trouble…

i often pull my drives and resilver… haven’t had any real issue yet… but haven’t run raid for more than a few years… almost took me longer to research what and how i wanted to run it, than i have been running it lol…

How does zfs deal with read errors? By reading from the parity/mirror drive. If there is no mirror or parity drive (as is the case with a rebuilding raid5), then the read error means some file gets corrupted.
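
The parity itself is just XOR, so a toy sketch shows why a healthy single-parity stripe can rebuild one missing piece while a degraded one has nothing left to absorb a further read error (illustrative only, not how ZFS is implemented internally):

```python
from functools import reduce

# toy single-parity stripe (the same XOR idea raid5/raidz1 is built on):
# parity = XOR of all data blocks, so any ONE missing block can be rebuilt
data = [b"\x10\x20", b"\x33\x44", b"\x0f\xf0"]      # three "data drives"
parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data))

def rebuild(lost):
    survivors = [d for i, d in enumerate(data) if i != lost] + [parity]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

assert rebuild(1) == data[1]    # one lost block: fully recoverable
# lose a whole drive AND hit a read error on another one -> two unknowns,
# one parity equation: that stripe can no longer be reconstructed
```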

if memory serves then zfs also keeps a crc of the files, which allows it to fix 1 bit error per file if needed

CRC cannot recover an error, just show that there was an error.
It won’t be one bit. Hard drives use ECC to correct read errors. When that fails, the whole sector (512B or 4KB) is returned corrupted (or just a read error is returned and no data).

CRCs can be used for error correction, it’s why normal hard drives don’t make data errors all the time and how we optimize data transfers over cables, else we would get too much overhead.

it’s basically in everything digital and added liberally, zfs adds a few extra layers of these…

and if we imagine a larger nested (or whatever it’s called) raidz1
where one can do like say 5 drive vdevs, then do 5 of those vdevs and make the 5th vdev the redundant one.
ofc nesting the arrays comes with additional storage capacity penalties, but one retains higher performance, faster resilvering time, and the ability to restore an entirely failed vdev.

it’s the way to scale, you cannot avoid failures, only retain the ability to correct them…

lol i wonder if one can run a global hot spare vdev lol

ZFS with a single drive cannot recover a read error. It won’t be a single bit that’s bad, it will be an entire sector that cannot be read (or corrupted).
So, it stands to reason that a raidz vdev with one missing drive cannot recover a read error either.

that’s not what other people say… but i really cannot say… i’ve only known about zfs for a few months… xD also tried to dig a bit into that claim, and it seems that zfs can repair data corruption even when running a single disk… or so they say… i always kinda assumed it was crc since it’s so everywhere… i mean you can barely pull a cable without getting crc with it these days… and why would one want to :smiley:

CRC is almost never used to recover errors, just detect them. Be it on Ethernet frames or anywhere else.

parity is sort of CRC, for some reason darpa in their infinite wisdom decided that packets with errors should be discarded… seems kinda weird, tho maybe the hardware layer uses the crc to make minor corrections if needed and thus TCP can just throw the packet out if it’s wrong, because that means the hardware already failed at fixing it…

i just know that it’s very difficult to send data over any meaningful distance without CRC or the like
if memory serves even an lcd display will use something like crc to make sure the data coming through its connection is correct… so it’s a very fundamental thing…

maybe there was some sort of security issue with using data that was corrected, or they simply wanted to be able to attack networks early on without people realizing…

xD but yeah crc is basically parity math and thus is used all over because it is mathematically one of the cheapest ways to correct tiny errors… ofc the higher you go the less you need to know about the noise in the cables or inconsistency of bit storage… because that’s not really relevant to what one is doing, it’s only relevant when it fails, which is why we mesh them on multiple levels…

like say raid 5, then multiple raid 5 arrays forming a raid 5 of arrays… it’s basically what i decided to build my storage solution on…

it gives us many different directions and scales of parity, thus no singular point is critical, and even multiple points or huge areas can be destroyed and the setup, tho more exposed, would still function.

maybe i’m just still ignorant of how bad this actually is and am still waiting to learn the hard way lol… it happens xD

Not CRC, which is used only for error detection, but various other algorithms (like RS that Storj uses). Something like DOCSIS or GPON uses error correction, while Ethernet doesn’t. Error correction requires more processing power and Ethernet is supposed to be used with good cables, so the data arrives intact anyway. DOCSIS has a noisy channel so it needs the error correction.
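
A toy illustration of the difference, with a CRC for detection and a trivial 3x repetition code standing in for real FEC like RS or LDPC (illustrative only, not Storj’s actual encoding):

```python
import zlib

msg = b"storj piece data"

# detection: the CRC notices the flipped bit but cannot say where it is
crc = zlib.crc32(msg)
corrupted = bytes([msg[0] ^ 0x01]) + msg[1:]
print("CRC still matches after corruption:", zlib.crc32(corrupted) == crc)  # False

# correction: a toy 3x repetition code (standing in for real FEC like RS)
# carries enough redundancy to vote the corrupted copy back to the original
copies = [bytearray(msg) for _ in range(3)]
copies[0][0] ^= 0x01                                    # corrupt one copy
repaired = bytes(max(set(col), key=list(col).count) for col in zip(*copies))
print("repaired == original:", repaired == msg)         # True
```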