Raid 5 failed. 3TB node dead. Recovered partially and going through GE thanks to forum

An update on my GE progress:
It seems satellite.stefan-benten.de:7777 succeeded. It’s weird, and seems too good to be true, but I got a Completion Receipt and all. I don’t know how this was possible. I can provide my node ID if the devs want to debug, just ask.

Yes, for a reason. I was a system administrator responsible for 22 branches, each holding a chunk of the central database, and all of them used RAID5. Too many failures. Even with enterprise drives. Even with fairly small ones (less than 1TB). They failed during rebuild because a second drive failed.
So, yes. Please do not use RAID5 or RAID0, unless you like adventures and thrills.
Every time a branch failed, the only way to recover was to unload that branch’s chunk of the database and transfer it to them. Of course we had backups. But they were useless because of the constant flow of data (like in Storj) between the branch’s database and the central database (unlike Storj).
The result was the same: the backup was out of sync, and getting it back in sync was a pain. So it was simpler to unload the chunk and send it to them.

As a result I migrated all branches to RAID10. The problem is gone. Disks keep failing, but not a single branch has broken since the migration (more than 5 years).

I know more than enough :slight_smile:
We have a choice: lose everything (DBs + customer data + escrow), or lose just the DBs and do GE.
I think the choice here is obvious, right?

For obvious reasons, data on new models is not there.
Well, Backblaze is not a panacea. It doesn’t list any of the drives I use at all.

ouch sounds like the wrong job for raid5…
and i kinda think people tend to miss that the drives are acting up… i know i’ve done so repeatedly then all of a sudden 16 days after a drive has failed… i notice lol…
and if they are all the same type of drives, maybe dying of age… then i could see how they might fail and the system doesn’t even notice, because it doesn’t stress the drives until one starts to rebuild, and then they just choke and roll over…

so i will agree that raid5 isn’t safe, but in theory it should be safe enough if correctly maintained and so long as one doesn’t have the capacity to run anything better… with my current setup i can only go raid 5 or mirror… and mirror doesn’t improve my speed, at least for writes, and only has 50% data capacity… with raid 5 in a 5 drive array that’s 80% data capacity and i can lose a drive… even with a raid 6 type solution i would be back near 50% capacity and i would lose performance…
so then i would be better off doing mirrors, because then i can add and remove vdevs more easily with zfs.
so i don’t really see raid 6 as an option on anything less than 3x(8 drive raidz2 / raid6), getting 75% storage capacity… great performance, quick rebuilds of drives… anything less, well i guess 2x(8) could work, but its performance would start to get kinda weak…
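The capacity percentages in this post can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and it ignores filesystem overhead, padding and spares.

```python
def parity_usable_fraction(drives: int, parity_drives: int) -> float:
    """Fraction of raw capacity left for data in a parity array.

    parity_drives is 1 for raid5 / raidz1 and 2 for raid6 / raidz2.
    Hypothetical helper for illustration, not part of any tool.
    """
    if parity_drives >= drives:
        raise ValueError("need more drives than parity drives")
    return (drives - parity_drives) / drives


def mirror_usable_fraction(copies: int = 2) -> float:
    """Fraction of raw capacity left for data when every block is stored `copies` times."""
    return 1 / copies


if __name__ == "__main__":
    print(f"5-drive raid5:  {parity_usable_fraction(5, 1):.0%}")
    print(f"8-drive raidz2: {parity_usable_fraction(8, 2):.0%}")
    print(f"2-way mirror:   {mirror_usable_fraction():.0%}")
```

The fraction depends only on the drive count per array, so 3x(8 drive raidz2) has the same efficiency as a single 8 drive raidz2; the extra vdevs buy performance, not a better ratio.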

Of course RAID5 will save you from a single disk failure. But it will die with high probability during the rebuild, when you replace the drive. The larger the disks, the higher the chance: Hardware configuration and receiving mail with token

well you just said you lost

and they were on the 2014 list where i remembered seeing them, so some part of the logic here fails…
sure you may not be able to get the latest state of the art harddisk technology, but if you want endurance why would you want to… that’s like nasa sending the newest generation ryzen processors up in a satellite they expect to last 40 years…
they don’t do that… they use an older technology which has been proven tough, and has been further redeveloped into a hardened processor costing like 50k to millions of $ a piece.

ofc that’s not an option for us mortals, so we get older verified good models… and then build in some redundancy and make sure the system is well maintained.

you never want new tech for durability… it’s always a gamble long term, and you want to know what you are buying if you want it to last.

@Alexey: The image you posted is for disks rated at one unrecoverable read error per 10^14 bits. Nowadays WD Red drives are in that range (not the Pro version). But e.g. Seagate IronWolf (not Pro) is already rated at 10^15. That means the rebuild success rate for 8x4TB is not 10% but 80%!
10^15 bits is about 113 TiB (125 TB) read before you would expect one unrecoverable error.
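The 10% and 80% figures can be reproduced with a rough model. The sketch below assumes read errors are independent and happen at exactly the rated rate, that a rebuild has to read every bit of every surviving drive, and that TB means 10^12 bytes. Real drives do not fail this neatly, so treat the result as an order-of-magnitude estimate. The function name is made up for illustration.

```python
import math


def rebuild_success_probability(drives: int, drive_tb: float, ure_bits: float) -> float:
    """Chance of reading all surviving drives of a single-parity array without one URE.

    drives   -- total drives in the array, including the one being replaced
    drive_tb -- capacity per drive in TB (10^12 bytes)
    ure_bits -- rated bits read per unrecoverable read error, e.g. 1e14
    """
    bits_to_read = (drives - 1) * drive_tb * 1e12 * 8
    # (1 - 1/ure_bits) ** bits_to_read, computed in log space to keep precision
    return math.exp(bits_to_read * math.log1p(-1.0 / ure_bits))


if __name__ == "__main__":
    for rate in (1e14, 1e15):
        p = rebuild_success_probability(drives=8, drive_tb=4, ure_bits=rate)
        print(f"8x4TB, 1 URE per {rate:.0e} bits: {p:.1%} chance of a clean rebuild")
```

Under these assumptions the 8x4TB case comes out at roughly 11% for 10^14 drives and roughly 80% for 10^15 drives.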

What about the idea of running multiple nodes on a raid5? In case you get a bit error during recovery, maybe not ALL of your space is corrupt and you do not lose ALL nodes at once, but only the one node that has data on the corrupt sector?
Or am I wrong, and bit errors will lead to a total raid loss?

So it is raid6 or no raid!???

That seems way too fast. Only 1 day for GE? Mine took a lot longer… What was your upload speed?
How much data did stefan-benten still store on your node?

I am sure there’s something fishy here. The amount of data transferred is nowhere near the amount stored. Either there are a lot of orphans on my node, or another kind of error occurred.

I started GE before the dbs were rebuilt, figured I didn’t need them anymore. Maybe this caused the problem. However, I was under the impression that the satellite coordinates the exit from its end.

I guess we’ll see when the payout comes.

Hmm… that will for sure be interesting.

There are 2 images in that post. 8x4TB would indeed be around 80% success rate. I’d say that’s still not acceptable, but you may find that reasonable enough. It goes up fast even with those 10^15 disks though.

Excellent!
Let’s remember who said “boy” :wink:

and you did it!

long live raid5 xD

i’ve been fighting a lot with my corroded backplane this last month or perhaps a bit longer…
it’s given me a rich number of errors to evaluate the stability of my raidz1 which is basically raid5…
even tho it should be a tad more enduring in some cases, it very much seems to me that the excessive rebuild times seen in some older arrays could actually be due to bad cables, or to more drives having failed, where due to the low activity of the array the bad or failing drives have managed to stay under the radar…

then when one starts the rebuild the drives kick into overdrive and start dying right and left… or having trouble, thus increasing latency and extending a rebuild that should have taken days into the month range.
this ofc would kill off the remaining poor performing drives in the array and thus it fails…

so i think there would be a good argument to be made that most raid5 arrays in the 6-7 drive or lower range that fail on rebuilds have essentially failed long before the rebuild process is started, they are just zombies… and thus in my view proper maintenance, scrubbing / patrol reads, performance monitoring and alerts should reduce such risks to acceptable levels for a storagenode.

i’m not saying people shouldn’t run 8 disk raid6 instead… i’m just saying 4 disk raid 5 is so much cheaper to set up, and if correctly maintained shouldn’t fail… ofc one should have a spare for when one drive dies… and then one might as well go raid6… ofc a hot spare could be a global hot spare across multiple arrays, which is most likely what i’ll go for when i get my little home brew datacenter a bit more established… not sure if i’ll stick with “raid5”… atm i sort of feel like zfs was made with only running mirrors in mind… which kinda sucks… so painful to reconfigure…

It is and it isn’t. ZFS does a lot more to try and restore any data it can and your entire array most likely won’t fail due to URE’s. You may lose some data, but your array will survive unless another disk fails entirely. I have limited experience with ZFS myself, but I’m pretty sure the two shouldn’t be compared when talking about this issue.

well in gross basics, you got 1 drive redundancy… doesn’t have to be more advanced than that…

drop drive nr 2 or hit uncorrectable errors and you’ve got corrupt data… the entire zfs pool might not die because of a bit of corruption, while the raid5 would get like 50% corrupted…

yes zfs is very different no doubt about that… but the odds should be closely related…

i did read something interesting about RAM which kinda rang true with some of the stuff in relation to raid and failures.


very interesting read, one of the things they note is that ram modules that are connected to each other and experience a correctable error in one module has much higher chances of experiencing an uncorrectable error in a different module…

it seems to suggest that failure is contagious… might not be the best word, but something along those lines… stuff fails because it’s connected to other stuff that is performing poorly…
which i suppose is kinda true, and which could very well be related to the issue with raid5 not being good enough… i mean the math sort of says raid 5 should be good enough for arrays of limited disks, but i don’t have enough experience with it to validate that… this study however does suggest that this kind of failure might be a wider issue,
not restricted to raid, which i find kinda fascinating, and sure, maybe raid6 is the only way to go in the real world… if you want the array to survive…

i just found it so interesting that it seems to apply to storage tech that is so different from hdd’s

if one transposes this dimm issue onto hdd’s in raid arrays, then read errors in one drive could lead to breaking other drives in the array, thus basically killing a raid5… would explain why raid6 is really needed.

i mean for how many decades have people been storing stuff on single harddrives… and sure they may throw an error here and there, but they are pretty damn reliable… so raid5 should be plenty… in mathematical terms

ofc reality doesn’t always want to conform to the mathematical approximations of us humans.

I was saying “Oh boy”, as in “I’m in over my head”, most certainly not calling you “boy”. Just wanted to make this clear :slight_smile:.

Thank you for your advice, it helped a lot, really.

That just means you’re learning… if you’re not confused, you’re doing it wrong.

You are welcome!
Don’t worry, I was just joking. There’s a lack of positivity today, so I’m just trying to bring some smiles :slight_smile:
I’m really glad to hear that you successfully did GE and saved your deposit and the customer data.

To me, the model number ST3000DM001 is synonymous with data loss. Let’s just say that I am never buying a Seagate product again in my life.
