Suppose the following:
A node has been running with 100% availability for 24 months.
In the 25th month, one of the disks fails and the SNO takes the node down for 12 hours to rebuild the disk array (rebuild from parity).
After being offline for 12 hours, the SNO restarts the docker container using the same identity, same storage-dir, same everything.
Would this be a good strategy to handle a failed disk?
Would the node be DQ’d for being offline for 12h (exceeding the SLA of 5h of downtime per month), even after 2 years of 100% availability?
If a small number of files (example: 5 out of ALL the storage/ directory files) could not be recovered from parity, could the node still resume with its original identity having most of the files?
I’m trying to figure out the best way to handle a future disk failure and still preserve the node.
I read about the strategy of running “one node per disk”, and discarding the node if a disk fails, but this seems wasteful if 99.9% of the files can be recovered using a disk array with parity.
“Parity is a waste of space,” you say? In my setup, I already have parity protection for my personal files, so I’ve already “lost” the space. I can protect the storj directory “for free”, but the rebuild will take ~12 hours.
You could do that. The disqualification for downtime is currently disabled. However, we implemented the online tracking system, and it’s working as described here:
As soon as we collect enough stats, the disqualification for downtime will be enabled.
FYI - with today’s disks the rebuild could take days and could end with a dead array with 98.86% probability:
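Figures like that usually come from a back-of-the-envelope URE (unrecoverable read error) calculation. A minimal sketch, assuming a consumer-disk URE rate of 1 error per 10^14 bits read and that a single URE aborts the rebuild; both assumptions are exactly what gets challenged further down the thread:

```python
import math

# Back-of-the-envelope version of the classic "RAID5 rebuild will fail" claim.
# Assumptions, not facts about any particular array: a consumer-disk URE rate
# of 1 error per 1e14 bits read, and that a single URE aborts the rebuild.
def rebuild_failure_probability(data_read_tb: float, ure_rate_bits: float = 1e14) -> float:
    bits_read = data_read_tb * 1e12 * 8        # all surviving disks must be read in full
    # P(at least one URE) = 1 - (1 - 1/ure_rate)^bits_read
    return -math.expm1(bits_read * math.log1p(-1.0 / ure_rate_bits))

# Rebuilding after losing one disk of an 8 x 8 TB RAID5 means reading ~56 TB:
print(f"{rebuild_failure_probability(56):.2%}")   # ~98.9% under these assumptions
```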
That depends on how much data is lost. Your audit score will be affected either way. If the audit score drops below 60%, the node will be disqualified.
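Purely to illustrate how a score with memory can fall below that threshold after data loss, here is a sketch of a generic forgetting-factor reputation score. The formula and constants are assumptions made for the example, not Storj’s actual satellite-side implementation; only the 60% disqualification threshold comes from the post above.

```python
# Illustrative sketch only: a forgetting-factor reputation score with made-up
# parameters. The real satellite-side model and its constants may differ.
LAMBDA = 0.95          # memory of past audits (assumption)
DQ_THRESHOLD = 0.6     # node is disqualified below this score (from the post above)

def update_score(alpha: float, beta: float, audit_passed: bool) -> tuple[float, float]:
    alpha = LAMBDA * alpha + (1.0 if audit_passed else 0.0)
    beta = LAMBDA * beta + (0.0 if audit_passed else 1.0)
    return alpha, beta

alpha, beta = 20.0, 0.0                 # a long history of successful audits
for passed in [False] * 12:             # a streak of failed audits after data loss
    alpha, beta = update_score(alpha, beta, passed)
    score = alpha / (alpha + beta)
    if score < DQ_THRESHOLD:
        print(f"disqualified at score {score:.3f}")
        break
```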
If your node is lucky, it could avoid disqualification:
In the case of 8 separate nodes (one per disk), you would lose only 1/8 of the data, versus the whole array if the rebuild fails in the single-array case.
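To make that comparison concrete, a toy sketch with made-up numbers; only the structure of the trade-off matters, not the specific probabilities:

```python
# Toy comparison of "one node per disk" vs "one node on RAID5", using
# made-up numbers purely to illustrate the trade-off described above.
DISKS = 8
TOTAL_DATA_TB = 32.0          # hypothetical data held across the 8 disks
P_REBUILD_FAILS = 0.1         # assumed chance a RAID5 rebuild does not complete

# One node per disk: a single disk failure loses exactly that node's share.
loss_per_disk_nodes = TOTAL_DATA_TB / DISKS

# One node on RAID5: nothing is lost if the rebuild succeeds,
# everything is lost if it fails.
expected_loss_raid5 = P_REBUILD_FAILS * TOTAL_DATA_TB

print(f"per-disk nodes: lose {loss_per_disk_nodes} TB (1/{DISKS} of the data)")
print(f"RAID5 node:     lose 0 TB or {TOTAL_DATA_TB} TB, expected "
      f"{expected_loss_raid5} TB at an assumed {P_REBUILD_FAILS:.0%} rebuild failure rate")
```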
You can read the full discussion regarding RAID vs no RAID here:
If it is offline longer than 4h, all pieces on the node will be considered unhealthy, so if the number of healthy pieces for the affected segments drops below the threshold, the repair workers will start to repair those segments. As a result, the pointers to your node will be removed. Then, once you bring your node back online, this data will be slowly removed from your node by the garbage collector (it’s initiated by the satellites once or twice per week). In short: the longer the node is offline, the more data will later be moved to the trash. It will be permanently deleted from your node after 7 days in the trash.
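A minimal sketch of that lifecycle; the repair threshold is invented for illustration (real satellites use their own Reed-Solomon and repair settings), and only the 4-hour and 7-day figures come from the post above:

```python
# Sketch of the piece lifecycle described above, with an invented threshold.
REPAIR_THRESHOLD = 52     # assumed: repair starts when healthy pieces drop below this
TRASH_RETENTION_DAYS = 7  # pieces sit in the trash this long before deletion (from the post)

def segment_needs_repair(total_pieces: int, pieces_on_offline_nodes: int) -> bool:
    # After ~4h offline, a node's pieces no longer count as healthy,
    # so they are subtracted from the segment's healthy count.
    healthy = total_pieces - pieces_on_offline_nodes
    return healthy < REPAIR_THRESHOLD

# Example: a segment with 54 remaining pieces, 3 of them on nodes that
# have been offline for more than 4 hours.
if segment_needs_repair(total_pieces=54, pieces_on_offline_nodes=3):
    print("repair workers recreate the missing pieces on other nodes;")
    print("the offline node's copies later go to trash via garbage collection,")
    print(f"and are deleted permanently after {TRASH_RETENTION_DAYS} days in the trash")
```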
So, kind of “yes”
So, given the state of my node (array rebuilding, node slow) and the network, I should keep it offline until the RAID rebuild finishes, then turn it back on (reduce space to prevent ingress??), and let it cook for a week to catch up with all the data cleanups. Then re-advertise the space.
I’ve never had an issue like this when expanding before, but I’ve never experienced growth like this either.
Not necessarily; instead of keeping it offline, you may just reduce the allocation below the real usage. That will stop any ingress, but your online score will not be affected. And you already know this.
No. This has been debunked many times. Think about it — rebuild is what scrub does. Your array does not die after every monthly or biweekly scrub, does it?
There is no need to take the node offline for the rebuild. You can reduce IOPS by decreasing the allocated size temporarily, but I would not bother. What’s the hurry? Let the rebuild take 3 weeks if needed.
And that’s fine. Keep the node running. In fact, as we speak, I’m replacing 4 disks in one of my raidz1 vdevs with larger ones. I did not stop nodes. Everything continues working normally.
Unfortunately, this is only true for ZFS or Synology’s implementation of RAID based on BTRFS. All other solutions, especially hardware ones or native BTRFS RAID, may fail.
Even my favorite LVM…
The only difference is the handling of bit rot: when a disk returns bad data without reporting an error, conventional RAID has no idea which copy of the data to trust, and picks the wrong one in 50% of cases. Bit rot is pretty rare; disks have internal CRC checks too.
But this is not specific to rebuild or scrub scenarios, and can happen during normal use.
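A toy sketch of that difference, using a two-way mirror as the example (hypothetical data, standard library only):

```python
import hashlib
import random

copy_a = b"good data"
copy_b = b"g00d data"   # silent bit rot on one mirror leg, no read error reported

# Conventional mirror: the two copies disagree, but without a checksum there is
# no way to tell which one is correct, so the pick is effectively a coin flip.
picked = random.choice([copy_a, copy_b])

# Checksumming systems (ZFS, BTRFS) store a checksum alongside the block,
# so they know which copy is good and can rewrite the bad one.
stored_checksum = hashlib.sha256(b"good data").digest()   # written with the original data
good = copy_a if hashlib.sha256(copy_a).digest() == stored_checksum else copy_b
print(picked, good)
```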
Outside of that difference, conventional RAID5, ZFS-based redundant vdevs, and Synology’s smart plumbing for BTRFS around md all support scrub.
A scrub must be run periodically to ensure data viability. And during a scrub, exactly the same IO and calculations are performed as during a rebuild.
And yet, RAID5 arrays don’t die after every scrub. Hence, those stats don’t make any sense.
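A tiny sketch of why the work is identical, assuming simple XOR parity as in RAID5: a scrub recomputes the parity to verify it, and a rebuild recomputes the same XOR to reconstruct the missing block.

```python
# RAID5-style XOR parity on one stripe: scrub and rebuild do the same XOR work.
def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]          # data blocks of one stripe
parity = xor_blocks(*data)                  # written when the stripe was written

# Scrub: read every block and verify the parity still matches.
assert xor_blocks(*data) == parity

# Rebuild: one data block is lost; reconstruct it from the rest plus parity.
lost = data[1]
rebuilt = xor_blocks(data[0], data[2], parity)
assert rebuilt == lost
```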
Yes, of course. However, while the failed disk is being replaced, the former RAID5 becomes RAID0-like with degraded performance, and there is higher load pressure on the remaining disks because of the rebuild, so the probability of a second failure (or at least of finding bit rot) becomes much higher. I’m just sharing my own experience, and disks were much smaller back then.
Yes. And sometimes it helps to detect an issue earlier, or, in the case of systems with checksums, to also fix it.
My point is that the IO pressure from a scrub and a rebuild is exactly the same.
Why? Each individual disk does not know whether those reads are because of a scrub or a rebuild. Why would a rebuild have a higher chance of failure than a scrub?
Have you tried to rebuild, not scrub, a typical hardware RAID5 without removing the regular load?
If not, you probably will never understand me.
A scrub is first of all a check; a rebuild is not.
When you rebuild, it works differently and adds a lot of load if you didn’t remove the regular load. A degraded RAID5 is crawling, not working; there is a significant impact on every single operation. I do not have any numbers at hand, that was a long time ago, but I do not believe anything has really improved since then.
Software RAID is a completely different story; you may get very good performance even in degraded mode (e.g. 2 of 3 RAID5 disks + 1 rebuilding).
I don’t disagree that many hardware RAID solutions are horrifically slow under load, but that’s true for both rebuild and scrub.
But yes, I was implying software RAID; there is no reason to use hardware RAID today.
Hah, this is interesting. As a non-techie, I would expect bespoke hardware to be a lot faster than a software implementation…
If hardware RAID is so much slower why would you bother with it in the first place?
Hm, it’s faster at its primary function when all HDDs are OK.
…because it doesn’t use your CPU and the limited IO bandwidth of the bus to get the job done…
…They also usually have a power backup in the form of a cell battery…
…even a Pi 3 with a HW controller like this can beat my old i7 (2009) at that job…
The line between “hardware” and “software” is very blurry.
What is considered “hardware RAID” is still a piece of software, just running on a different processor. It might be more specialized in some ways, e.g. it may have dedicated RAM with independent power to preserve unflushed data across main system resets, or a hardware accelerator for common operations like parity calculations, but ultimately it is just another little computer.
On one hand, you are right. Since it’s fully dedicated to one task, it can do the job much faster and better than software running on the shared, general-purpose main processor, which may have limited resources and may not be as good at the bulk operations RAID needs; this was indeed the case until recent decades.
On the other hand, everything is cost-optimized, and there is a big difference between enterprise RAID controllers and the ones built into consumer motherboards, which struggle to even maintain their declared performance.
On the third hand, more complex solutions have emerged that go beyond simple RAID with parity, like ZFS, which combine filesystem functions with disk management, caching (using all available underutilized system memory), compression (a lot of compute power), encryption (modern CPUs have encryption engines), and even deduplication (again, massive RAM requirements); with pretty much unlimited CPU performance available, they become much more attractive and flexible than a specialized RAID adapter.
Maybe there are still some niche applications or legacy systems where a basic but dedicated RAID is still acceptable, but for most users, modern solutions like ZFS are “more better” in every respect.