Data loss, using raid

yeetloaf · March 16, 2023, 2:35pm

Hi, is there any info on what happens in the case of data loss, for example when a whole drive dies?
I have seen some pictures of setups with single external drives, that would likely fail as a whole.

Does the node then just die?
Is the node operator punished in some way?

In the same direction: is it recommended to RAID drives to mitigate failures?
If there is no punishment, you would throw away 50% of the capacity (RAID1) for loss prevention that maybe is not even needed?

Thanks for all input

Knowledge · March 16, 2023, 2:54pm

If the data is lost, the node will fail audits and eventually be disqualified. The operator can start a new node. The operator is not impacted other than the loss of the escrow that goes to other nodes rebuilding the data that was lost.

As for RAID… One node with redundant storage. Or two nodes with twice as much data. I would do two nodes, but some do RAID. Some RAID configurations are not recommended for use with Storj.
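To put rough numbers on the "one mirrored node vs two single-disk nodes" trade-off, here is a minimal sketch. It assumes two identical disks, an independent per-disk annual failure probability, and no disk replacement during the year. The 4 TB size and 4% failure rate are illustrative placeholders, and the function name is hypothetical. It ignores the held amount and how long nodes take to fill.

```python
def expected_surviving_tb(disk_tb: float, afr: float) -> dict[str, float]:
    """Expected node data still intact after one year, for two ways to use two disks.

    Assumes independent disk failures with annual probability `afr` and no
    replacement or rebuild during the year (a deliberate simplification).
    """
    # RAID1 mirror: one disk's worth of usable space, lost only if both disks fail.
    raid1 = disk_tb * (1 - afr ** 2)
    # Two separate nodes: two disks' worth of space, each node lost with probability afr.
    two_nodes = 2 * disk_tb * (1 - afr)
    return {"raid1": raid1, "two_nodes": two_nodes}


result = expected_surviving_tb(disk_tb=4.0, afr=0.04)
print(result)
```

Under these assumptions the two-node layout keeps more data on average, while the mirror makes losing a whole node much less likely. Which matters more depends on how painful losing an old, full node is.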

arrogantrabbit · March 16, 2023, 2:55pm

Prior thread on the same topic: "RAID vs No RAID choice"

yeetloaf · March 16, 2023, 3:37pm

Thanks for the links, especially the discussion about raids. That about answered all my questions/concerns!

SGC · March 16, 2023, 3:50pm

disks rarely fail catastrophically… so usually one can tease the data out of them… just takes a while and might have some errors.

the problem often becomes that one doesn’t notice the errors before the damage is too much.
audits should basically never drop… if they do on a single disk node, one needs to verify that the disk is okay.

this allows for time to switch to a new disk, before the disk errors become too much for the node to survive.

i have many bad disks i’ve taken out of my setup, which now work as media storage and such in mirrors and work fine when they don’t have to do 24/7 workloads.

that’s the most common disk failure i see… disks starting to throw bad data when having 24/7 workloads.

BrightSilence · March 16, 2023, 8:26pm

I have both been very lucky and very unlucky. While I’ve had only 3 or 4 HDD failures in the last decade (with anything from 14-20 HDDs running at all times), all of my failures have been instant death. Fortunately all of them were in a RAID setup, so I haven’t had any node fail, despite running single disk nodes as well. These numbers are definitely not representative though. The sample size is just too small.

SGC · March 17, 2023, 1:03pm

well a raid would most likely kick the disk out… so it would seem like an instant death…
zfs does that also but i usually just shove them back in
at least until i get them replaced… i’d rather have an unstable disk than no disk in a zfs raid…
not sure about the others but for raid6 it’s certainly the same.

out of the 5-6 disks i’ve replaced in the last couple of years, none of them was instant death and most of them are still storing data, just out of 24/7 operation… had one that died afterwards… just stopped completely… but it was already throwing errors for a while before that…

also most of my disks that went bad were due to me causing shocks to them… such as bumping into the unstable table my server was located on for a long time… after a couple of disks went bad, each time due to me bumping the table, i found a better table

i’ve also been using used drives, but the last year i’ve bought only new 18TB’s … which seems to have helped a bit with how often drives go bad… tho i already did lose one 18TB
might really need a proper rack lol, but it’s not really that uncommon for new disks to fail… i think the avg is like 4% or so for the first year… then it’s usually like 2% or less if one gets a good model.

best advice i’ve got for anyone running stuff that works, is don’t change anything…
and don’t touch it… lol stuff tends to just keep working … it’s people that break stuff…

looking at the Backblaze disk AFR, there is a pretty wide gap between some models of disks, some are like 0.5% Annual Failure Rate and others are like 4% excluding the rare 20%+ AFR events that sometimes happen.

so running a single disk node can certainly be good, one might get lucky and run for decades without issues… or the node dies on a new disk in 3 months.
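As a rough illustration of how much the AFR gap matters for a single-disk node, the sketch below compounds an annual failure rate over several years. It assumes a constant, independent yearly failure probability, which real drives do not follow exactly (failure rates change with age). The 0.5% and 4% values are the ones quoted above, and the function name is hypothetical.

```python
def survival_probability(afr: float, years: int) -> float:
    """Chance a single disk is still alive after `years`, assuming a constant AFR."""
    return (1 - afr) ** years


for afr in (0.005, 0.04):
    for years in (1, 5, 10):
        p = survival_probability(afr, years)
        print(f"AFR {afr:.1%}, {years:>2} years: {p:.1%} chance the disk survives")
```

With these assumptions a 0.5% AFR model still has about a 95% chance of lasting ten years, while a 4% AFR model is down to roughly two in three.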

tho my main reason to use zfs with redundancy was to avoid software issues, i figured maintaining stuff and such could be a larger factor with storj software being new.
with zfs and redundancy errors are so unlikely to happen, that stuff should just run forever if the software is good.

the secondary consideration was that it takes years to get nodes to proper sizes, so losing them would be a big setback…

is it worth it to run raid for storj… yes and no…
old big nodes certainly make more sense to run on a raid and new nodes are basically just worthless, so running them on single disks makes the most sense.
also optimal iops out of the hardware from single disk setups…

raid is so restrictive on iops.

BrightSilence · March 17, 2023, 1:25pm

Oh no, I tried. Those disks didn’t do anything anymore. I wouldn’t mind just running a node on it separately to see how long it would last. I had one be kicked out of my drobo after a single read error and that has been running a node for years. I don’t even consider that one a failure as it never had a single issue after. That was more of a drobo failure in my eyes. Which btw, registers the ID of the drive and never accepts that disk back again. (Side note, don’t buy drobo)

Yeah, but they take most HDDs out of rotation after 5 years. Failures go up pretty fast after that. 50% of HDDs don’t survive more than 7 years. Which I simply haven’t seen happen with my own unrepresentative sample. I guess I have just been lucky.
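For scale, the 50%-in-7-years figure can be converted into an equivalent average annual failure rate. This is only a back-of-the-envelope sketch: it assumes a constant rate over the whole period, whereas failures are said above to climb after year 5, so the real early-life rate would be lower and the late-life rate higher. The function name is hypothetical.

```python
def implied_constant_afr(survival_fraction: float, years: float) -> float:
    """Constant annual failure rate that leaves `survival_fraction` of disks alive after `years`."""
    return 1 - survival_fraction ** (1 / years)


afr = implied_constant_afr(survival_fraction=0.5, years=7)
print(f"{afr:.1%}")  # prints 9.4%
```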

SGC · March 17, 2023, 2:02pm

i don’t think that’s what’s happening… afaik they remove the HDD’s from a datacenter when they are at 4½ years of age with still ½ a year of warranty remaining, this allows them to sell them to 3rd party resellers without the reseller having to worry about buying them used, because if they are bad they can get them replaced on warranty.

this way they can get the best possible used prices for their gear, be sure that 3rd parties want them, and then buy new higher capacity disks and thus save on electricity costs and increase datacenter density.

dunno how much data actually exists on running older drives… from the numbers i’ve been seeing used drives do seem to fail more… but rarely catastrophically, got some disks that have been running 24/7 for 10-11 years now…

some claim there usually is a wall and when this is reached for a certain model the disks just die in droves.

BrightSilence · March 17, 2023, 2:28pm

They have some info here on drive survival rates up to 6 years with a projection for year 7: "How Long Do Disk Drives Last?"

dragonhogan · March 17, 2023, 3:15pm

good info, for sure.

I had my first drive failure a couple of months ago. It was a single node running on a raspberry pi 4, usb-connected 4TB 3.5" WD Red. The drive was re-purposed from originally being used in a synology NAS. If I had to guess it probably already had about 2 years of power-on time before I started that node, which was about 1-2 years. The drive failure wasn’t catastrophic, but in the end I couldn’t recover the node data.

At the time it was 1 of 6 nodes I had running at my home. So it only had about 1-2TB of data and I just had to accept that it was gone. I started too many nodes anyways too close together, for my single IP, so there was no need to start a new one for that failed node.

SGC · March 17, 2023, 3:22pm

another thing for consideration might be temperature, i know that my passively cooled disks seem to have a higher failure rate than the actively cooled ones in my disk shelves and server.

here there is yet another backblaze publication about how temperature affects them.
i don’t really have any number of disks to get data like these guys
i have however seen disks that give errors while not actively cooled, which then stop giving errors after i move them to actively cooled, which again indicates just how much this seems to matter.

for people running external HDD enclosures, it might be a wise choice to go the extra mile and try to get active cooled enclosures.
and at the very least not group disks too close together in a confined space, so they cannot breathe.

BrightSilence · March 17, 2023, 4:50pm

I don’t think their data helps much for the average home user as their drives all run in loud, well cooled server setups. My drives are all in actively cooled enclosures and run between 32C and 36C. All outside of the ranges they even tested. I have my synology set to the cool fan setting. You can set it to max, but it just gets too loud and I work at home sitting right next to it, so no… not gonna do that. And I’m not even sure it would drop the temperature by that much. I think in closed external HDDs they will get much hotter though and yeah, when you get over 40C there may be some impact. Especially when you get closer to 60C. But as long as you have any kind of airflow over your HDDs I wouldn’t worry too much about it.

arrogantrabbit · March 17, 2023, 5:07pm

I agree. Neither Backblaze nor that famous Google article is applicable to small-scale and home users.

Anecdotally, I stopped worrying about disk temperatures about a decade ago. My servers are in a non-air-conditioned patio (in Northern California, so kind of a mild-ish climate) and the fan management is configured to minimize noise and fan power draw while keeping disks under 60C (i.e. within the specification in the datasheet).

I did not see any impact on longevity or failure rate compared to before when I was focusing on keeping disks under 32C because I believed and/or misinterpreted those articles.

Even if there is some difference, it must be negligible compared to other factors, such as new disks being more likely to fail than old disks, and the random chance of getting a bad disk. Yes, a couple of dozen disks is a very small sample size, but that’s kind of the point.

SGC · March 17, 2023, 5:14pm

most of my disks run at 20 to 30C while actively cooled… i dunno how much it affects durability of a disk, but i still think it matters.

excessive heat kills most things.

arrogantrabbit · March 17, 2023, 5:37pm

Right; on the other hand — operating range for a device is specified in the datasheet, for most drives upper limit is 60-65 C, so it’s not “excessive”, it’s within spec.

Will the disk that runs at 60C age faster than the one at 30C? Definitely, it’s just physics. Is the difference worth worrying about? Probably not, as the difference is either negligible (other factors matter more) or irrelevant (it does not matter if the drive could work for 20 years at 30C vs only 8 years at 60C if I’m going to replace it at 5 years because it’s obsolete/small).

SGC · March 17, 2023, 5:55pm

true, storage capacity does seem to get obsolete way before it will wear out in 98% of all cases.

thedaveCA · March 20, 2023, 10:24pm

Drives, and their controllers… Especially hardware RAID when they’re working hard such as a rebuild or sustained write.