In the current low-ingress, high-deletes environment, is it best to have RAID?

I think we mostly agreed last year that single disk per storagenode was best.
Is this still true?
As a “whale” with 10TB stored on two 12TB disks, am I better off moving to a mirror?

I always said that RAID is better. If your drive dies, how many years will it take you to get back the 10TB?

3 Likes

Really? I thought it was «at least one node per drive».

I don’t think we agreed on anything. :stuck_out_tongue:
But the trend has shifted to lower ingress for now. This makes the time to get data back longer, which shifts the balance a little towards preferring RAID, but only if you have the HDDs to spare anyway. Buying more HDDs just for RAID still sounds wasteful to me, and wasting space that could be used by nodes sounds just as wasteful. Basically, if you expect to be able to fill up the space you have available within a year, then don’t waste space on redundancy. But if you have 3x 16TB lying around… yeah, by all means use one of those for redundancy.

1 Like

I would say new nodes on single disks, older big nodes on RAID.
RAID is also low on IOPS, which is exactly what a storagenode needs a lot of, while bandwidth is where RAID shines and a storagenode barely uses any bandwidth at all…

So RAID is really not great for storagenodes, as its advantages and disadvantages are the opposite of what we as SNOs want…

But to keep a node alive long term, one will for sure need RAID; the only real discussion is how long “long term” is.

What are the actual odds of a node dying in the first, second and third year? I’m sure there is plenty of datacenter data about drive lifespans, though we might be able to just use manufacturer specs.

One thing I find really interesting is how often my ZFS has saved me from errors for various reasons… something as simple as doing a hot-swap or reseating a drive can often cause errors on other disks in my disk shelf.

Also, disks can be weird: I’ve got one drive that acts up when it gets cold, yet if I reseat it, it runs just fine… sometimes HDDs are just weird…

But that isn’t really so strange considering they are basically the magnetic equivalent of a phonograph record; it doesn’t take much to disrupt an HDD.

3 Likes

Some form of redundancy is always good in terms of data preservation. Data loss is not always obvious; it’s sometimes not caught until it’s too late (think about it: if last year’s tax file got lost, would you even know before you have to look at it for next year’s taxes?), so there is no real assurance you’ll catch a failure before you can migrate to a new drive. With RAID and similar systems you have mirrors, error correction codes, parity drives, or other mechanisms that give you a good chance of recovering without a headache.

2 Likes

If one of my two disks breaks now, I lose 4 or 6 TB. That would take a year to get back?
5 TB × 12 months × $3/TBm = $180 penalty + $280 for the replacement drive

Adding a disk and making it RAID 5 would keep the same space and cost $280?

If I do RAID 1, there is no cost now, but I would need to expand in… 6 months, at a $280 cost.
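Spelled out as a quick back-of-the-envelope sketch (everything here is just the assumptions above: ~5 TB average loss, $3/TBm payout, 12 months to refill, $280 per drive):

```python
# Rough comparison of the options above; all inputs are assumptions from this post.
TB_LOST = 5            # average data lost if one of the two disks dies
PAYOUT = 3.0           # assumed $ per TB-month
REFILL_MONTHS = 12     # assumed time to earn that data back
DRIVE_PRICE = 280      # assumed price of one drive

income_lost = TB_LOST * PAYOUT * REFILL_MONTHS     # the $180 "penalty"
print(f"No redundancy, on failure: ${income_lost + DRIVE_PRICE:.0f} (lost income + replacement)")
print(f"RAID 5, up front:          ${DRIVE_PRICE} (one extra disk, same usable space)")
print(f"RAID 1, up front:          $0 (but needs expansion in ~6 months, ~${DRIVE_PRICE})")
```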

Here is where I get stuck though: what if the disk doesn’t break? :smiley:

1 Like

Two notes on that. If a disk breaks, you would have to replace it in either setup, so why add that cost to one option and not the other? I would just leave that cost out.

So, you’d lose $180 on failure, but you would be spending $280 to prevent that loss. Doesn’t really add up, does it?

But the second note has a much bigger impact. You have to multiply that $180 loss by the chance of it actually happening, which on modern HDDs is less than 2% per year. That’s an expected loss of $3.60 per year, which you would be spending $280 to protect against. The math doesn’t add up.
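To put that in a small sketch (the 2% AFR and the dollar amounts are the same assumptions as above):

```python
# Expected annual cost of running without redundancy, under the assumptions above.
AFR = 0.02          # assumed annual failure rate of a modern HDD
INCOME_LOSS = 180   # $ of income lost while the node refills (from the post above)
EXTRA_DRIVE = 280   # $ you would spend on a redundant drive instead

expected_loss = AFR * INCOME_LOSS     # ≈ $3.60 per year
print(f"Expected annual loss without redundancy: ${expected_loss:.2f}")
print(f"Cost of the drive that would prevent it: ${EXTRA_DRIVE}")
```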

Since the node is supposed to be running long term, the hard drive will die eventually. While the AFR of modern hard drives may be less than 2%, as the drive gets older, that number goes up. Will you catch this moment and replace it before it fails? Or will you just replace the drive once its warranty expires to be safe?

For me it’s rather simple: I expect to be replacing drives in my server while still keeping the node. I hope not to be replacing nodes, because I don’t even know if I would start over if my current node failed, since for years the earnings would drop to “not worth it” levels.

Also, while ingress is currently low, some of it is test data, which means it could get even lower.

If I am providing a service that requires reliability and data integrity, then I should do it properly, just as if I were running a small email or web server for someone (even though those can have backups and, as such, can be a bit more tolerant of drive failures).

Chia or similar can be on separate drives, for two reasons:

  1. More space gives me more coins immediately - I can utilize all of the space I have
  2. If a drive fails, I only lose the coins I would have made in the time that it takes me to replace the drive and redo the plots. Neither of which takes years.
1 Like

I guess another scenario is that I could move everything onto one of the disks: set the big nodes to get no ingress and leave one as the ingest node. Then I could back up the big nodes to the spare disk and keep it safe.
Or just sell the spare disk.

This is definitely true, but that actually also works against you. If you are running an extra HDD in your setup, you also need to replace an additional HDD when it fails. So that’s a theoretical annual cost of 2% × $280 = $5.60 a year. And as you said, that cost goes up the older the HDD gets.

For context, btw, the less than 2% chance of failure is sourced from Backblaze statistics. It’s actually pretty much 1% in the more recent reports, but they use enterprise HDDs and they replace HDDs after 5 years of use, so I like to keep some slack in the stats to compensate for that.

But you wouldn’t have a single large node if you had run a node per HDD from the start, so it wouldn’t be a matter of losing everything.

So I think this is where our difference of opinion originates. This seems like more of a principled stance or moral argument, which I understand, but I don’t think it is valid here. You are not individually responsible for the reliability and data integrity of the services provided. You should only worry about what makes financial sense, and when you do that math, it is simply more profitable not to bother with redundancy. If I’m wrong about that, then you should be able to point out the flaw in my calculations. (Wouldn’t be the first time I overlooked something, so please let me know if I did.)

In the end I think the decision should be made on just what is more financially profitable. And currently it seems that if the loss of annual income would be less than it costs to replace an HDD, it can never be viable. If that loss is bigger, it may eventually be viable, but you would still have to earn back the cost of that additional HDD with the difference. And since the loss of income dwindles over time, but the cost of having to replace an additional HDD when it fails in your setup doesn’t, you had better make back that money in the first year or two. I don’t think that would be the case for most setups, and it’s definitely not the case with the options @andrew2.hart outlined.
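As a rough sketch of that break-even logic (the 2% AFR and $280 drive price are assumptions, plug in your own numbers):

```python
# Break-even check for the rule of thumb above. The redundant drive only pays off
# if the expected income loss it prevents exceeds the expected cost of having to
# replace that extra drive itself when it eventually fails.
AFR = 0.02          # assumed annual failure rate per drive
DRIVE_PRICE = 280   # assumed replacement cost of one drive

def expected_net_benefit(income_loss_on_failure):
    avoided = AFR * income_loss_on_failure   # income loss no longer at risk
    added = AFR * DRIVE_PRICE                # the extra drive can itself fail and need replacing
    return avoided - added

print(expected_net_benefit(180))   # negative for the example in this thread
```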

I really like this post, because I never considered that backups can be viable on nodes that never receive data anymore. You would need to make sure that deletes don’t free up enough space for the node to start receiving data again (so basically lower the allocated space to 500GB to prevent that from happening).

That said, the economics for backups are even worse, as you would by definition need twice the space.

1 Like

I run raidz1 with 6 HDDs to optimize, so I only lose 1/6th of my capacity, power costs and such.
And then, because it’s redundant, I avoid data errors, which can cause all kinds of trouble. Even though I run old HDDs, which in many cases have 60k hours on them, I don’t really see a lot of dying disks.
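For reference, a quick capacity-overhead comparison, assuming equal-sized drives:

```python
# Usable fraction of raw capacity for a few layouts (equal-sized drives assumed).
layouts = {
    "single disks, no redundancy": (6, 0),   # (drives, drives lost to parity/mirroring)
    "raidz1 across 6 drives": (6, 1),
    "raidz1 across 3 drives": (3, 1),
    "2-drive mirror": (2, 1),
}
for name, (drives, parity) in layouts.items():
    print(f"{name}: {(drives - parity) / drives:.0%} usable")
```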

So far, in 20 months with 12-18 disks, I’ve only had one fully dead old 3TB disk that I’ve taken out of production. However, I will repeat that I often see data errors for all sorts of reasons; as long as I don’t touch the server it’s fine. Maybe this is loose tolerances on old disks…

Also, I’ve lost count of how often I’ve had to reseat a drive because it spits out errors. I can’t say how often this sort of thing would happen with new drives, but my next batch of disks will be new, so time will tell…

I won’t say a 3-drive RAID is a good solution, because it isn’t: it’s not much better than a mirror in capacity efficiency, and it is much more difficult to work with compared to a straight-up mirror.

A 2% chance per year for a drive to die, maybe… but odds are there will be other problems along the way, such as bad cables, poor connections, or just random weirdness.

And audit failures add up over time on older nodes. But mostly I just hate dealing with the troubleshooting when the software breaks; this is also fairly new software, so it will lack the multiple layers of redundancy built into decades-old code.

I think my solution of paying 1/6th is a good trade-off. Of course it does require a fairly large array, so maybe I’m not the best one to recommend RAID or not, because I feel like it’s well worth it, just looking at how many thousands of errors I’ve had over the 20 months I’ve been an SNO.

Sure, one can most likely get away with running a node for years on a single disk, and it will only get better as StorjLabs improve their code and the node’s emergency/redundancy procedures.

Most likely 2 or 3 years on a node before moving it to a RAID is perfectly viable in the current ecosystem… 1 year for vetting, which is roughly the first couple of TB and fits well on an old 3TB drive or whatever one has to spare… then, once it’s vetted, it would get about 500GB of ingress a month, so 6TB in a year, and thus in year 3 it would start to be in the 10-15TB range and one can start thinking about putting it on a RAID.
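A rough projection of that growth, assuming ~2TB over the first (vetting) year, ~500GB/month afterwards and ignoring deletes:

```python
# Hypothetical node growth: slow first year (~2 TB total), then ~0.5 TB/month.
stored_tb = 0.0
for month in range(1, 37):
    stored_tb += 2.0 / 12 if month <= 12 else 0.5
    if month % 12 == 0:
        print(f"End of year {month // 12}: ~{stored_tb:.0f} TB stored")
```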

Of course, another benefit of running a proper RAID is that your own data will avoid bitrot; if you plan to use your RAID only for Storj, then it’s most likely not very viable.

And really, do we have enough data? All we know is that nodes on single drives die over time; if the odds are low enough, RAID isn’t worth it for anything other than avoiding the troubleshooting of weird data errors…

And one day Storjlabs’ code will be solid enough that even random data errors injected into it won’t crash it or cause issues; maybe then RAID won’t make sense at all.

I would have had much higher management and monitoring overhead with 5 or more nodes. Even if they were just in separate VMs and not separate physical servers.

It does? As the node gets more data, the loss of income if it dies also goes up, unless ingress goes up to compensate. I would need something like 5 years to recover my 20TB. I would need less time to recover if my node only had 1TB.

Hard drives usually get cheaper over time.

But they can be offline, on tape, or even online on a slower SMR HDD.

Probably. I hate doing the same thing over and over again expecting different results. I also don’t like it when a mining or farming setup does not produce any income. I try to avoid this by not having to restart my node. I created it, went through the period of “not worth it” income, and now I earn enough. Doing that again, knowing that ingress is now lower than it was before, would be difficult.

You would if you used VMs, but luckily there is absolutely no need to do that. You can run them in separate containers just fine.

Yes it does. First of all, you’re again ignoring the fact that you wouldn’t lose all 20TB; you’d lose only what is stored on the failing HDD. I don’t know how many HDDs you have, but I’m guessing that would be no more than 4TB (probably less). This would still set you back a few years, but the difference in income wouldn’t be that large, and that 4TB difference gets smaller over time, because nodes with more data are hit with more deletes, which causes their used space to grow more slowly than nodes with less data. This is why the loss of income becomes smaller over time.

Yep to both of those. That’s true.

Yeah, I get this. I think it’s a classic case of loss aversion bias: the perceived impact of the potential loss feels much bigger than the investment you would have to make to protect against it. On bigger nodes like yours you can probably afford to be a little more cautious, even if it may not be the most financially rational decision. After all, it also saves you some hassle.

1 Like

By the way, when I created my node, there was a limit of one node per IP. That limit was lifted after some time.

At the time, the requirements were very strict (no more than 5 hours downtime etc) and if the node died, you had to go to the end of the queue for a new invite. Right now, the uptime requirement is such that if my node was in a remote location, I would be able to go there and replace whatever hardware was broken. With the previous requirement (5 hours), a remote location would mean using a cluster with everything redundant.

And since I use RAID, I can use the free space in the pool for my own files. If I did not use RAID, I would have to dedicate those drives only to Storj.

As for the calculation with probabilities, well, an AFR of 2% does not mean that the drive will last 50 years, even on average.
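A small illustration of that point (the failure rates after year 5 are made-up numbers, only meant to show the shape of the effect):

```python
# Survival probability with a constant 2% AFR vs a (hypothetical) AFR that climbs
# as the drive ages. The ramp-up values are invented for illustration only.
def survival(afr_by_year):
    alive = 1.0
    for afr in afr_by_year:
        alive *= (1 - afr)
        yield alive

constant = [0.02] * 10
aging = [0.02] * 5 + [0.04, 0.06, 0.09, 0.13, 0.18]

for year, (flat, aged) in enumerate(zip(survival(constant), survival(aging)), start=1):
    print(f"year {year:2d}: {flat:.0%} alive at constant AFR, {aged:.0%} alive with aging")
```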
The way I understand this - let’s say I have another external IP (also let’s say that it’s a different location and I cannot use my current server and just create a VM) and want to set up a new node.
I have two options:

  1. Do what I have done for this node: raidz2 with 6 hard drives, then add more drives when needed. Or even a two-drive mirror; with current ingress it would be enough for a while.
  2. Single drive. I would have to monitor it more closely and probably keep a spare drive which I could use to try to copy the data if the “main” drive shows any signs of trouble. If the drive starts acting up (but does not just fail completely), I would have to try to recover the data and then hope I don’t get multiple consecutive audits for the missing files. If my node gets disqualified anyway, I now have to go buy another hard drive (or use the spare one if I had it) and start everything all over again, hoping that “well, this time the drive will last 10 years or will give very clear signs of trouble before actually failing”.

Option 2 sounds like the “more stress” and “more hassle” option, which may save me some money because I would need to buy a second drive later instead of sooner.

Personally, I am definitely for RAID. I use it in combination with LVM. As mentioned here, the hard drive may be OK, but it doesn’t have to be; the HDD may have errors that its firmware may or may not be able to fix on the fly.
Also, I started at a time when the requirements for the node were more demanding than they are now.
I agree it’s a bit more financial overhead, but I’m calm: no stress about what I’ll do if one HDD gets damaged or completely destroyed.
I’d rather pay for two identical large HDDs, knowing that I’m only going to use the capacity of one, but also knowing that when something bad happens to one of them, I haven’t lost my data, my reputation, or my node.
This setup also gives me the ability to increase the node capacity during full operation, without any downtime.

Did you check your held amount? Does it cover the price of a new HDD? Even if you include the time needed to fill up a similar drive when the original one dies?
Of course, if you use this RAID for something else, it doesn’t matter. Or if you don’t care too much about the money.
Sometimes fun can cost more :slight_smile:

1 Like

Yep. Sometimes HDDs stop working or develop errors, and this high redundancy is very resilient :slight_smile:. I have many refurbished HDDs with very unpredictable lifespans :frowning:.

It looks something like this:

  PV /dev/md7    VG default         lvm2 [<3,64 TiB / 0    free]
  PV /dev/md10   VG default         lvm2 [465,63 GiB / 0    free]
  PV /dev/md11   VG default         lvm2 [465,63 GiB / 0    free]
  PV /dev/md9    VG default         lvm2 [<7,28 TiB / 0    free]
  PV /dev/md3    VG default         lvm2 [465,63 GiB / 0    free]
  PV /dev/md6    VG default         lvm2 [1,36 TiB / 0    free]
  PV /dev/md2    VG default         lvm2 [<931,39 GiB / 0    free]
  PV /dev/md1    VG default         lvm2 [931,38 GiB / 900,77 GiB free]
  PV /dev/md8    VG default         lvm2 [465,63 GiB / 0    free]
  PV /dev/md5    VG default         lvm2 [<5,46 TiB / <5,46 TiB free]
  PV /dev/md4    VG default         lvm2 [465,63 GiB / 0    free]
  Total: 11 [21,83 TiB] / in use: 11 [21,83 TiB] / in no VG: 0 [0   ]

md5 is actually my newer mirror (2x 6TB); previously this was 2x 320GB :slight_smile:.

Many of the mirrors are non-recommended combinations like 2.5" + 3.5", or 5400rpm + 7200rpm… but, like yours, this has been working for a few years.

You’re right, I could double my storage capacity, but I would have to sacrifice resilience :frowning:.

So, I want to be sure that my node doesn’t stop working one fine day :slight_smile:.

1 Like