Distributed architecture for SNOs for high availability of storage nodes

Yes, but I think the two things are closely related. If the platform doesn’t allow you to roll back to a previous consistency point (and this is driven by the metadata), you cannot perform live backups.

In my opinion the framework should allow reconstruction of data across nodes in case of a partial failure affecting a limited number of blocks on a node, without the node risking disqualification.

They are related, yes. I was saying that this:

is not correct in the context of what was being asked for. A different type of backup of a different type of data is in the works, but what you quoted doesn’t imply in any way that Storj is working on implementing a backup system for SNOs. (That might be in the works – I don’t know – but it’s not supported by the text you quoted.)

1 Like

It’s doing exactly that: it takes the missing data from the other nodes, pays them from your held amount (the repair service is paid), and reconstructs the data onto other nodes - just not yours.

Why should customers trust a node which managed to lose their data? It doesn’t make any sense.

Customers should trust the Storj network, not the single node. It is the Storj network that makes sure customer data is safe, not the single node. When a customer uploads a block, that block is spread across several nodes, so from a customer’s point of view, if a node loses a block there’s no data loss since the block is replicated - but you know that better than me.

I agree that customers should trust the Tardigrade network, not a single node.
I mean - your node can’t be trusted to the same degree as more reliable nodes which didn’t lose data over the same amount of time.
However, you got the point - a failed node can’t be trusted anymore. The network will tolerate some level of data loss, but when your node’s audit score falls below 0.6 it will be disqualified.
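
To make the 0.6 threshold concrete, here is a tiny sketch of how a forgetting-factor score drifts below it once a node starts failing audits. This is only an illustration of the idea - the weights, the decay factor, and the update rule are assumptions, not Storj’s actual audit-scoring code.

```python
# Hypothetical forgetting-factor reputation score (NOT Storj's exact formula).
# Each audit decays the history by `lam` and adds weight to the pass or fail side.

def update_score(alpha, beta, success, lam=0.95, w=1.0):
    alpha = lam * alpha + (w if success else 0.0)
    beta = lam * beta + (0.0 if success else w)
    return alpha, beta

alpha, beta = 20.0, 0.0            # assumed starting history: a long run of passed audits
for i in range(30):                # now the node fails two out of every three audits
    success = (i % 3 == 0)
    alpha, beta = update_score(alpha, beta, success)
    score = alpha / (alpha + beta)
    if score < 0.6:
        print(f"score {score:.2f} after {i + 1} audits -> below 0.6, node would be disqualified")
        break
```

The point is only that the score has memory: a node can survive a few failed audits, but sustained data loss drags it under the threshold fairly quickly.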

There is no replication of any kind.

1 Like

Setting up your home network for HA isn’t something that is really about software… the main part is that you need double the hardware, because you want failover in case anything breaks.

Sure, there are some later-stage considerations, but aside from having nodes in multiple locations, setting up for HA in a home network is a big task.

Let’s say you do the Herculean work and actually set up your entire home network as an HA network.
Do you also pay your ISP for those kinds of services? Now you have doubled the cost, if not more, because at the very least you generally want two of everything…

And then, even with double the capacity, you will most likely not benefit from it, because you will want it as redundancy.

Great, now you are set… everything is working, you have failover on everything, full replication… and so on and so forth…

And then lightning strikes nearby, taking out local power in the region, maybe even damaging your now at-least-twice-as-expensive gear…

A UPS, you say… okay, okay…
So what about flood damage, break-ins, people digging in the local area and taking out the power…
Sure, as we move down the list these things become more and more unlikely, but the point is that making an HA setup isn’t really about software… it’s partially what I think makes Storj so great.

Imagine the redundancy: so long as you are online, the network should work… the world gets hit by a meteor on top of every Google datacenter… so what, you are decentralized, you just need to log on… xD

If you want a better setup, figure out what is the most likely thing to take down your node next time and avoid it, or simply look back at what has taken it down in the past… or why, hehe…

AI calculates the biggest cause of storagenode downtime and removes it…

The SNO vanishes.

The problem with HA is that you might live in a tornado area, flood area, brushfire area… and really, it’s the problem you didn’t think of that will get you.

But HA-like technology has existed for a while; I would believe that is partially the reason for stuff like multiple NICs in a server, dual CPUs and RAID setups. Put your nodes on the server, and if you want to be extra safe, run your storage separately so that you can connect over external SAS or iSCSI,
allowing you to quickly switch to redundant servers on the network or nearby.

But really, Storj isn’t going to sue you into the bowels of hell for losing your node…
Let’s be realistic: if you have run a node for a year or two, how much downtime have you really had?
Most will barely be over a few days.
Call it roughly 700 days in two years, so let’s say 3.5/700, about 0.5%.
So what are you really paying the extra cost for… to minimize a roughly 0.5% annual chance of your node failing? Sure, if you’ve got 20 nodes, then the chance of one of them failing in a year is more like 10%…
and as high as 10% is, with a bit of luck you can essentially go a decade without hitting those odds.

So essentially you are willing to pay double the upkeep price to reduce a 0.5% chance of failing each year… maybe taking it to 0.1%, because there will always be something you didn’t think of…
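
Just to spell out that back-of-the-envelope math (the 3.5 days, the 0.5%, and the 20 nodes are all the assumptions from above, nothing measured):

```python
# Back-of-the-envelope numbers from the post above.
downtime_days = 3.5
total_days = 700
print(f"downtime fraction: {downtime_days / total_days:.1%}")   # ~0.5%

p_fail_per_node_year = 0.005      # assumed annual chance a single node fails
for nodes in (1, 20):
    p_any = 1 - (1 - p_fail_per_node_year) ** nodes
    print(f"{nodes} node(s): ~{p_any:.1%} chance at least one fails in a year")
```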

(This is almost my currently planned setup - power here is so stable that I won’t need a generator.)
A disk shelf with dual controllers and dual power supplies; a server with dual CPUs and multiple NICs, hooked up to the internet over dual connections, maybe using fiber to mitigate electrical issues over the network.
Run ZFS with vdevs in sets of 8 drives in raidz2 and 3-4 vdevs per pool. Place the gear in the middle of the building, make the room into a Faraday cage, put a plastic coat over the room (for water protection), raise the rack off the floor (water again), make sure there are no pipes in the walls, have a basic UPS for emergency shutdown, and a generator to take over in extended power-out situations.

Then you are most likely at 99.99%; most remaining issues would be software, the SNO, or external factors.
And if we look at it long term, the RAID array will be the first thing to die…
in, let’s say, 10 years if it isn’t overworked. You have double redundancy in each of the 8-drive vdevs; but let’s say you have gone with 3 vdevs instead of 4, the remaining 8 drives are cold + hot spares, and you selected drives with less than a 2% annual failure rate.

So one drive has roughly a 20% chance of dying within 10 years. After 10 years of run time we will have spent… well, 5 × 20% is 100%, so roughly one failure per 5 drives, and we’ve got 24 drives in 3 × 8, so about 5 of our spares will have been used for replacements - and hell, we could even have lost twice that amount and survived, without the system needing maintenance.
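
Running the same assumptions (2% AFR, 24 drives, 10 years) through the actual math gives roughly the same answer as the rounding above:

```python
# Drive-failure math from the post: 2% annual failure rate, 24 drives, 10 years.
afr = 0.02
years = 10
drives = 24

p_die = 1 - (1 - afr) ** years        # chance one drive dies within the 10 years
expected = drives * p_die             # expected failed drives across the pool

print(f"single drive, 10-year death chance: {p_die:.1%}")   # ~18%, i.e. roughly 1 in 5
print(f"expected failures out of {drives}: {expected:.1f}") # ~4.4, call it 5
```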

Of course, after 10 years or some such point you will run into the HDD wall; they are mechanical, and keeping them running will eventually kill them after a certain number of hours, depending on the particular drive production series.

High-quality electronics running without too much load basically don’t wear out; it generally takes decades, or unstable power…

So let’s say you set up this amazing system, ready to run for decades without maintenance.
Then it would just be outdated in 5 years, or you get robbed because you showed it off to the wrong person.
Because maybe that was the thing one forgot to take into account… you covered everything nature and physics could throw at you, but it ended up being a simple glance that was the most likely point of failure.

God, this got long… well, that’s my $5 on the subject… I know it was a bit in tune with what some of you were already saying, but really, I don’t think being truly HA is realistic. It’s like RAID: you need to run massive setups for it to make sense… otherwise you might just be better off running mirrors.
Mirrors in ZFS are so easy to work with, which is why many smaller setups will do that.
And really, the storage solution is the one critical point of failure…

Of course, that’s just because HDDs are mechanical; maybe we are better off just switching to SSDs.
They basically have internal RAID and clustering, which should allow them lifespans closer to those of the rest of the electronics…

Of course, then comes the issue with fans: they will also fail and need monitoring, or else one needs to go passive… though fans tend to have better lifespans than HDDs, as long as they aren’t filled with dust…

Okay, I’m done now… We could write up some established guidelines on how to keep storagenodes out of disqualification or suspension, because that is sort of the goal, and then note that it’s better for a node to go into suspension than to get DQ’d, because suspension would only be temporary and serve as an additional warning for an inattentive SNO.

But really, it becomes a cost vs. benefit / profit calculation…
Big corporations, or big ISPs supplying big corporations, need HA because their downtime could represent millions in compensation to clients, and with hundreds of clients you might be able to justify the very high costs of modern HA.

I think I’ll be fine with what worked a decade ago for ISPs and other such web hosts,
and it’s basically built into all decade-old high-end server-grade gear…

1 Like

So, I guess “if you can’t have perfection, there is no point in trying at all”?

Here’s the problem - as far as I know, Storj requires uptime comparable to a datacenter (I have not seen the new uptime requirements, so I assume the 5 hours/month). Similar to a datacenter, I have some spare hardware and could replace a failed server or switch with something that would work, even though it would not have the same performance. However, unlike a datacenter, I do not have employees who can do that very quickly at any time. I may be asleep. I may be on vacation or anywhere else where I would not be able to get to my system so fast that I could fix whatever the problem is without getting the 5 hours of downtime. Even if I have a spare server and only need to move the drives to it.

So, I should design the system to be able to tolerate pretty much any failure in my absence. And no, I do not consider things like my house getting nuked, because then I really would not care about the node anymore.

Yes, you can run any software in HA without support from the software itself. Just use VMware or similar and make the VM highly available. Awesome. Until the VM gets a kernel panic or similar.
Right now, if I want to use shared storage for my node, I can only use iSCSI, which, of course, makes it more difficult to have two VMs running (with the node active in one of them). I could use NFS or SMB, but the database the node uses does not work on those. Also, it is not possible to move the database to some other storage (even for performance reasons - to keep it on an SSD). People have asked for that feature and were denied. At the same time it is not possible to have the DB in a cluster.

Not allowing backups is a similar issue. Anybody can make a stupid mistake and delete the wrong file, etc. Backups help here. However, according to Storj:

  1. Why would a customer trust a node after it lost 10GB out of 10TB of data?
  2. If you lose 10GB out of 10TB, just kill and replace the node, the customers will trust you with their data then.
    So, essentially, losing 10TB of data (but giving the network $200 or so) is better than losing 10GB of data.
1 Like

Well, it’s a new company; when they have lost customer data one day, their views will change… and they will want SNOs to supply whatever data they can scrape back onto the network, and pay them for it.
Disregarding parity data like that is like a standard RAID array throwing out a drive because the drive is acting up, while stuff like bitrot slowly crawls into the array.

I think the SSD thing isn’t really Storj’s choice, and it’s the future of storage, because chips are cheaper and more reliable than mechanics.

We just need to ramp up mass production in the field; in a decade I don’t think we should expect HDDs to still be in use… I mean, you can get 60TB SSDs in 2.5" today, so clearly the whole data density battle is long over and done. All that remains is price, and don’t tell me silicon chips cannot be made cheaper than mechanical hard drive components - they are basically printed…

And power usage - not even a competition; vibration resistance… well… I mean, it’s difficult to name areas where SSDs don’t beat HDDs, maybe rewrite endurance, and I’m sure there will be uses for HDDs in the future, just like tape backup is still alive and well in the most high-end of systems today.
I do suspect HDDs will mostly die off… they don’t have the areal capacity and cheap production cost of a tape,
and the only reason we used them was that we didn’t have anything better, like NVM.
I do hear SSDs need power once every year or two, but I’m sure a supercap will go a long way…
Maybe that’s what HDDs will end up being used for - decades of extended storage - though from what I hear they don’t really work for that either, because the magnetic imprint degrades; though I don’t really see that in my old drives… I’ve got a few 200MB ones that I’ve kept data on which are still good…

Alas, I digress…

I’m not saying don’t try… I’m saying that 0.5% really has to be worth it, or you are just doing it because you like the challenge… looking at it from an ROI perspective, it doesn’t really make sense.
The primary points of failure will be the SNO doing stuff, the HDDs failing, or external factors…
Electronics and good programming that aren’t messed with can usually run for decades and decades…
Of course, we are talking about stuff that is made for that… your mobile phone fails quickly because it’s made to be a race car, not a ye olde tractor…

So again we are back at the whole performance vs. reliability vs. cost trade-off.

4 posts were split to a new topic: SSD vs HDD what is better?

Single identity, shared DB (cluster), shared data (NFS, SMB or Ceph). If it were possible to load-balance between the nodes, it would even help in cases where there isn’t enough CPU.

Couldn’t that actually just be extending the Storj network’s function from the WAN/internet into the LAN of the SNO? Then it’s more of an identity administration and payment structure thing.

The system already does all of this… the SNO would then just only be paid for the systems that work, and only so long as they don’t get DQ’d…

But what is Storj’s incentive for this… DQ of the entire cluster identity upon going over the allowed data corruption threshold…

I could see something like that being made with minimal changes to anything…

And if they share a RAID cluster, then it would be basically impossible to shut it down… lol, mainframe style.

If it becomes possible to use a compiled node (instead of the Docker image), I may even try to do that myself. I managed to improve the performance of my v2 nodes (by not using KFS) and I could always use that as an excuse to learn Go.

1 Like

I guess the biggest point here is that this is a lot of effort for Storj Labs to implement for an incredibly small subset of SNOs. There are probably only a handful running setups large enough to really see any benefit from this, so it just isn’t worth it to develop.

1 Like

I’m pretty sure there are people doing this (running a node compiled from source). There’s nothing magical about the Docker image; it just eases distribution and upgrades.

2 Likes

At some point during the alpha stage, IIRC, someone asked how to compile the node from source and the answer was that you cannot - there were some required settings that were not published, or something like that.
I may not be remembering it correctly, though.

Can you explain how you calculated it?

RAID5 reserves one block in each stripe for parity (redundancy) so one disk’s worth of storage is used for redundancy information.

If you have a 3x6TB RAID5 then you have 2x6TB = 12TB of usable storage (and 6TB of parity data). So you can offer 12TB of capacity to the Tardigrade network.

If you run a storage node per HDD and don’t use raid then you have the full 3x6TB = 18TB of usable storage to offer the network.

18/12 = 3/2 = 1.5
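
The same arithmetic as a tiny sketch, using the numbers from the example above:

```python
# 3 x 6 TB drives: RAID5 (one drive's worth of parity) vs one node per drive.
drive_tb = 6
drives = 3

raid5_usable = (drives - 1) * drive_tb     # 12 TB offered to the network
per_drive_usable = drives * drive_tb       # 18 TB offered to the network

print(raid5_usable, per_drive_usable, per_drive_usable / raid5_usable)  # 12 18 1.5
```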

1 Like

OKAY. You run 3 nodes, you wait 3 months for vetting, you get the same amount of data across 3 nodes as for 1 node, the STORJ token loses value… I don’t see 1.5x the potential income…

No, you run one node and you start an additional node once the first one becomes vetted OR full, whichever happens first. And so on for the next node.

Ingress traffic will be split between all of the nodes that aren’t full, yes. This is effectively a constant though and is the same for either approach, so it’s not a point for or against RAID.

STORJ token volatility is a separate and unrelated issue. As with the ingress traffic splitting, this is the same whether you are running 1 node or 3, so it’s not a point for or against RAID.

You don’t see the difference in being able to store 18TB of data for the network vs. storing 12TB?

You have a bit longer “ramp-up” time to full due to the additional vetting, but unless you expect your HDDs to fail within a year or so, you’ll wind up with more income in the long run.

Note that RAID5/6 can make sense to reduce administrative burden (creating and authorizing identities, monitoring, upgrades, etc.) which is a valid trade-off that each SNO would have to make. It really only becomes beneficial (IMO) when you have at least ten HDDs. I’m easily managing six nodes right now.

1 Like

I know the pros and cons of both solutions.
The larger disk “surface”, however, does not give 1.5x the potential income… you will not get a larger…
unless we understand the word “income” differently :slight_smile:
Greetings