Setting up your home network for HA isn’t really a software problem… the main part is that you need double the hardware… because you want failover in case anything breaks.
sure there are some later-stage considerations, but even aside from having nodes in multiple locations, setting up a home network for HA is a big task.
let’s say you do the Herculean work and actually turn your entire home network into an HA network.
do you also pay your ISP for redundant service? now you have doubled the cost, if not more, because at the very least you generally want two of everything…
and even with double the capacity, you most likely won’t benefit from it, because that capacity is reserved for redundancy.
great, now you are set… everything is working, you have failover on everything, full replication… and so on…
and then lightning strikes nearby, taking out power in the region, maybe even damaging your now at-least-twice-as-expensive gear…
UPS you say… okay okay…
so what about flood damage, break-ins, people digging in the local area and cutting the power…
sure, as we move down the list things become more and more unlikely… but the point is that making an HA setup isn’t really about software… it’s partially what i think makes Storj so great.
imagine the redundancy: so long as you are online, the network should work… the world gets hit by a meteor on top of every Google datacenter… so what, you are decentralized, just log back on… xD
if you want a better setup, figure out the most likely cause of your node’s next outage and avoid it, or simply look back… what has taken it down in the past… and why, hehe…
A.I calculates the biggest danger of storagenode downtime and removes it…
SNO vanishes.
The problem with HA is that you might live in a tornado area, flood area, brushfire area… and really it’s the problem you didn’t think of that will get you.
but HA-like technology has existed for a while; i would believe that’s partially the reason for stuff like multiple NICs in servers, dual CPUs, and RAID setups. put your nodes on the server, and if you want to be extra safe, run your storage separately, so you can connect over external SAS cable or iSCSI,
allowing you to quickly switch to a redundant server on the network or nearby.
but really, Storj isn’t going to sue you into the bowels of hell for losing your node…
let’s be realistic: if you have run a node for a year or two, how much downtime have you really had…
most will barely be over a few days.
so roughly 700 days in two years; let’s say 3.5 days of downtime, 3.5/700 ≈ 0.5%.
so what are you really paying the extra cost for… to shave off the 0.5% chance of your node going down… sure, if you have 20 nodes… then the chance of at least one failing in a year is around 10%…
and as high as 10% sounds… with a bit of luck you can essentially go a decade without hitting those odds.
so essentially you are willing to pay double the upkeep to reduce a 0.5% annual failure chance… maybe taking it to 0.1%, because there will always be something you didn’t think of…
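the back-of-envelope math above can be sketched in a few lines (the 3.5 days and 20 nodes are the numbers from this post; the multi-node figure assumes failures are independent, which is only an approximation):

```python
# Rough downtime odds from the post's numbers.
downtime_days = 3.5          # rough total downtime over two years
total_days = 700             # ~two years, as in the post
p_down = downtime_days / total_days
print(f"single-node downtime fraction: {p_down:.1%}")   # ~0.5%

# With 20 nodes, the chance that at least one has an incident in a year,
# assuming independent failures:
nodes = 20
p_any = 1 - (1 - p_down) ** nodes
print(f"chance at least one of {nodes} nodes is hit: {p_any:.1%}")  # ~9.5%
```

note the compounded figure comes out slightly under the linear 20 × 0.5% = 10% estimate, which is why the post’s round number is fine as a gut check.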
(almost my currently planned setup - power here is so stable that i won’t need a generator)
a disk shelf with dual controllers and dual power supplies; a server with dual CPUs and multiple NICs, hooked up to the internet over dual connections, maybe using fiber to mitigate electrical issues over the network.
run vdevs in sets of 8 drives in raidz2, 3-4 vdevs per pool with ZFS, with the gear placed in the middle of the building, the room made into a Faraday cage and plastic-coated (for water protection), the rack raised off the floor (water again), no pipes in the walls, a basic UPS for emergency shutdown, and a generator to take over in extended power outages.
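for the pool layout itself, the setup described above could look something like this sketch (the pool name `tank` and the disk names are placeholders for your actual `/dev/disk/by-id` devices; this is an illustration, not a tested command line):

```shell
# Hypothetical layout: 3 raidz2 vdevs of 8 drives each, plus hot spares.
zpool create tank \
  raidz2 disk1  disk2  disk3  disk4  disk5  disk6  disk7  disk8 \
  raidz2 disk9  disk10 disk11 disk12 disk13 disk14 disk15 disk16 \
  raidz2 disk17 disk18 disk19 disk20 disk21 disk22 disk23 disk24 \
  spare  disk25 disk26

# Each raidz2 vdev survives any 2 concurrent drive failures; with
# autoreplace enabled, a hot spare resilvers in automatically:
zpool set autoreplace=on tank
```

the remaining spares can stay on the shelf as cold spares, swapped in by hand when a hot spare gets consumed.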
then you are most likely at 99.99%; most remaining issues would be software, the SNO, or external factors.
and looking at it long term, the RAID array will be the first thing to die…
in, let’s say, 10 years, if it’s not overworked… with 4 vdevs of 8 you have double redundancy everywhere… but let’s say you have gone with 3 vdevs instead, keeping the remaining 8 drives as cold + hot spares, and you selected drives with less than a 2% annual failure rate.
so 1 drive has roughly a 20% chance of dying in 10 years, i.e. about one failure per 5 drives; with 24 drives in the 3x8 vdevs, that’s about 5 failures in 10 years, so 5 of our 8 spares get used, and we could even have lost twice that amount and survived, all without the system needing maintenance.
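a quick sanity check on that spare-drive math (the 2% AFR and 24 data drives are the post’s numbers; treating the failure rate as constant over a drive’s whole life is a simplification real drives don’t honor):

```python
# Expected drive failures over 10 years at a 2% annual failure rate (AFR).
afr = 0.02
years = 10
drives = 24                  # 3 vdevs x 8 drives

# Linear estimate used in the post: ~20% per drive over 10 years.
p_fail_linear = afr * years
expected_failures = drives * p_fail_linear
print(f"expected failures over {years} years: {expected_failures:.1f}")  # ~4.8

# Slightly more careful: survival compounds year over year.
p_fail_compound = 1 - (1 - afr) ** years
print(f"compounded per-drive failure chance: {p_fail_compound:.1%}")  # ~18.3%
```

either way it lands around 4-5 expected failures, comfortably inside an 8-drive spare pool.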
of course, after 10 years or some such factor you will run into the HDD wall; they are mechanical, and keeping them running will eventually kill them after a certain number of hours, depending on the particular drive production series.
high-quality electronics running without too much load basically don’t wear out; it generally takes decades, or unstable power…
so let’s say you set up this amazing system, ready to run for decades without maintenance.
then it will just be outdated in 5 years, or you’ll get robbed after showing it off to the wrong person.
because maybe that was the thing one forgot to take into account… everything nature and physics could throw at you, but in the end a simple glance from the wrong person was the most likely point of failure.
god this got long… well, that’s my $5 on the subject… i know it was a bit in tune with what some of you were already saying… but really i don’t think being truly HA is realistic; it’s like RAID, you need to run massive setups for it to make sense… otherwise you might be better off just running mirrors…
mirrors in ZFS are so easy to work with… which is why many smaller setups do that.
and really the storage solution is the one critical point of failure…
of course, that’s just because HDDs are mechanical; maybe we are better off just switching to SSDs.
they basically have internal RAID and clustering, which should give them lifespans closer to the rest of the electronics…
of course, then comes the issue with fans; they will also fail and need monitoring, or else one needs to go passive… though fans tend to have better lifespans than HDDs, as long as they don’t fill up with dust…
okay, i’m done now… we could write some established guidelines on how to keep storagenodes from disqualification or suspension, since that is sort of the goal, and note that it’s better for a node to go into suspension than to get DQ’d, because suspension is only temporary and serves as an additional warning for an inattentive SNO.
but really it becomes a cost vs benefit / profit calculation…
big corps, or big ISPs supplying big corps, need HA because their downtime could represent millions in reparations to clients; with hundreds of clients you might be able to afford the very high cost of modern HA.
i think i’ll be fine with what worked a decade ago for ISPs and other webhosts.
and it’s basically built into all decade-old high-end server-grade gear…