Blueprint: tracking downtime with audits

We had the power company replace poles on our road recently. No advance notice. Power off 3 x 8h. The fibre cable was in the way, so they cut it without telling the telco. No doubt this will happen again in the future.

I remember some nodes being disqualified for bad uptime before Storj found out that the downtime tracking service did not work correctly and was disqualifying nodes for much less than the 5 or 6 hours of downtime.

Where I live, home users do not get priority service for repairs, so if there is some problem with my internet connection on a Friday evening, the ISP may not repair it until Monday or later. Electricity is a bit different, but it still could take a long time depending on what the problem is. That is not to say that any problem will take many hours to fix, just that the SLA my ISP gave me means it could take days.

A failed power supply could also mean a long downtime, unless I have a spare one or can quickly repair the failed PSU.

All that assumes that I am at home and not on vacation somewhere far away when the problem occurs.

On one hand, I understand that unreliable nodes are bad for the network, but I think that should apply more to nodes that are frequently unavailable instead of a node that has 1 hour downtime in the last 6 months but now there is some disaster and the node is offline for days.
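To make that distinction concrete, here is a minimal Python sketch that summarizes a node's outages over a review window in three ways: overall uptime fraction, number of outages, and the longest single outage. The `Outage` type, the `summarize` function and the 180-day window are all hypothetical, and this is not how the satellite actually tracks downtime. It only shows that a policy could look at frequency and duration separately.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """One period during which the node was unreachable (hypothetical type)."""
    start: datetime
    end: datetime


def summarize(outages, window_start, window_end):
    """Return (uptime_fraction, outage_count, longest_outage) for the window.

    Assumes the outages do not overlap each other. Outages that straddle
    the window boundary are clipped to the window.
    """
    window = window_end - window_start
    total_down = timedelta(0)
    longest = timedelta(0)
    count = 0
    for outage in outages:
        start = max(outage.start, window_start)
        end = min(outage.end, window_end)
        if end <= start:
            continue  # entirely outside the window
        duration = end - start
        total_down += duration
        longest = max(longest, duration)
        count += 1
    return 1 - total_down / window, count, longest


# Example: a node with 1 hour of downtime, then a 2-day disaster.
window_end = datetime(2020, 7, 1)
window_start = window_end - timedelta(days=180)
outages = [
    Outage(datetime(2020, 3, 1, 12), datetime(2020, 3, 1, 13)),
    Outage(datetime(2020, 6, 20), datetime(2020, 6, 22)),
]
uptime, count, longest = summarize(outages, window_start, window_end)
```

In this example the node is still at roughly 98.9% uptime over the 180 days, with only two outages. A rule based on a long window or on outage count would treat it very differently from a rule based on the longest single outage.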

Most home users will not be able to achieve good uptimes. In many countries, 1-2 days of downtime per month would have to be considered normal.
Also, if the SNO is traveling, a few days of downtime as a result would also be a normal situation.
If all such cases are disqualified, many home users will have to leave the network…
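For a sense of scale, here is a small Python sketch that converts downtime per month into an uptime percentage. The function name and the 30-day month are assumptions made for illustration, not figures from any official requirement.

```python
def uptime_percent(downtime_hours, period_days=30):
    """Uptime as a percentage of the period, given total downtime in hours."""
    period_hours = period_days * 24
    return 100 * (1 - downtime_hours / period_hours)


one_day = uptime_percent(24)   # 1 day offline in a 30-day month
two_days = uptime_percent(48)  # 2 days offline in a 30-day month
```

One day offline per month works out to about 96.7% uptime, and two days to about 93.3%.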

Yeah, that’s the main difference between a real datacenter and a home “datacenter”. I do not have employees that could look after the servers while I am on vacation. If something can be fixed remotely and I have an internet connection - sure, but not if the problem requires me to be physically present. At least servers have IPMI (so more stuff can be done remotely), but a Raspberry Pi or a normal PC doesn’t have it, though an IP-KVM device can be almost as useful.

Plus a smart plug if it completely hangs and needs a hard reset. Though that’s obviously something you’d want to avoid as much as possible.

here usually all that stuff is very reliable, ofc that may not be nationwide… nor be the same in the future, but if the past is any indication then i like my odds… so for me i’ll be looking mostly at hardware redundancy and maybe some failover solutions… in case of OS issues, failing upgrades, hardware downtime and whatnot…

might buy more servers, but may also just see if i can rig together some lower power usage device that i can hook an HBA to and then keep the HDDs in a disk shelf so i can switch the controlling computer with little trouble…
don’t really understand all that SAS interlinking voodoo yet, but it seems like a very viable and efficient solution… that’s the long term plan anyways… have to get more data for that to be viable tho…
i really like the idea that the storage is separate from the server, might not be a very power efficient solution, but i’m guessing that’s more down to the hardware in question, such as the PSU and controllers in the DAS…

lol my keyboard seems to have lost function in the AT key, weird… maybe time for a reboot of this machine… turns out one loses alt gr if an RDP session reconnects while in the background… weird lol

@Pentium100 yeah i really like the IPMI function, even if it’s not often used… long term tho i want to do a nearly complete failover setup… then the lower power failover can also be focused on long term uptime without power… make it so the system can run and run and run without me doing anything… no matter what really happens… aside from acts of god xD

@BrightSilence
actually the IPMI usually has a hardware watchdog which will hard reset if the OS, or whatever it’s set up to monitor, stops resetting its timer…
or it can be set to do a hard reset… it has a nice selection of different options
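The watchdog idea is simple enough to sketch in a few lines of Python. This is not the IPMI interface or any real watchdog driver, just the logic: something must keep resetting ("kicking") a countdown, and if it stops, the deadline passes and a reset action would be triggered. The `Watchdog` class, its method names and the 60-second timeout are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class Watchdog:
    """Toy model of a watchdog timer (hypothetical, not a real IPMI API)."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s

    def kick(self, now):
        """Called periodically by the monitored OS or service to show it is alive."""
        self.deadline = now + self.timeout_s

    def expired(self, now):
        """True once the monitored side has gone quiet for a full timeout."""
        return now >= self.deadline


wd = Watchdog(timeout_s=60, now=0)
wd.kick(now=30)                         # OS is alive, deadline moves to t=90
healthy_at_80 = not wd.expired(now=80)  # still inside the timeout
hung_at_95 = wd.expired(now=95)         # no kick since t=30, a reset would fire
```

A real hardware watchdog does the same thing independently of the OS, which is why it can still act when the machine is completely hung.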

Yeah, I worded that weird, I meant that for everyone without IPMI

i think i would just go with a software version… ofc there might be a specific problem somebody is trying to solve, or such… but if one wanted to do HA then the cheapest route i believe would be to include it in the system to begin with instead of adding the features later…

i looked into it and i can even make one of my CPUs redundant… is expensive tho… basically just runs like a mirror on the whole thing… so either 50% performance loss or 1/3… i think it was 1/3 but that wouldn’t really make sense when only having 2 CPUs… ofc one rarely runs on full tilt… well not really a huge concern imo… so didn’t bother testing it…

kinda cool tho… also works for the RAM… so in lockstep mode or spare mode one can hot pull RAM modules without the system losing even a byte…

ofc that costs like 1/3 or 1/4 of memory capacity, pretty neat feature tho. with my setup i think i can pull like 4 modules… 1 from each… channel grouping… bank DIMM

i also have the option of splitting my QPI bus into multiple channels, so that it can drop a channel for the CPU without suffering anything aside from a bandwidth limitation…
and then there are the multiple NICs on each of the multiple NIC controllers, the option of adding a redundant PSU…

i should really finish my HA Guide post…
kinda ground to a halt on it because i got bored of it… but it’s been just sitting there waiting to get finished… tsk tsk

anyways, all good comments, but really, uptime policing will chase away the broader home user community, which I believe is essential to the project too

Right, so I meant not any of the users you are talking about. Nobody running on an rpi or NAS is thinking HA. Nor should they. It’s complete overkill.