i was down for about 8 hours a few days ago due to hardware upgrades and other issues i ran into after rebooting…
within 48 hours i was back at 100% uptime for all satellites, the network seemed to barely notice i was gone… which is good… sometimes one will have downtime, and we are also allowed up to 6 hours per month…
dunno if that means that every quarter we may go offline for a day of upgrades… well i would say try to minimize downtime, else you might get suspended, which is something like a 6 week period…
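a quick back-of-the-envelope on that allowance, using the 6 hours per month figure above and assuming a 30-day month:

```shell
# downtime budget math for the stated 6 hours/month allowance
awk 'BEGIN {
  total = 30 * 24                                   # hours in a 30-day month
  printf "downtime budget: %.2f%% of the month\n", 6 / total * 100
  printf "required uptime: %.2f%%\n", (1 - 6 / total) * 100
  # if the allowance pooled over a quarter (the speculation above),
  # that would be 6 * 3 = 18 hours, a bit short of a full day
  printf "per quarter, if pooled: %d hours\n", 6 * 3
}'
```

so the allowance works out to a roughly 99.17% uptime floor.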
but really when i upgrade i try to do it in stages…
first there is my primary task, the reason to shut down in the first place.
then there are the secondary tasks, usually a long list of stuff of varying priority that i kind of want to get done… i pick from those depending on time, since i'm down anyways… then it's a matter of optimizing what to do. my last 8hr downtime went something like this:
2. remove my riser card, to move to low profile cards so i could use all the PCIe slots on the mobo
3. change my RAID card to 2x HBA's
4. adjust RAM placement to get from 800MHz to 1066MHz
5. install 5 new HDD's; didn't really require downtime, but my backplane had been acting up, so:
5a. change the cables to my backplane.
5b. shuffle my drives around in the HDD bays, to try and rule out errors other than a bad drive.
6. BIOS optimizations, since i was rebooting a lot anyways
6a. attach a watt meter to verify that my BIOS configuration enabled the power saving features.
7. move my SSD drive to internal SATA
8. take off my CPU heatsinks and reapply thermal paste
9. reinstall the OS, to move it away from the SSD which is also my L2ARC and SLOG (it was causing too much disk latency)
10. remake my zfs pool to run with a 64k volblocksize.
10a. move my storj folder into its own dataset, so i get more options with zfs
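a sketch of what nr 10 and 10a look like as commands — the pool and dataset names are placeholders, not my actual layout; note that volblocksize is a zvol property, while regular filesystem datasets use recordsize for the same kind of tuning:

```shell
# 64k volblocksize is a zvol property, set at creation time
# ("tank" and the sizes here are example values):
zfs create -V 500G -o volblocksize=64k tank/vol

# for a regular filesystem dataset the analogous knob is recordsize:
zfs create -o recordsize=64k tank/storj
```

a dedicated dataset gets its own properties, snapshots and quotas, which is the "more options" part.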
that went pretty okay, but my OS had been reconfigured since the last reboot, so i couldn't get online…
so i skipped nr 8 because i was out of time, and tried a reinstall to fix it, but ran into other issues going that route… eventually i figured out it was my network settings: i had switched from automatic ip (DHCP) to static, and that requires additional lines of configuration in the interface configuration file.
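for anyone hitting the same thing: on a Debian-style system (an assumption — adjust for your distro) a static setup in /etc/network/interfaces needs stanzas like these, with the interface name and addresses being example values only:

```
# /etc/network/interfaces — static instead of dhcp
auto eth0
iface eth0 inet static
    address 192.168.1.50
    netmask 255.255.255.0
    gateway 192.168.1.1
```

with plain dhcp the whole stanza is just `iface eth0 inet dhcp`, which is why the switch to static catches people out.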
i tried to do 9-11, but the backplane was busted, so i couldn't get 10 drives running on it.
then when i got back online my zpool was offline and didn't want to mount… turned out the OS had created folders in its mountpoint, which i had to delete before zfs would automount it.
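that check is simple enough to script; here's a runnable simulation (a temp dir stands in for the pool mountpoint, since i'm not assuming a live pool) — zfs refuses to mount a dataset over a non-empty directory:

```shell
# simulate the failure mode: the OS created a folder inside the pool's
# mountpoint while the pool was offline
mnt=$(mktemp -d)
mkdir "$mnt/stray"

# the pre-flight check: is the mountpoint empty before `zfs mount -a`?
if [ -n "$(ls -A "$mnt")" ]; then
  echo "mountpoint not empty: clear it before mounting"
fi

rm -r "$mnt"
```

newer OpenZFS also has an overlay mount option (`zfs mount -O`), but deleting the stray folders keeps things clean.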
it was a bit of a mess, and at the end i had nearly given up on getting it online that day…
but eventually everything lined up, and i got some critical upgrades done…
in retrospect i should have rebooted first, to verify that my OS was operational before continuing.
No plan survives contact with the enemy, and upgrading critical infrastructure on a time limit is never fun; it will always be structured chaos… imo it's about setting a few critical goals that one wants to try to achieve, but keeping some outs in case one decides to backtrack and just get operational again.
one thing corporations do is split up storage and hosts, so that multiple hosts can utilize the same storage… that way you can easily take one host or OS down while another takes over. having things set up so that there are many options and secondary ways to solve the same problem, if everything turns into a shitshow, is critical…
my backup plan was basically to move my drives to a workstation and leave the server down for the count… but who knows how that might have gone lol…
and then do your upgrades in the smallest possible segments, mostly to help you eliminate and minimize the points of failure, or the troubleshooting needed to get back to being operational again.