When to move to new hardware

cpare · February 25, 2023, 9:41pm

I had a hardware failure that resulted in a few days of downtime before I could find a way to survive until rebuilding my server, and I am now sitting at 93% uptime on my shoe-goo and duct-tape temporary solution.
The replacement motherboard, processor, memory and boot disk have now arrived - I intend to use the same case and power supply.

My question is should I take the downtime now and drive my uptime even lower (I expect no more than a day of downtime) or do I wait until my uptime returns to 99% to avoid any risk of being delisted?

Knowledge · February 25, 2023, 9:51pm

Only you really know what is best here, since you understand your machine’s potential for failure and how smoothly and how quickly you can upgrade your hardware.

sembeth · February 25, 2023, 10:09pm

Personally, I would change it right now. Even with one day of downtime, you should be above 60% online score. If you fall below 60%, you will be suspended for 30 days, and during that time you will receive less data. But if you are above that, I don’t see a problem.
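
As a rough back-of-the-envelope check of that reasoning, here is a minimal sketch. It assumes the online score is simply the fraction of a 30-day window the node was reachable, and that none of the earlier downtime ages out of the window in the meantime. The real score is computed by the satellites and may behave differently; the function name and the 30-day window are illustrative assumptions, not Storj's API.

```python
WINDOW_DAYS = 30.0  # assumed length of the scoring window


def projected_online_score(current_score: float,
                           extra_downtime_days: float,
                           window_days: float = WINDOW_DAYS) -> float:
    """Estimate the online score after more downtime inside the same window.

    Simplification: score = time online / window length, and no old
    downtime drops out of the window while the new downtime accrues.
    """
    downtime_so_far = (1.0 - current_score) * window_days
    total_downtime = min(window_days, downtime_so_far + extra_downtime_days)
    return (window_days - total_downtime) / window_days


# 93% today plus one more full day offline.
score = projected_online_score(0.93, 1.0)
print(f"{score:.1%}")  # roughly 90%, well above the 60% suspension threshold
```

Under these assumptions, even a full day of extra downtime leaves the score far from 60%.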

arrogantrabbit · February 25, 2023, 10:23pm

I may be misunderstanding your circumstances, but if the hardware failure you are describing took out the motherboard, memory, and drives, I would definitely not re-use the same power supply (it is a primary suspect).

It’s also unclear why you would need a day of downtime here: slapping parts together and pressing a power button should take an hour tops, no?

cpare · February 26, 2023, 1:57am

In this case a single SATA port died, which happened to be the SATA port I had the boot disk on. With no spare SATA ports (I already have an additional SATA card) I needed to replace the motherboard (second 970A-DS3P on the current 8x AMD FX™-8310 Eight-Core Processor). Moving to a new chipset required a new processor and DDR4 memory, and I opted for a new M.2 SSD to free a SATA port. I don’t believe the power supply is the issue, as this is the second 970A-DS3P I have had die - I expect they are dying due to age (2014).

I don’t expect a day of downtime either, but experience has taught me to plan for some type of Linux incompatibility I will have to work through - for example, the current board requires the kernel option iommu=soft to get the USB ports on the front and back of the board to work.

The real question is whether there is a critical limit I am approaching with my downtime, and whether I should stay in the current paradigm until I get the uptime back around 99% (to allow some downtime without getting delisted) or until another hardware failure makes the decision for me.

Knowledge · February 26, 2023, 2:58am

You can go about 12 days offline and not get disqualified. If you are offline for a while before that point, you may get suspended on one or more satellites. But if you then leave your node online, it will come out of suspension after a number of days.
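
The 12-day figure lines up with a 60% threshold on a 30-day window (0.40 × 30 = 12). A small sketch of the remaining downtime budget, assuming the score is a plain fraction of a 30-day window and that earlier downtime has not yet aged out; both the threshold and the window are parameters here, and the function name is only illustrative:

```python
def downtime_budget_days(current_score: float,
                         threshold: float = 0.60,
                         window_days: float = 30.0) -> float:
    """Days of additional downtime before the score reaches the threshold.

    Simplification: score = time online / window length, with no old
    downtime expiring from the window in the meantime.
    """
    return max(0.0, (current_score - threshold) * window_days)


print(round(downtime_budget_days(1.00), 1))  # a node with a perfect score
print(round(downtime_budget_days(0.93), 1))  # a node currently at 93%
```

With these assumptions, a node at 93% still has most of that budget left, so a planned day of maintenance uses only a small part of it.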

arrogantrabbit · February 26, 2023, 7:47am

Wouldn’t adding (another) HBA card (such as LSI 9223-8i) be a better (and cheaper) option? This will expand your choice of drives to include (usually cheaper) SAS models. Generally, it’s best to avoid relying on onboard anything – those devices are often of poor quality and performance.

Ah, yes, it’s a chain reaction…

Make sure you can actually use M.2 and SATA concurrently. Devices often share PCIe lanes, so that if you plug a card into the M.2 slot, one of the SATA ports turns off. AMD processors have plenty of lanes available, but that does not mean that corners aren’t cut at every opportunity – such as connecting everything to the south bridge, simply to avoid routing long traces to the CPU.

They should not be dying so quickly. Perhaps the power supply is indeed to blame: bad ripple or poor regulation could cause such premature failures. Modern (~last two decades) motherboards are pretty much immortal. I have a few that have refused to quit since 2012, and I don’t have the heart to recycle them, so I keep inventing excuses for them, like hosting Storj nodes…)

It’s a good approach.

This is one of the reasons to avoid boards that have been engineered to a price through overprovisioning and cut corners, and that require weird workarounds. Switching off hardware IOMMU affects performance and defeats the whole point of having one, and should not be needed if the device’s routing is properly designed.

Have you looked into surplus stores or electronics recyclers? You can often buy an 8-10-year-old server motherboard with IPMI+CPU+heatsink+ECC RAM for under $200 total and, more importantly, with an almost guaranteed lack of compatibility issues with Linux/BSD.

I would not worry about downtime; nobody expects to have 6 nines of uptime on home networks.

cpare · March 1, 2023, 1:00am

Thanks so much for the feedback - I decided to stay the course with the new hardware and plan to install it this weekend. I also took your advice and ordered a new power supply to make sure that’s not the root of the problem.