I was just driving to SFO. AI billboards everywhere. AI this, AI that, vomit-inducing dot-com effect across the county. Inside the airport: massive flashing displays about the same shit.
This is my favourite topic.
Short answer: parts matter, but architecture matters more.
Reliability design starts with accepting that failure is inevitable. The question is where, how often, and what happens next.
Hardware:
Most components follow a bathtub curve. Early failures are manufacturing defects. The long flat middle is normal operation. End-of-life failures are wear-out. My policy of buying only used hardware, especially hard drives, stems from wanting to stay as far away from the left edge as possible (toy model of the curve after this section).
Capacitors, fans, HDDs, connectors, and PSUs dominate the tail. Semiconductors rarely die on their own if kept within specs.
Server parts are typically better binned and validated — tighter voltage and thermal margins, longer burn-in, lower allowed defect rates. That reduces early failures and variance. It does not change wear-out physics or make failure go away.
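For illustration only, here's a toy version of that bathtub curve: a fast-decaying infant-mortality term, a flat random-failure rate, and a Weibull-style wear-out term. Every number below is invented; the shape is the point, not the values.

```
import math

def toy_bathtub_hazard(t_hours: float) -> float:
    """Toy bathtub curve: failures per hour at age t_hours.
    All three terms use invented parameters, for illustration only."""
    infant  = 1e-4 * math.exp(-t_hours / 1_000)         # early defects, decays fast
    random_ = 1.5e-6                                     # long flat middle
    wearout = (5 / 80_000) * (t_hours / 80_000) ** 4     # Weibull-style wear-out, shape k=5
    return infant + random_ + wearout

if __name__ == "__main__":
    # Buying used is basically skipping the first few thousand hours on the left edge.
    for hours in (0, 100, 1_000, 10_000, 30_000, 50_000, 70_000):
        print(f"{hours:>6} h: ~{toy_bathtub_hazard(hours):.1e} failures/h")
```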
Software:
Most outages are software-induced. Updates, config drift, state corruption, operator error. I don’t update software unless I have to. On topic — TrueNAS Core is thankfully dead, so I can continue using it without being bothered by “updates”. The failures are known and documented. Workarounds are in place. It works.
Software doesn’t wear out, but complexity accumulates. Reliability comes from minimizing moving parts and state.
Resilience:
Assume things will fail.
Design focuses on making those inevitable failures boring, localized, and recoverable.
Fail fast. Detect early. Avoid undefined states.
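In code, "fail fast, avoid undefined states" mostly means checking every assumption at startup and refusing to run half-configured, instead of discovering the problem mid-write. Rough sketch, with made-up config names:

```
import os
import sys

# Hypothetical service config; the pattern is the point, not the fields.
REQUIRED_ENV = ("DATA_DIR", "LISTEN_PORT")

def validate_or_die() -> dict:
    """Fail fast: check every assumption at startup and exit loudly,
    instead of limping into an undefined state later."""
    missing = [k for k in REQUIRED_ENV if not os.environ.get(k)]
    if missing:
        sys.exit(f"refusing to start, missing config: {', '.join(missing)}")

    data_dir = os.environ["DATA_DIR"]
    if not os.path.isdir(data_dir):
        sys.exit(f"refusing to start, data dir does not exist: {data_dir}")

    port = int(os.environ["LISTEN_PORT"])  # a ValueError here is also fine: early and loud
    if not 1 <= port <= 65535:
        sys.exit(f"refusing to start, bad port: {port}")

    return {"data_dir": data_dir, "port": port}

if __name__ == "__main__":
    print("config ok:", validate_or_die())
```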
Redundancy:
Redundancy only helps if failures are independent.
That’s why I find RAIDZ3 outrageously stupid. If that many disks fail at once, you’re no longer dealing with uncorrelated failures. You’re dealing with a shared cause, and therefore piling on parity past RAIDZ1 mostly buys feel-good math, not actual reliability.
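Back-of-the-envelope version of that argument, with made-up numbers: a per-disk failure probability over the window you care about (say, a rebuild), plus a small probability of a shared-cause event (bad batch, shared PSU, firmware, fire) that no amount of parity survives. Once that correlated floor is in play, stacking parity stops moving the total.

```
from math import comb

def p_exceeds_parity(n_disks: int, parity: int, p_disk: float) -> float:
    """P(more than `parity` disks fail), assuming fully independent failures."""
    return sum(
        comb(n_disks, k) * p_disk**k * (1 - p_disk) ** (n_disks - k)
        for k in range(parity + 1, n_disks + 1)
    )

# Invented inputs: 8-wide vdev, 1% chance a given disk dies during the window,
# 0.1% chance of a shared-cause event the parity level can't save you from.
N, P_DISK, P_COMMON = 8, 0.01, 0.001

for parity in (1, 2, 3):
    independent = p_exceeds_parity(N, parity, P_DISK)
    total = P_COMMON + (1 - P_COMMON) * independent
    print(f"RAIDZ{parity}: independent-only {independent:.2e}, "
          f"with common cause {total:.2e}")
```

With these invented inputs, going from Z2 to Z3 cuts the independent-only number by roughly 80x but the real total by about five percent. The exact values don't matter; the correlated floor does.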
Recovery:
This is where systems actually become reliable.
Rebuilds you can repeat without thinking.
State kept to a minimum, ideally nowhere important.
Backups that have actually been restored at least once.
That’s also why I like TrueNAS. The boot drive does not matter. The entire system config lives in a single SQLite DB. Reinstall (remotely, via the IPMI disk-mounting feature), restore config, done.
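A sketch of the "backups that have actually been restored" bit, applied to that config DB. Assumptions: you've already copied the DB off the box (as far as I recall, Core keeps it at /data/freenas-v1.db, or just use the config export in the UI), and "restore test" here is the minimal one: the copy opens as SQLite and has tables in it.

```
import datetime
import shutil
import sqlite3
import sys
from pathlib import Path

# Assumption: the config DB has already been pulled off the box, e.g. via scp
# or the web UI's config export. The default filename here is just a guess.
SRC = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("freenas-v1.db")
DEST_DIR = Path("config-backups")  # wherever your backups actually live

def backup_and_verify(src: Path, dest_dir: Path) -> Path:
    """Copy the config DB aside, then actually open the copy.
    A backup that has never been opened or restored is a hope, not a backup."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    dest = dest_dir / f"{stamp}-{src.name}"
    shutil.copy2(src, dest)

    # Minimal restore test: the copy parses as SQLite and actually contains tables.
    conn = sqlite3.connect(dest)
    try:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
    finally:
        conn.close()
    if not tables:
        raise RuntimeError(f"{dest} opened, but contains no tables")
    print(f"backed up {src} -> {dest} ({len(tables)} tables)")
    return dest

if __name__ == "__main__":
    backup_and_verify(SRC, DEST_DIR)
```

Not a substitute for actually reinstalling and uploading the config once, but it catches the "backup is a zero-byte file" class of surprise.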
The goal isn’t heroic uptime; you want systems that degrade predictably and return to service without human creativity.
That’s very “boring”. And also satisfying.