I just wanted to show how unrealistic the original 5 hour/month downtime limit is.
Yesterday my server froze at about 03:00, and I only noticed when I got up at about 07:00. I rebooted it and it seemed to work OK.
Today the server froze again, this time at 01:00, and again I only noticed at 07:00. This time the server did not boot up properly and I had to troubleshoot it; I managed to get it booting after about an hour.
Apparently the problem is with the HBA, the backplane, or at least one hard drive: the backplane froze with all the drives inserted. Booting the server without the drives and then inserting them one by one made it work, though one slot did not appear to work, so I used a different slot for that drive.
I have ordered a new backplane and HBAs, but the problem could return before the new parts arrive, meaning that, for a while, my node may not be as reliable as it was.
I just wanted to share this as an example of how unrealistic the 5 hour limit was. Maybe in a datacenter there would be spare parts on hand and somebody on call at night to fix such problems.
Unless I (or someone else) use Ceph or similar for the data and run the node in a cluster (the node software itself does not make that very easy though), this sort of problem is to be expected in the long run.
EDIT: My node worked normally with minimal downtime for about 16 months. I do not know the actual downtime, but let's assume it was 1 hour/month or less. So, over 17 months (including this one) I got 17+11 hours of downtime, which makes it 28 hours/17 months, or about 19h46m/year, or 99.77% uptime, which is quite normal IMO.
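For anyone who wants to check the arithmetic, here is a quick sketch of the same calculation (the numbers are the ones from this post; the hours-per-year constant assumes a 365.25-day year):

```python
# Availability math from the post's numbers.
HOURS_PER_YEAR = 8766  # 365.25 days * 24 h (assumed convention)

months = 17
baseline_downtime = 17   # assumed ~1 hour/month over 17 months
incident_downtime = 11   # the two freezes described above (4 h + 7 h)
total_downtime = baseline_downtime + incident_downtime  # 28 h

downtime_per_year = total_downtime / months * 12
availability = 1 - downtime_per_year / HOURS_PER_YEAR

print(f"{downtime_per_year:.2f} h/year downtime")  # -> 19.76 h/year downtime
print(f"{availability:.2%} availability")          # -> 99.77% availability
```

19.76 hours is 19h46m, so the numbers in the edit above line up.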