Just had 7 hours of downtime. 4 hours yesterday

I just wanted to show how unrealistic the original 5 hour/month downtime limit is.

Yesterday my server froze at about 03:00 and I only saw it when I got up at about 07:00. I rebooted it and it seemed to work OK.
Today the server froze again, this time at 01:00, and again I only saw it at 07:00. This time the server did not boot up properly and I had to troubleshoot it. I managed to get it booted after about an hour.

Apparently the problem is with the HBA, the backplane, or at least one hard drive - the backplane froze with all drives inserted. Booting the server without the drives and then inserting them one by one made it work, though one slot did not appear to work, so I used a different slot for that drive.

I have ordered a new backplane and new HBAs, but the problem can return before the parts get here, meaning that, for a while, my node may not be as reliable as it was.

I just wanted to share this as an example of how unrealistic the 5 hour limit was. Maybe in a datacenter they would have spare parts and somebody would be on-call at night to fix them.
Unless I (or someone else) use Ceph or something similar for the data and run the node in a cluster (the node software itself does not make that very easy, though), this sort of problem is to be expected in the long run.

EDIT: My node worked normally with minimal downtime for about 16 months. I do not know the actual downtime, but let’s assume it was 1 hour/month or less. So, over 17 months I got 17+11 hours of downtime, which works out to 28 hours/17 months, or about 19h46m/year, or 99.77% uptime, which is quite normal IMO.
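For anyone who wants to double-check that arithmetic, here is a quick back-of-the-envelope calculation (a minimal Python sketch using the assumed ~1 hour/month figure from above):

```python
# Quick sanity check of the downtime figures above.
# Assumed: ~1 hour/month of background downtime for 17 months,
# plus 11 hours from this incident (4 h yesterday + 7 h today).
months = 17
background_hours = 1 * months       # assumed ~1 h/month
incident_hours = 11                 # this week's failures
total_hours = background_hours + incident_hours    # 28 h over 17 months

hours_per_year = total_hours / months * 12          # ~19.76 h/year
uptime = 1 - hours_per_year / (365.25 * 24)          # fraction of the year up

print(f"{total_hours} h over {months} months")
print(f"~{int(hours_per_year)}h{round(hours_per_year % 1 * 60)}m per year")
print(f"~{uptime:.2%} uptime")
```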

8 Likes

I also find the 5 hours per month unrealistic; my ISPs had several failures this year with downtimes of >6h each.

1 Like

I would assume that you use some kind of RAID and a datacenter-like setup.
I can share my experience.
My Raspberry Pi 3B+ has been working fine for 20 months. Yes, it has an expansion card from Geekworm and lives in their case. It has only one HDD. Yes, it’s USB. It doesn’t have any UPS. It sits on the sofa in a guest room.
I have flashed the SD card twice - the first time when I assembled it and the second time a few months ago (I was using the SD card for swap; it was a nice experiment, but something went wrong and it stopped booting. The downtime was only about half an hour, because the node was still working when I noticed that I couldn’t run any new command, and it simply did not come back up after the reboot).

My Windows Docker node has been working fine for 14 months. The Windows host is just Windows 10 Pro with Hyper-V. The host has a UPS.
Downtime happened a few times after upgrading Docker Desktop from 2.1.0.5 to the latest version, so I rolled it back. There have been no problems since. This node uses one HDD too.

My Windows GUI node has been working fine for 15 months (it started as a Docker node and was converted to the GUI version when that became available). It hasn’t had any significant downtime that I can remember.
Perhaps only when the host was physically moved from one location to another, ~2-3h, about two years ago.
This node uses one HDD too. The host is the same one as the Docker node.

1 Like

IMO, you got lucky. I have seen regular PCs used as servers with no RAID and working normally for years.
I have also seen a new server with new SSDs mess up its own system partition in a very interesting way, resulting in no boot and the need to restore the system partition from backups.

There have also been posts by node operators who used USB hard drives and something made the drive disconnect (bad cable, bad controller etc), which resulted in very fast DQ (because the node started failing audits).

My point is that even a more datacenter-like setup is not immune to hardware failure or some other problem that may result in long downtime once in a while. And I suspect that a lot of node operators, unlike a datacenter, do not have staff who can fix the server 24/7 or a stock of spare parts. For example, it would have been even worse if I was on vacation.

My other downtimes were mostly ISP-related (it takes a few minutes to switch to the backup line).

If the backplane fails again before I get a new one, I think I can build some kind of abomination to run the node, but that would mean some more downtime. If the old 5h/month limit were still enabled, my node would be toast right now, and I do not think I would create a new one.

1 Like

That’s the important thing. If SNOs lose their nodes and their funds too easily, they will leave and never come back, because it would just not be worth it.

4 Likes

Well, if I lost the node because of this (a hardware problem that I cannot really predict or prevent*), I probably would not create a new one. I mean, it would take at least a year to get back to the original state (the same amount of data, etc.) and I would have no guarantee that the same problem would not happen again - maybe next time an HBA would fail or something (if I use a backplane without an expander).

*The only way to prevent this type of problem is to run the node in a cluster. The node software does not make that very easy (because it does not like NFS-mounted storage) and the cluster would require even more hard drives.

I have an old Dell that I have been running for 18 months or so now with little to no downtime, mostly just for upgrading hard drives. There is no need to have anything “server” grade to get decent uptime; the more hardware you have, the higher the chance of something failing. If my node dies tomorrow I really lose nothing, because it cost me nothing except the drives themselves. I would just plug the drives into another system and keep rolling.

But if the drives were connected to a server through RAID, this isn’t possible unless you had another RAID card with the same config; otherwise, everything is lost if the hardware dies and the drives don’t.

1 Like

Something can work reliably for years and still fail at the most inconvenient time. Your setup is not immune to that, and neither is mine. Yes, if my server dies but there are enough working drives remaining, I can move them to another server or set up some weird temporary solution.
What if your (or my) server failed while you (or I) were away on vacation?

This is why I think the 5 hour downtime limit is an unreasonable expectation from a home node operator (compared to a datacenter).

If a client wanted no more than 5 hours of downtime, I would set up a cluster so that the service would remain accessible even if a server failed.
Doing this for a Storj node is, IMO, a bit of overkill though. However, it would be the only way to ensure that a hardware failure does not take down the node.

1 Like

It’s not that I don’t agree with you; it’s just that you’re talking about spending tons more money on server-grade hardware to run in a cluster. It’s just not a realistic solution. Just don’t create such a large node in the first place, so you don’t have so much to lose if it does fail - that is the real solution.

I am just saying that a cluster would probably be the only way to guarantee the 5-hour downtime limit, which is why a 5-hour requirement is unrealistic.

As for running multiple nodes, well… If I run them all as VMs inside the same server, there is no difference if the hardware dies. If I set up a bunch of separate servers for separate nodes (I may have some space remaining in the rack), then I might as well set up a cluster and use it for other things as well.

Yeah, you’re completely right, a 5-hour downtime limit is not realistic for us regular Joes. It’s not like we all have large UPSs with backup internet in case the ISP decides to do some upgrades or something, or a line is cut in an uncontrollable outage. We’re not set up like datacenters would be.

But I did read about some of the downtime tracking that will be put into place; hopefully it’s not going to be a DQ if you pass the 5-hour downtime limit, just a suspended node, which you will get back once it comes back online.

Regarding downtime, maybe this explanation can shed a bit of light on uptime and its scoring:

2 Likes

I have a big UPS and a second line (no generator yet though).
However, I am not immune from hardware failure. Also, I do not have employees who would resolve various problems quickly. If I am away when something really bad happens, well, there is nobody to fix the problem.

Well then you’d better hire a friend to come by when you’re on vacation. Just kidding, of course… or am I…
But if you’re worried about hardware, you should do some upgrades to your current server, because I’m just guessing you’re running older server hardware.
Personally, I don’t run a node on my server, because I do not trust server hardware - it’s designed to fail, which is why all the hardware can be hot-swapped while the server is running. I’d rather run a node on a free PC I got off the side of the street.

You’re completely wrong here. Enterprise hardware, and servers in particular, are designed to keep running even if something fails. That’s why we have hot-swap power supplies, fans and drives, redundant network and HDD links, switch stacks, ECC memory, and even memory mirroring.
And then you have initial owners who run the three-year-long QA test and sell you hardware that survived it at a fraction of the list price.

Yeah, I have ordered a backplane that does not have an expander in it. Expanders can sometimes cause trouble and I have never been a fan of them, but sometimes I do not have that much money to spend on hardware and have to buy what’s available.

However, I have seen a new server mess up its system partition enough that it could not boot and had to be restored from a backup. So, while new hardware is more reliable than old hardware, there are no guarantees.

Server hardware is better than PC because it is designed to run 24/7 and be used in places where reliability is important. Stuff that can be replaced while the server is running (power supplies, hard drives, fans in some servers) is redundant and I would much rather have two PSUs (connected to separate UPSs) than one.
PCs, on the other hand, are designed to be used for part of the day and if a PC crashes or reboots, well, it’s annoying, but not that bad.

EDIT: Also, this is the first failure of this server since March 2019. Before yesterday, the hardware had always been reliable and any downtime was due to the time it takes to switch between uplinks.

Almost everything: if your RAID controller dies, you will be dead in the water. At least, my server doesn’t have a backup RAID controller in parallel, so if that dies, everything goes with it.

Let’s assume you do have 24/7 staff who can resolve problems. Perhaps they will not have a replacement motherboard or PSU on hand. Or if there’s a multi-disk failure, they may need to overnight one.

I don’t think home users should be expected to keep a higher SLA than the average datacenter. How long does it take a datacenter to replace a hard drive?

As far as the RasPi node goes, I’ll give an example: somehow, during a snow storm, the fiber cable to my house was broken. I normally have a very solid symmetric gigabit/gigabit fiber connection. It had been online over 600 days before this fiber break. It took them 4 days to fix it.

That can actually be done. I do not know about hardware RAID (which I don’t like and don’t use), but with software RAID you can have multiple HBAs connected to the backplane (multipath). The backplane obviously has to support that, but I’d rather have a “simple” backplane (no expander). With that, I could distribute the drives among multiple HBAs so that if an HBA dies, the array can continue in a degraded state.
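Not a real multipath setup, but as a rough illustration of the “spread the drives across HBAs” idea, here is a minimal sketch (assuming Linux; the grouping logic is illustrative only) that groups whole disks by the PCI address of the controller they sit behind, so you can check whether a software RAID’s members all depend on a single HBA:

```python
#!/usr/bin/env python3
# Minimal sketch (Linux assumed): group disks by the PCI address of the
# controller they hang off, using the /dev/disk/by-path symlinks.
# Useful for checking that a software RAID's members are spread across
# more than one HBA.
import os
import re
from collections import defaultdict

BY_PATH = "/dev/disk/by-path"
drives_per_hba = defaultdict(set)

for entry in os.listdir(BY_PATH):
    if "part" in entry:              # skip partition symlinks, keep whole disks
        continue
    # by-path names start with the PCI address of the controller, e.g.
    # "pci-0000:03:00.0-sas-...": take that prefix as the HBA identifier
    match = re.match(r"(pci-[0-9a-f:.]+)", entry)
    if not match:
        continue                     # e.g. usb-* entries
    device = os.path.realpath(os.path.join(BY_PATH, entry))  # e.g. /dev/sdb
    drives_per_hba[match.group(1)].add(device)

for hba, drives in sorted(drives_per_hba.items()):
    print(f"{hba}: {', '.join(sorted(drives))}")
```

With that mapping, an mdadm or ZFS array can be built from drives sitting on different controllers, so losing one HBA leaves the array degraded instead of offline.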

Yeah, but my server can’t use software RAID because it only has hardware RAID, and there are no ports on my motherboard to plug drives into, so even if I wanted to run software RAID I couldn’t. I probably should have invested in a server with more ports… But I don’t know how you could go from a hardware RAID to software as a backup; that seems like it would be a pain.