Put the node into an offline/suspended state when audits are failing

Pentium100 · August 16, 2021, 2:58pm

It’s important to me, no? If it wasn’t, I would not care if it was disqualified.

First, I have to figure out how to make the problem never happen again. I would not just be doing the same thing again (expecting different results :))

I was saying that before. Here’s how I look at things: figure out the requirements and then figure out how to meet them. One of the requirements (something I complained about a lot) was no more than 5 hours of downtime in 30 days, which also meant no more than 5 hours of continuous downtime. Also, the node sometimes would corrupt its database if shut down uncleanly.
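To put that old downtime limit in perspective, here is a quick back-of-the-envelope calculation of the uptime it implies. The function name is only for illustration; the 5 hours and 30 days are the numbers quoted above.

```python
HOURS_PER_DAY = 24


def required_uptime(allowed_downtime_hours: float, window_days: int) -> float:
    """Fraction of the window the node must be online."""
    window_hours = window_days * HOURS_PER_DAY
    return 1 - allowed_downtime_hours / window_hours


# 5 hours of allowed downtime in a 30-day (720-hour) window
uptime = required_uptime(5, 30)
print(f"{uptime:.2%}")  # prints 99.31%
```

That is roughly 99.3% availability, which is hard to guarantee on a single residential power feed and a single uplink.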

To meet that requirement you do need redundant power supplies, UPSs, a generator and at least two uplinks. Reason being that nobody is in a hurry to fix residential power or internet when it fails.

However, since that particular requirement was relaxed, I no longer think that UPSs, generator and two uplinks are required, since even residential power and internet would be fixed well within 30 days. It would be possible to even buy parts like a power supply if one breaks.

Then maybe nodes should not be disqualified in 4 hours with no notification and no chance to get them back when the OS freezes? Pause the node, make the node stop itself, put it in containment, whatever, just give the operator time to figure out and fix the problem.
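As a rough illustration of the "pause instead of disqualify" idea, here is a minimal sketch. Everything in it is hypothetical: the class name, the window size and the failure threshold are made up for the example and are not how the Storj node or satellite actually works.

```python
from collections import deque


class AuditWatchdog:
    """Hypothetical helper: tracks recent audit outcomes and decides
    whether the node should stay online or pause itself."""

    def __init__(self, window: int = 20, max_failures: int = 3):
        # Only the most recent `window` audit results are kept.
        self.results = deque(maxlen=window)
        self.max_failures = max_failures

    def record(self, success: bool) -> str:
        """Store one audit result and return the suggested state."""
        self.results.append(success)
        failures = sum(1 for ok in self.results if not ok)
        # Too many recent failures: pause and wait for the operator,
        # rather than keep failing audits until disqualification.
        return "suspend" if failures >= self.max_failures else "online"


watchdog = AuditWatchdog(window=5, max_failures=2)
for outcome in (True, False, False):
    state = watchdog.record(outcome)
print(state)  # prints suspend
```

The point of the sketch is only that a paused node stops accumulating failed audits, which gives the operator time to find and fix the actual problem.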

My problem with this is the disconnect between the recommendations and the requirements. The recommendations say one thing, but if I think how to meet the requirements, then I get something way above the recommended setup.

The worst thing about this is the required reaction time. Since I have RAID, the data should not be lost or corrupted unless multiple drives fail (or the node software deletes the wrong files), so I should be OK with leaving the server alone for a while. However, it looks like if a software problem or a certain type of hardware problem happens, I only have about 3 hours to react and fix it, otherwise my node would be disqualified without actually losing a single file.

Imagine this: your client wants you to set up and manage a server for him. Instead of buying proper hardware and hosting it in a datacenter somewhere, he gives you an old desktop with a single drive and says “it will be good enough, I won’t buy new hardware”. Well, it’s his choice, so you set it up, connect it in his office somewhere and the server runs OK. You monitor the server, apply updates to it, etc., and he pays you every month for the service. After a while, the hard drive crashes or something else happens. The client is angry and demands compensation for his loss - a 90% discount for the next year, then 65%, then 48%, then 12%, and after four years he’ll pay you the full price again.

Would you consider such a client to be reasonable?

Does that risk depend on me in any way? If not, I don’t care, free money is free money. If it depends on me, then I would rather make sure that the payments did not stop.

Also, if I buy $1000 of hardware, I’ll be able to use it for my needs as well or for something else if Storj as a project fails.