I think that Storj sometimes forgets that node operators are regular people, some of them even running their nodes on the “recommended hardware” (a Raspberry Pi with a single HDD connected over USB), and not companies operating servers in a Tier 3 datacenter. This disconnect between recommendations and requirements is something I noticed a long time ago and have commented on in the past.
The biggest difference between the two is not the hardware and not the reliability (as seen, even an RPi can be pretty reliable), but the response time, especially since the node software does not support running in a cluster or using HA.
If I rent a server in a proper datacenter, I can pay additional money for 24/7 support. They will have an employee nearby in case of an emergency, and problems will be solved within an hour at most (or whatever the contract says). It is, IMO, very unreasonable to ask that of a regular person who does not have employees, probably spends a lot of time every day away from the servers (at work), and even goes on vacation. In case of any problem with the node, the operator should be given sufficient time to fix it before the node is permanently disqualified.
So far, of the failure modes, some are handled, IMO, correctly:
- An offline node is allowed sufficient time to return (unlike the previous, unreasonable requirement of no more than 5 hours of downtime).
- In case of “unknown” audits, the node is suspended and, it looks like, allowed enough time to fix whatever the problem is.
- In case the hard drive disappears (USB connector fell out or something), the node shuts down and refuses to start.
So, now we have regular audits. As I understand it, they can fail in three different modes: “file not found”, corrupted data, and a timeout. None of these, especially the timeout, definitely proves that the node has lost data. It is very likely that “file not found” is because the data is lost, but it could also be that the USB cable fell out, or the file was deleted by the network 0.1s ago, or the file had an expiration date. Corrupted data most likely is because the hard drive is failing, but it could also be due to some other reason.
While the previous two have a high chance of being caused by lost data, a timeout does not mean that. The node could be overloaded or something. Yes, it is a problem and the operator should fix it, but not in 4 hours.
Here’s what I think should happen:
- Satellite issues an audit request to the node.
- Node tries to read the file and encounters an error it can detect (file not found) or an error it cannot detect (corrupted data or timeout).
- If the error is of a type that the node cannot detect, the satellite informs it that there was a problem.
- If two problems happen in a row, the node marks this in the database or creates some file to mark it and shuts down.
- The node refuses to start, unless the mark is (manually) removed - this is to prevent auto-restart scripts from just restarting the node.
- Node operator gets an email informing him of this problem (that his node is offline).
The node operator can then fix the problem and restart the node; hopefully there are no more problems. If there are, the node will fail more audits, shut down again, and eventually be disqualified.
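To make the proposal a bit more concrete, here is a rough sketch of the node-side logic in Python. Everything in it is made up for illustration: `NodeSketch`, `AuditResult`, the marker file and `notify_operator` are hypothetical names, not the real storagenode code or API, and the "email" is just a list entry. It only shows the idea of "two failures in a row, then stop and refuse to start until the operator removes the mark".

```python
import os
from enum import Enum


class AuditResult(Enum):
    OK = "ok"
    FILE_NOT_FOUND = "file not found"  # the node can detect this itself
    CORRUPTED = "corrupted data"       # reported back by the satellite
    TIMEOUT = "timeout"                # reported back by the satellite


class NodeSketch:
    """Hypothetical node-side guard, not the real storagenode implementation."""

    MAX_CONSECUTIVE_FAILURES = 2  # "two problems in a row"

    def __init__(self, marker_path):
        self.marker_path = marker_path
        self.consecutive_failures = 0
        self.running = False
        self.notifications = []  # stand-in for emails sent to the operator

    def start(self):
        # Refuse to start while the marker exists, so an auto-restart
        # script cannot simply bring the node back up.
        if os.path.exists(self.marker_path):
            return False
        self.consecutive_failures = 0
        self.running = True
        return True

    def record_audit(self, result):
        # Called for every audit outcome, whether the node noticed the
        # problem itself or the satellite informed it afterwards.
        if result is AuditResult.OK:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES:
            self.shut_down(result)

    def shut_down(self, result):
        # Leave a mark on disk that has to be removed manually.
        with open(self.marker_path, "w") as marker:
            marker.write(result.value)
        self.running = False
        self.notify_operator(result)

    def notify_operator(self, result):
        # Placeholder: a real node would send an email here.
        self.notifications.append(
            f"Node stopped after repeated audit failures (last: {result.value})"
        )
```

After the operator fixes the problem and deletes the marker file, `start()` succeeds again and the failure counter begins from zero. Whether a plain file or a database entry is the better mark, and whether "two in a row" is the right threshold, are open choices in this sketch.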