To add a bit of color around the guidance of one node per hard drive:
The most common way pieces are lost in our system is nodes going offline. This is far and away the most common problem; disk or hardware failure is much less common by comparison.
When a disk or hardware does fail, it’s safe to assume that all of the data on that disk is at risk. Maybe not lost, but certainly now suspect.
Our auditing system works by doing spot checks. If a spot check fails because the node simply isn’t online, that’s the common case, and it is handled differently than a spot check that fails because the node is online but has a drive error or returns incorrect data. We expect incorrect data (bit flips) to be exceedingly rare, and drive errors to also be pretty rare relative to nodes going offline. That’s why audits can disqualify a node so quickly - if a node is online but starts failing audits with drive errors or bad data, we assume the hard drive is going bad, and as an overall system we begin the process of repairing the data to other drives.
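To make the distinction concrete, here is a minimal sketch of how the spot-check outcomes described above could be told apart. This is not the actual satellite code: the names `AuditOutcome`, `classify_spot_check`, and `suggests_failing_drive` are made up for illustration, and the real audit logic has more cases and thresholds than this.

```python
from enum import Enum


class AuditOutcome(Enum):
    """Hypothetical outcome labels for a single spot check."""
    SUCCESS = "success"
    OFFLINE = "offline"          # node could not be reached at all
    DRIVE_ERROR = "drive_error"  # node answered, but could not read the piece
    BAD_DATA = "bad_data"        # node answered, but the data did not verify


def classify_spot_check(node_online: bool, read_ok: bool, data_matches: bool) -> AuditOutcome:
    """Classify one spot check, checking reachability first."""
    if not node_online:
        return AuditOutcome.OFFLINE
    if not read_ok:
        return AuditOutcome.DRIVE_ERROR
    if not data_matches:
        return AuditOutcome.BAD_DATA
    return AuditOutcome.SUCCESS


def suggests_failing_drive(outcome: AuditOutcome) -> bool:
    """Only failures from an online node point at the drive itself."""
    return outcome in (AuditOutcome.DRIVE_ERROR, AuditOutcome.BAD_DATA)


# An offline node is the common case and is not treated as a bad drive.
print(suggests_failing_drive(classify_spot_check(False, True, True)))
# An online node returning a drive error is treated as a bad drive.
print(suggests_failing_drive(classify_spot_check(True, False, True)))
```

Note that the outcome is attached to the node, not to a particular drive behind it, so nothing in this sketch can express "half of this node is still fine".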
If you put more than one drive on a node, our system doesn’t have a way of understanding that half the data might still be good. The node is the only unit it can reason about, so when one drive goes bad, everything on that node has to be assumed to be at risk.
Increasing an individual node’s reliability (with RAID, for example) doesn’t make a lot of sense in expected-value terms, because the system as a whole already has redundancy built in. RAID takes more resources to run and doesn’t actually make the overall system any more reliable. It’s better to run more nodes than to run fewer but more reliable nodes.
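As a rough back-of-the-envelope illustration of the expected-value point, compare two drives run as two separate nodes against the same two drives mirrored behind one node. Everything here is an assumption for the sake of the example: drives fail independently, each with the same made-up probability `p_fail` over some period, and a mirror only loses data when every drive in it fails. The function names are hypothetical too.

```python
def expected_capacity_separate_nodes(drive_capacity: float, p_fail: float, drives: int = 2) -> float:
    """Expected surviving capacity when each drive backs its own node."""
    return drives * drive_capacity * (1.0 - p_fail)


def expected_capacity_mirrored_node(drive_capacity: float, p_fail: float, drives: int = 2) -> float:
    """Expected surviving capacity when all drives mirror one node's data.

    Usable capacity is a single drive's worth, lost only if every drive fails.
    """
    return drive_capacity * (1.0 - p_fail ** drives)


# Hypothetical numbers: 1 unit of capacity per drive, 5% failure chance each.
separate = expected_capacity_separate_nodes(1.0, 0.05)
mirrored = expected_capacity_mirrored_node(1.0, 0.05)
print(f"separate nodes: {separate:.4f}, mirrored node: {mirrored:.4f}")
```

With those made-up numbers, the two separate nodes are expected to keep roughly 1.9 units of data online versus roughly 1.0 for the mirrored node. Under these assumptions the separate nodes always come out ahead for any failure probability below 1, since 2(1 - p) is greater than (1 - p)(1 + p).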
Having more nodes per drive isn’t as bad as having more drives per node, but it does give you additional overhead you could have avoided, and it could reduce the overall throughput of the system (each node gets fewer dedicated IOPS, and the disk does more seeking/thrashing). I see that it reduces some disqualification risk if you’re able to save half of the data, and that’s fine. Really, the guidance here is meant to direct people towards a good rule of thumb (one hard drive per node) so that storage node operators (SNOs) know what the designers of Storj are targeting and assuming.
Hopefully, as we continue to optimize the storage node software, it won’t be as CPU-heavy, and we can reduce the CPU and RAM requirements per node somewhat.