Audit system improvement idea

IMO this is a bit of a problem for new users. For someone who understands the trade-offs, it makes Storj look inconsistent (recommending one thing while requiring another), but a new user might take it at face value, thinking “yep, I can meet those requirements with the recommended setup”. At the very least there should be a disclaimer along the lines of: “while the minimal setup does not strictly meet the requirements, it is still a good way to run a node; even though you will have to start over more often than with RAID, you will still make more money on average”. Then a new user would be making an informed decision between running the minimal setup and a more expensive one with RAID, a redundant internet connection, etc.

I can think of two scenarios here:

There is a class of transient errors (i.e., data errors that are not permanent) that, at the scale of a few terabytes, are bound to happen regardless of hardware unless you run a more expensive node with RAID and ECC memory. Consumer drives are rated at an unrecoverable bit error rate of 10⁻¹⁴ (i.e., on average one lost bit per 10¹⁴ bits, roughly every 12 TB read). Some studies (see e.g. “DRAM Errors in the Wild: A Large-Scale Field Study” by Google folks) suggest non-ECC memory sees error rates around 10⁻¹¹ (i.e., on average one flipped bit per roughly 12 GB). A single lost bit will probably invalidate a whole piece of data. In my opinion, this rate of errors should not disqualify a node if Storj expects home users to participate, and I assume this is currently handled by allowing a few failed audits before disqualification.
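To put rough numbers on this, here is a back-of-the-envelope sketch in Go (the node size, piece size, and rates are illustrative assumptions, not measurements):

```go
package main

import "fmt"

func main() {
	// Assumed rates from the figures quoted above.
	diskBER := 1e-14 // unrecoverable bit errors per bit read (consumer HDD spec)
	ramBER := 1e-11  // bit flips per bit, rough figure for non-ECC DRAM

	nodeBytes := 4e12         // a hypothetical 4 TB node
	nodeBits := nodeBytes * 8 // bits read in one full pass over the data
	pieceBytes := 2e6         // assumed ~2 MB piece; one flipped bit invalidates the whole piece

	fmt.Printf("expected disk read errors per full pass: %.2f\n", nodeBits*diskBER)
	fmt.Printf("expected RAM bit flips per full pass:    %.0f\n", nodeBits*ramBER)
	fmt.Printf("pieces on the node: %.0f, so a handful of bad pieces is a tiny fraction\n",
		nodeBytes/pieceBytes)
}
```

Even a handful of corrupted pieces out of a couple of million should not, by itself, make an operator look unreliable.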

Another scenario is when an SNO knows of a disk failure and manages to migrate the data that is still correct to a healthy drive. The SNO may lose, let’s say, a few percent of the data, but everything else might still be correct. I’d love to see Storj handle this situation: by manually running an operation, let’s call it scrubbing, after this kind of recovery, the node software would go over the storage directory, note which pieces are missing, and have just those repaired (a sketch of the idea follows).
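This is not the real storagenode code, just a minimal sketch of what such a scrub could look like, assuming pieces are stored one file per piece ID under a storage directory (all names here are hypothetical):

```go
// Sketch of the proposed "scrub" operation: compare the list of pieces
// the node is supposed to hold against what is actually on disk, and
// report the missing ones so the satellite can repair just those.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// scrub returns the IDs of expected pieces that are missing on disk.
// Assumed layout: one file per piece, named by its piece ID.
func scrub(storageDir string, expectedPieces []string) (missing []string) {
	for _, id := range expectedPieces {
		if _, err := os.Stat(filepath.Join(storageDir, id)); os.IsNotExist(err) {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// Hypothetical usage: the expected list would come from the satellite
	// or a local database; here it is hard-coded for illustration.
	missing := scrub("/mnt/storagenode/storage", []string{"piece-a", "piece-b"})
	fmt.Printf("%d pieces missing, would report them for repair\n", len(missing))
}
```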

Or restoring yesterday’s backup after some problem with the database (though I see fewer topics about corrupt databases now, so I guess Storj managed to make them more reliable) or some other problem. Though I guess this would not punish the operators enough; better to recommend a simple setup and punish the operator when said setup fails to deliver datacenter-like reliability.

Thanks for bringing the discussion back on topic; that’s exactly what I was trying to say in the OP.

I like this idea. However, repair does cost money, and I agree with what was said earlier: it is not always in the satellite’s interest to make this investment. On the other hand, node operators would probably be willing to spend a little money to make their nodes survive. I’m not sure what the best way of implementing this would be, though.

Well, if I understand correctly, the choice is between spending money equivalent to repairing 5% of the data now, or spending money equivalent to repairing 100% of the data later, because the whole node gets disqualified.
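As a rough illustration of that asymmetry (the node size and per-terabyte repair cost are made-up numbers):

```go
package main

import "fmt"

func main() {
	const nodeTB = 10.0          // hypothetical node size
	const repairCostPerTB = 10.0 // made-up $/TB; the real figure is internal to Storj

	partial := 0.05 * nodeTB * repairCostPerTB // repair the 5% the node lost
	full := 1.00 * nodeTB * repairCostPerTB    // repair everything after disqualification

	fmt.Printf("partial repair: $%.2f vs full repair after DQ: $%.2f\n", partial, full)
}
```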

True, but as a satellite you do not know whether this is a one-time repair of that 5% or whether the node will keep sending all the data to /dev/null.

Then you might say “only allow a limited amount of data to be repaired every month” (sketched below), which was my original suggestion (except deleting instead of repairing, which has no cost to the satellite!).
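A minimal sketch of what such a monthly cap could look like on the satellite side (names and limits are assumptions, nothing from the actual codebase):

```go
package main

import "fmt"

// monthlyRepairBudget tracks how much data may be repaired (or deleted)
// on a single node's behalf per month before further requests are
// refused. Purely illustrative.
type monthlyRepairBudget struct {
	limitBytes int64
	usedBytes  int64
}

func (b *monthlyRepairBudget) allow(sizeBytes int64) bool {
	if b.usedBytes+sizeBytes > b.limitBytes {
		return false // over the allowance; treat further loss as a reliability signal
	}
	b.usedBytes += sizeBytes
	return true
}

func main() {
	// Hypothetical: allow up to 1% of a 10 TB node per month.
	budget := monthlyRepairBudget{limitBytes: 100e9}
	fmt.Println(budget.allow(60e9)) // true
	fmt.Println(budget.allow(60e9)) // false, exceeds the monthly cap
}
```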

Deleting should have exactly the same cost to the satellite as repairing, because the satellite then needs to recreate that lost piece elsewhere to maintain the redundancy level. The only benefit is that the satellite may delay this cost.
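To make the deferred-cost point concrete, here is an illustration with made-up k-of-n erasure coding numbers (not Storj’s actual RS parameters):

```go
package main

import "fmt"

func main() {
	// Illustrative k-of-n erasure coding numbers: any k pieces
	// reconstruct the segment, and repair is triggered once the
	// healthy piece count drops to a threshold.
	const k = 29               // pieces needed to reconstruct a segment
	const repairThreshold = 35 // repair is triggered at or below this
	const stored = 50          // healthy pieces the segment currently has

	// Every piece a node deletes (or loses) eats into this margin, so
	// the repair cost is deferred, not avoided.
	fmt.Printf("need %d of the pieces; margin before a forced repair: %d\n",
		k, stored-repairThreshold)
}
```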