Tuning audit scoring

So, I’ve been thinking. Given that:

  1. The node volunteers the “file does not exist” info to begin with, and the satellite can’t reliably tell known failures from unknown ones if “creative” SNOs don’t want it to
  2. Suspension already takes care of protecting data on the node by marking pieces unhealthy and queueing segments below the repair threshold for repair
  3. Raising the lambda stabilizes the score around the actual percentage of missing data / failed audits
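To illustrate point 3, here is a minimal sketch. It assumes the score follows a beta-reputation style update with a forgetting factor lambda, where score = alpha / (alpha + beta). The weight of 1, the starting values, the 2% failure rate, and the two lambda values are assumptions picked for illustration, not the satellite's actual settings.

```python
import random
import statistics


def update(alpha, beta, success, lam, weight=1.0):
    """One beta-reputation style update (assumed form) for a single audit."""
    v = 1.0 if success else -1.0
    alpha = lam * alpha + weight * (1.0 + v) / 2.0
    beta = lam * beta + weight * (1.0 - v) / 2.0
    return alpha, beta


def score_trace(failure_rate, lam, audits=20000, burn_in=5000, seed=1):
    """Scores after each audit for a node failing audits at `failure_rate`."""
    rng = random.Random(seed)
    # Start from a settled, perfect score (hypothetical starting point).
    alpha, beta = 1.0 / (1.0 - lam), 0.0
    trace = []
    for i in range(burn_in + audits):
        success = rng.random() >= failure_rate
        alpha, beta = update(alpha, beta, success, lam)
        if i >= burn_in:  # skip the initial transient
            trace.append(alpha / (alpha + beta))
    return trace


# Compare how much the score wanders for a low and a high lambda.
for lam in (0.95, 0.999):
    trace = score_trace(0.02, lam)
    print(lam, round(statistics.mean(trace), 4), round(statistics.pstdev(trace), 4))
```

In both cases the score hovers around 1 minus the failure rate, but with the higher lambda each audit moves it less, so the spread around that value is smaller.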

Why differentiate between known and unknown audit failures at all? Why not recombine them into one score with a high threshold, like 97%, and have any type of audit failure hit that single audit score? When a node drops below the threshold, it gets suspended and gets a grace period (say a month) to fix issues and recover the score. After that, a monitoring period starts (of say a week). If the node drops below the threshold during that week, it is disqualified permanently.
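The lifecycle described above can be sketched as a small state machine. This is only one reading of the proposal: the once-a-day check, the fixed-length grace period counted from the first drop below the threshold, and all names (`State`, `Policy`, `step`, `run`) are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    GRACE = "grace"              # grace period; node is suspended while below threshold
    MONITORING = "monitoring"
    DISQUALIFIED = "disqualified"


@dataclass(frozen=True)
class Policy:
    threshold: float = 0.97      # single combined audit score threshold
    grace_days: int = 30         # "say a month"
    monitoring_days: int = 7     # "say a week"


def step(state, days, score, policy):
    """Advance the node by one day, given its audit score at the end of that day."""
    below = score < policy.threshold
    if state is State.DISQUALIFIED:
        return state, days       # disqualification is permanent
    if state is State.HEALTHY:
        return (State.GRACE, 0) if below else (State.HEALTHY, 0)
    if state is State.GRACE:
        days += 1
        if days < policy.grace_days:
            return State.GRACE, days
        # Grace period over: still below means out, otherwise start monitoring.
        return (State.DISQUALIFIED, 0) if below else (State.MONITORING, 0)
    # MONITORING: any drop below the threshold disqualifies.
    if below:
        return State.DISQUALIFIED, 0
    days += 1
    if days >= policy.monitoring_days:
        return State.HEALTHY, 0
    return State.MONITORING, days


def run(scores, policy=Policy()):
    """Feed a sequence of daily scores and return the final state."""
    state, days = State.HEALTHY, 0
    for score in scores:
        state, days = step(state, days, score, policy)
    return state
```

For example, `run([0.90] + [0.99] * 37)` models a node that dips once, recovers during the grace period, and passes monitoring, while `run([0.90] * 31)` models a node that never recovers.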

Based on some early simulations I ran, this would result in roughly the following:

  • Nodes with actual data loss of 4% or higher won’t be able to get out of suspension to begin with and will be disqualified after a month. In the meantime they are suspended and repair has already kicked in, so delaying the permanent disqualification causes no additional harm.
  • Nodes between 2% and 4% data loss may go in and out of suspension during the grace period, but will likely still be disqualified during the monitoring period.
  • Nodes between about 1.6% and 2% file loss may or may not survive the monitoring period. It depends on luck.
  • Nodes with at most 1.5% file loss that got suspended because of temporary issues get a chance to recover and will survive the monitoring period if the issues are fixed in time.
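For anyone who wants to experiment with numbers like these, here is a rough Monte Carlo sketch. It is not the simulation behind the figures above and is not expected to reproduce them: the lambda, the audits per day, the daily threshold check, and the function name `disqualification_rate` are all assumptions.

```python
import random


def disqualification_rate(loss, lam=0.999, audits_per_day=100, threshold=0.97,
                          grace_days=30, monitoring_days=7, days=60,
                          trials=50, seed=7):
    """Fraction of simulated nodes disqualified within `days` (all defaults hypothetical)."""
    rng = random.Random(seed)
    disqualified = 0
    for _ in range(trials):
        # Start from a settled, perfect score.
        alpha, beta = 1.0 / (1.0 - lam), 0.0
        phase, timer = "healthy", 0
        for _ in range(days):
            # Each audit fails with probability `loss`.
            for _ in range(audits_per_day):
                ok = rng.random() >= loss
                alpha = lam * alpha + (1.0 if ok else 0.0)
                beta = lam * beta + (0.0 if ok else 1.0)
            below = alpha / (alpha + beta) < threshold
            if phase == "healthy":
                if below:
                    phase, timer = "grace", 0
            elif phase == "grace":
                timer += 1
                if timer >= grace_days:
                    phase, timer = ("disqualified", 0) if below else ("monitoring", 0)
            else:  # monitoring
                if below:
                    phase = "disqualified"
                else:
                    timer += 1
                    if timer >= monitoring_days:
                        phase, timer = "healthy", 0
            if phase == "disqualified":
                disqualified += 1
                break
    return disqualified / trials


for loss in (0.01, 0.02, 0.04):
    print(loss, disqualification_rate(loss))
```

The outcome is sensitive to lambda and to how many audits a node receives per day, so these parameters would need to match the real network before reading anything into the results.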

Possible downsides

  • Nodes that encounter temporary issues again during the monitoring period will get disqualified.
  • Node operators who just want to see the world burn could block access to data during the grace period, allow access again shortly before the monitoring period, and remove access again once monitoring is done. However, if they do, they will spend most of the time in suspension, not getting any new data and losing data to repair. It wouldn’t really do damage other than perhaps a small impact on repair costs. And there is no upside to doing this, as it requires you to store all the data anyway.

Note: This isn’t entirely fleshed out yet and probably needs some refinement.
