Tuning audit scoring

I don’t think this is quite accurate. It isn’t as straightforward as “alpha = number of successes, beta = number of failures”. A failure changes both alpha and beta, and so does a success. You are correct that giving a higher initial alpha is seeding a positive history, but within the model as we’ve been using it, (alpha+beta) converges up toward 1/(1-lambda) as time goes on. The closer alpha is to 1/(1-lambda), the stronger the reputation history. But when alpha or beta is already equal to or greater than 1/(1-lambda), I don’t know if it can still converge toward that point anymore. I should really just find all the numbers and graph it somewhere, but I’m out of time tonight. Maybe I can do that tomorrow.

I’m not sure that follows. 19.88 looks close to 20 on a linear scale, but that doesn’t mean there’s no relevance to the difference. The difference between 19.9 and 19.99 is somewhat like the difference between “2 nines” and “3 nines” of availability, because alpha should constantly get closer to (but never actually reach) 20 as the history grows. (In practice, it probably can reach 20, because floating point values can only handle so much precision :smiley:)

Good idea! It wouldn’t have to be raised all at once.

The writability check happens every 5 minutes by default, and the readability check (making sure the directory has been correctly initialized as a storj data directory) happens every 1 minute. The readability check is the one that would help in that situation, so that’s good.

A really good point. It’s been too long since I looked at the model assumptions for how long bad nodes can remain online. I’ll try to find that. (The person who would normally know all of that, and who wrote the beta reputation model adaptation for Storj, left recently, so we’re trying to keep up without them.)

Certainly true. And 0.999 is fine with me, as long as we make sure bad actors are DQ’d quickly enough (for whatever “quickly enough” ends up meaning).

It isn’t exactly, but it’s quite close. The only difference is that they also “forget” history. Let’s work with the current setting of lambda at 0.95 and a perfect long-term node, so alpha at 20 and beta at 0.

What essentially happens when updating both alpha and beta is that the formula “forgets” (1 - lambda), i.e. 5%, of the historic count, dropping it to 19, and adds on the new value of 1. The only reason it stays at 20 is because at that level alpha * (1 - lambda) = 1. So you basically forget 1 and add 1 at that point.

The same operation happens for beta, but with beta being 0 it ends up being 0 * lambda + 0. Which isn’t all that interesting.

So you could see alpha as the number of successes it remembers and beta as the number of failures it remembers. And both are always updated because every new signal makes it forget part of the older signals.

It will. Going by the same numbers used above: say alpha somehow got to be 100 and beta 0. Because it still forgets 5%, the next successful audit would drop alpha to 95 because of that forgetting factor and add 1 for the new success, leaving it at 96 in total. It would again converge around 20 after enough audits.

I used just alpha as an example for these because it is easier to see, but this would also apply to the total of alpha + beta if there is a mix of signals. Because it loses 5% of the total preexisting values and always adds 1 to one or the other.
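If you want to see that in numbers, here’s a rough sketch of the update as I described it (not the actual satellite code, just the same decay-and-add-one logic):

```go
package main

import "fmt"

// Sketch of the update as described above: decay both counters by lambda,
// then add 1 to alpha on a success or to beta on a failure.
func update(alpha, beta, lambda float64, success bool) (float64, float64) {
	alpha *= lambda
	beta *= lambda
	if success {
		alpha++
	} else {
		beta++
	}
	return alpha, beta
}

func main() {
	const lambda = 0.95
	// Start from the hypothetical alpha = 100, beta = 0 and feed only
	// successful audits: alpha drifts back down toward 1/(1-lambda) = 20.
	alpha, beta := 100.0, 0.0
	for i := 1; i <= 60; i++ {
		alpha, beta = update(alpha, beta, lambda, true)
		if i%20 == 0 {
			fmt.Printf("after %d audits: alpha = %.2f, beta = %.2f\n", i, alpha, beta)
		}
	}
}
```

Even though every single audit in that run is a success, alpha keeps dropping back toward 20.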

While you could technically say a node at 19.99 is a lot more reliable than a node at 19.88, it doesn’t really help you at all. This is because, using the current numbers, alpha will drop by 5% on a failure, which is essentially a drop of about 1 either way. Since that difference is so minute, the much more reliable node would probably still be disqualified after exactly as many failed audits as the less reliable one. Being close to 20 makes it a lot harder to go up even higher, but barely has an impact on how fast it will go down. (On the other end of the scale, when you’re near 0 it’s very hard to go further down, but easy to go back up.)
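To put that in numbers with the same kind of sketch: starting from alpha = 19.99 versus alpha = 19.88 (for the sake of the example I’m putting the rest of the total of 20 in beta) and counting consecutive failed audits until the score drops below the DQ threshold (0.6 currently, if I remember correctly), both nodes end up with the same count:

```go
package main

import "fmt"

// Consecutive failed audits until the score drops below the threshold,
// using the same decay-and-add-one update as in the sketch above.
func failuresToDQ(alpha, beta, lambda, threshold float64) int {
	n := 0
	for alpha/(alpha+beta) >= threshold {
		alpha *= lambda
		beta = beta*lambda + 1
		n++
	}
	return n
}

func main() {
	const lambda, threshold = 0.95, 0.6
	fmt.Println(failuresToDQ(19.99, 0.01, lambda, threshold)) // the “more reliable” node
	fmt.Println(failuresToDQ(19.88, 0.12, lambda, threshold)) // the “less reliable” node
}
```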

Awesome, well that should cover it for now as long as I don’t get 10x as much data. But since this node already holds 16TB, that would be a great problem to have actually. :grin:

Oof yeah, I know how that feels. Seems like you’re managing quite well though.

I remember reading about someone getting disqualified in 4 hours because the system froze and the node would initiate a transfer, but then time out.

Returning corrupt data, or returning “file not found” with a separate check that the drive is still plugged in, usually means the node actually lost data. However, a timeout does not really mean that. The node could be overloaded or frozen.

That would be nice, though I do not know if it would work in that particular instance. AFAIK, part of the kernel froze and anything trying to access the disk would just sit in D state forever. I don’t know if a timeout for that would work. If it would, that would be great, since IIRC that particular node was disqualified in a few hours for failing audits (4 audit timeouts → 1 failure, and in 4 hours the node got enough failures to be disqualified).

I am talking about this incident:

Yeah I do remember that. I actually ended up writing a blueprint after workshopping a possible solution for that with @Alexey in one of the topics where that happened. There’s also a suggestion for that here: Refining Audit Containment to Prevent Node Disqualification for Being Temporarily Unresponsive

I don’t think that can really be fixed by tuning the scoring alone though and the suggested changes shouldn’t really impact this (@thepaul’s numbers just drop the number of audits from 10 to 9, my suggested numbers would increase it, but not enough to actually fix it).

The suggestion I posted above could help fix that.
Alternatively, you could set a lambda value per node based on how much data it has. So that lambda = 1-(1/(number of audits for the node in past 48 hours)) with a minimum of 0.999 or one of the other suggested numbers. This would essentially translate it into a time based memory. But I don’t know if this data is easily available to the process that updates the scores.

This is correct as long as the repair service doesn’t step in. The audit job is doing random checks. The repair service is not. A storage node could get disqualified for losing 2% of the data. It just needs to be the oldest data, which gets repaired first. The satellite will see a high number of failed repairs and assume that the storage node has lost much more than 2% across its dataset.

One additional risk with the repair service is that multiple repair jobs can start at the same time. Let’s say 100 repair jobs start at the same time. It is unlikely that they all hit your node, but it is possible. Now the 2% data loss will fail right at the beginning, while the 98% of successful repairs have not completed yet. Successful repair jobs will take several minutes. I am concerned that this offset will get storage nodes disqualified. The audit job doesn’t have this problem. Anyway, that might be a different topic for a different day. I just want to mention it here. The repair job will affect your calculations, and most likely not in a positive way, because the repair job doesn’t come with the great behavior the audit job has.

The timeout is still missing. Storage Nodes will get disqualified if the hard drive is just freezing. I had that situation on my system as well. The writability check will not notice it. The hard drive is still there. You can try to write a file on it. That operation will start but never finish and so the writability check starts but never fails.

We have written some unit tests here to simulate that situation: https://github.com/storj/storj/pull/4183
Note: The PR description is maybe not the best. The idea was that we had been waiting for someone to fix the code, and to make that easier we provided some unit tests upfront. So far we haven’t found anyone to fix the code.

So this would be a problem if the node disproportionately lost older data vs newer data. Which isn’t that unlikely as the node has had more time to lose it. This would probably warrant adding some extra slack into the threshold. I don’t think it’s necessarily a major problem.

That’s a good point. Though I would say this is an already existing risk. And raising the lambda value will actually smooth out those temporary bumps more and mitigate some of that risk.

I think both issues you raise warrant being careful about when to actually raise the threshold and to what level. Would you consider them blocking for the proposed changes?

Not at all. As you said, that problem is in production already and the current proposal doesn’t make it worse, so we can move forward. It was more a reminder about possible issues we can look into next.

We don’t need to reset the scores as long as we bump the lambda.

I would suggest using an initial alpha of 1/(1-lambda) and beta of 0. It’s a safe choice at any time, as this is the convergence value of alpha + beta.

This would make the full calculation more predictable. The current initial values (1 and 0) make the initial failures slightly more serious.

For example: one failure after the initial values (1, 0) would cause a 1 -> 0.487 reputation drop, while for a long-running node (or with (20, 0) initial values) it would be the desired 1 -> 0.95.
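A quick sketch to double-check those numbers, using the decay-and-add-one update described earlier in the thread:

```go
package main

import "fmt"

// Score after one failed audit: decay both counters by lambda, add 1 to beta.
func scoreAfterOneFailure(alpha, beta, lambda float64) float64 {
	alpha *= lambda
	beta = beta*lambda + 1
	return alpha / (alpha + beta)
}

func main() {
	const lambda = 0.95
	fmt.Printf("%.3f\n", scoreAfterOneFailure(1, 0, lambda))  // current initial values (1, 0): ~0.487
	fmt.Printf("%.2f\n", scoreAfterOneFailure(20, 0, lambda)) // alpha = 1/(1-lambda): 0.95
}
```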

One more note: Assuming the node is a long-running node (therefore alpha + beta is already close to the convergence number), the calculation can be simplified as:

  1. in case of a failure, multiply the score (not alpha) by lambda
  2. in case of a success, multiply the difference between the score and 1 by lambda

This is exactly the same as what we do (small differences apply until alpha + beta has converged enough).

So there are two big consequences:

Increasing lambda will decrease the speed of healing the reputation score. As the gap between the score and 1 shrinks by a factor of lambda on each success, a lower reputation score heals faster and a reputation score closer to 1 heals slower (and we will restrict the playing area to that region by increasing the DQ threshold).

The other interesting consequence of the current math is the following: the acceptable percentage of data loss doesn’t depend on lambda; it depends on the reputation score. A lower reputation score can tolerate higher data loss. Again: the proposed solution restricts the playing area by using only a small interval of the 0-1 range.

For example: try to calculate the reputation from 0.65 with one failure and two successes repeating.

Overall: I strongly agree that we should adjust the overall approach, not only the numbers. I think we should differentiate between the two hard DQ cases: wrong data and not-found data. While wrong data poisons the network and is very dangerous, not-found data is something which may be recoverable.

I had to read that second line a few times to get it, but I think you mean this?

  1. When failure: score = score * lambda
  2. When success: score = 1 - ( (1 - score) * lambda )

Which indeed would do exactly the same thing as the current beta reputation formula IF alpha is initialized at 1 / ( 1 - lambda). I don’t have any real objection to doing that, but that is like giving new nodes essentially a perfect long term reputation and adjusting from there.
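Here’s a throwaway sketch to check that numerically: feed the same audit results into both the alpha/beta form (starting at alpha = 1/(1-lambda), beta = 0) and your score-only form, and the two scores stay identical apart from floating point noise:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

func main() {
	const lambda = 0.95

	alpha, beta := 1/(1-lambda), 0.0 // beta reputation form, seeded at the convergence value
	score := 1.0                     // simplified score-only form

	maxDiff := 0.0
	for i := 0; i < 1000; i++ {
		success := rand.Float64() < 0.9 // arbitrary 90% success rate
		alpha *= lambda
		beta *= lambda
		if success {
			alpha++
			score = 1 - (1-score)*lambda
		} else {
			beta++
			score *= lambda
		}
		if d := math.Abs(alpha/(alpha+beta) - score); d > maxDiff {
			maxDiff = d
		}
	}
	fmt.Printf("largest difference between the two forms: %g\n", maxDiff)
}
```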

I think I’m missing the point of this part, partially because the sentence in bold seems to be missing words or having too many.
The drop in score on a failure is smaller for lower scores. But that doesn’t mean that higher scores lead to disqualification quicker, since the drop is always relative to the score. With current settings you lose 5% of your score on every failure. So a score of 1 drops by 0.05 and a score of 0.9 drops by 0.045.

However, if I understand correctly, this sentence:

Implies that higher scores require fewer failed audits to get disqualified. But this isn’t the case: for any score x there is no higher score y for which y * 0.95 ends up below x * 0.95, so the higher score never reaches the disqualification threshold in fewer failures.

The formula as used now and also your version of it would simply converge around the actual data loss percentage. (ignoring other effects like temporary issues or the other issues @littleskunk outlined)

Did that here. It starts hopping around 66.66% as expected with that pattern. But I’m not sure what you intended to point out with this example.
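For reference, this is roughly the calculation I ran, using your simplified score-only update:

```go
package main

import "fmt"

func main() {
	const lambda = 0.95
	score := 0.65
	// One failure followed by two successes, repeating.
	for cycle := 1; cycle <= 30; cycle++ {
		score *= lambda // failure
		afterFailure := score
		score = 1 - (1-score)*lambda // success
		score = 1 - (1-score)*lambda // success
		if cycle%10 == 0 {
			fmt.Printf("cycle %d: %.4f after the failure, %.4f after the successes\n",
				cycle, afterFailure, score)
		}
	}
}
```

It settles into a hop between roughly 0.65 and 0.68, i.e. around the 2/3 success rate of the pattern.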

That may be an option, but I think you need to be very careful about not giving a wrong incentive that it’s better to delete data if you’re not sure whether it’s corrupt or not. I tend to look to @littleskunk’s advice when it comes to chances for abuse, and I think differentiating between those two scenarios might just open up a few.

From a customer perspective, there is no difference. Uplink and repair job will identify the corrupted pieces and use only the remaining healthy pieces for their operations.

For the audit job, it makes a difference. Wrong data can still pass an audit because we just take a small stripe out of the file, and if that small part of the file is still correct, the node would pass the audit even though the piece has been corrupted in a different location. Does that matter? I would say no, because in the first place it is very unlikely that the file gets audited. That means the consequences are the same as for a missing piece, which is also unlikely to get audited. The repair job will automatically replace corrupted or missing pieces.

In this situation, I would argue that we only have to look at the biggest loophole that has the lowest chance to get you disqualified. That loophole might get abused. We can add as many smaller loopholes as we want. They will only get abused if they have an even lower chance to get disqualified. And that one loophole we have is so big and has such a low chance to get you disqualified that it will be almost impossible to beat that.

Sure, if we think about closing all loopholes long term, it would still be better not to add more and more loopholes. That only means more work to fix all of them later on. The point I am trying to make is that at some point the benefit you get from a loophole is too small to avoid disqualification. I would call it a small advantage but not a loophole anymore. The big question is where we draw the line. I am unable to see that line because there is that giant loophole that forces me to orbit early on :smiley:

Ominous :sweat_smile:
But if you are referring to what I think you are referring to, I think you could do a lot worse. That said, I take your point. That should be the first thing to be tackled for sure. I guess I was trying to lock the windows while leaving the door unlocked. :smiley:

I don’t have any data specific to SNOs, and disk hardware has advanced since I was involved closely with the hardware of massive storage operations, but I have been there, and I don’t recall file age being a particularly strong predictor of data damage. And if data loss is not strongly correlated with age, then repair traffic being correlated with age doesn’t have an effect on reputation.

But of course, I could easily be wrong about that supposition. Do we have any data indicating one way or the other?

Darn. I think I can do that when I have extra cycles. Could you @ me on the PR?

If the kernel froze, the node would appear to be offline (it wouldn’t be able to respond to any network traffic at all), so it should have been suspended, not disqualified, right? I must be missing something there.

That’s… a very good point. Ok.

If you want to fight that battle, you should really start by writing a paper like Dr. Ben’s justifying mathematically why your model is better.

Regarding wrong-data vs not-found data: I’m not sure it makes sense to penalize wrong-data more harshly than not-found data. We need data to be available, and we need node operators to be able to keep their data intact and online to a reasonable degree. Bitrot does happen, and with our file sizes we should expect it to affect file contents with a higher probability than directory contents. So if anything, we should treat not-found data more harshly than wrong-data. If there are any particular temporary situations you are worried about that would result in data not being found, let’s find a way to detect that and shut the node down until it can be fixed.

I’ll get back to you all when I have some hard numbers regarding acceptable percentage of data loss and how long we can allow bad nodes to remain online.

In my situation, my storage node was still accepting incoming connections and even responding to them, but anything that touched the hard drive would stall and never finish.

I, too, can attest that when my computer froze a few times my node was DQed on a satellite and not suspended. My mouse and keyboard showed no activity (no mouse pointer movement and no changing of LEDs on keyboard). There was network activity though as the LEDs on the NIC were blinking. I would have expected suspension to be able to fix the issue but never got a chance as the DQ was fast.

It is possible for a computer to stop responding to user input events while still responding to network events. You could still call it a “freeze”, but not a “kernel freeze”. I can only surmise that is what happened here. Adding a timeout to the readability check should help a lot with this sort of situation.

I understand this is off topic, but could you please tell me how to distinguish a “freeze” from a “kernel freeze”?

Sorry, I may have used a wrong term for this.

There have been instances on other servers where the disk IO system froze (some zfs bug). Any process that tried to access data on the disk would just stay in D state forever (with dmesg getting some “process has been blocked for more than …” messages). I do not know if it is possible to have a timeout on that (it’s a bit too deep for me; some other thread, maybe?).
The drives themselves were fine; I could access them with dd or whatever, and after a reboot everything went back to normal.

This is a bit difficult to catch with a script. If the script is not cached in RAM, then the system would try reading the disk and freeze, never starting it.

Something similar happened in the thread I linked to. A node was initiating transfers but never completing them (until a reboot) and was disqualified in 4 hours.

Oh, sure. “Kernel freeze” isn’t really a well-defined term of art, but I expect it could only refer to what is called a “kernel panic” in the Unix world or a “stop error” (BSoD) in Windows land. Typically in these cases you’d get stack trace or memory dump information on the console and the system would stop doing anything else until rebooted.
