Design Draft: Auditing 2.0

nat · August 7, 2019, 4:53pm

Following a discussion today about a new method of audit selection across nodes, the pull request with the Auditing 2.0 design doc has been updated! Feel free to check it out or leave feedback: https://github.com/storj/storj/pull/2703

BrightSilence · August 7, 2019, 5:31pm

It might be somewhere I’m overlooking, but I don’t see any mention of what to do when a piece/segment that is in the node reservoir gets deleted.

nat · August 7, 2019, 8:05pm

Hello BrightSilence! Same as with the current system, during auditing we will check the database to see whether the segment has been deleted after trying to download all relevant shares. If it has been deleted, we move on to auditing the next segment.

BrightSilence · August 8, 2019, 5:57am

That’s done AFTER trying to download all shares? Would that not cause a missing file error on the storagenode? We’ve been assuming that’s a sign of a failed audit.

nat · August 8, 2019, 2:43pm

Yes, we check whether the pointer has been deleted after trying to download all shares, because the pointer can be deleted even right after we start the audit. If the pointer is still there, we audit it with the shares we have. Yes, this will generate a missing piece error on the storage node, but if the segment was deleted, the error won’t result in a failed audit.
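To make the ordering concrete, here is a minimal sketch of the flow described above. It is written in Python purely for illustration and is not the satellite's actual code; `PieceNotFound`, `AuditReport`, `audit_segment`, `download_share`, and `pointer_exists` are all hypothetical names.

```python
from dataclasses import dataclass, field


class PieceNotFound(Exception):
    """Hypothetical error raised when a node no longer holds the piece."""


@dataclass
class AuditReport:
    successes: list = field(default_factory=list)
    failures: list = field(default_factory=list)
    skipped: bool = False  # True when the segment was deleted mid-audit


def audit_segment(segment_id, nodes, download_share, pointer_exists):
    """Audit one segment (illustrative only).

    download_share(node, segment_id) and pointer_exists(segment_id) are
    caller-supplied callables standing in for the network and the database.
    """
    downloaded, missing = [], []

    # Step 1: try to download a share from every node holding the segment.
    for node in nodes:
        try:
            download_share(node, segment_id)
            downloaded.append(node)
        except PieceNotFound:
            # The node logs a missing piece, but nothing is decided yet.
            missing.append(node)

    # Step 2: check the pointer only now, because it may have been
    # deleted at any moment after the audit started.
    if not pointer_exists(segment_id):
        # Segment was deleted: skip it and count no failures.
        return AuditReport(skipped=True)

    # Step 3: the segment still exists, so missing pieces are real failures.
    return AuditReport(successes=downloaded, failures=missing)
```

In this sketch a node can see a missing piece error locally while the report still records no failure for it, which is the case discussed here.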

BrightSilence · August 8, 2019, 3:16pm

Thanks for the clarification. I guess we’ll have to keep that in mind as long as we use parsing the log as an indication of audit failure. I assume the current approach selects a random piece that still exists and then audits it, but the new approach might increase the number of audits on deleted pieces since it selects a piece from the reservoir instead.
An alternative would be to remove a piece from the reservoir during a delete operation, but I doubt that would be worth the overhead. And I assume the upcoming SNOboard will make these log parsing methods for finding failures unnecessary anyway.
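For anyone following along, a rough sketch of what a per-node reservoir with the delete hook mentioned above could look like. This assumes plain reservoir sampling (Algorithm R) with a fixed size; the `Reservoir` class and its `offer` and `remove` methods are hypothetical and not taken from the design doc.

```python
import random


class Reservoir:
    """Fixed-size uniform sample of the segments offered so far (illustrative)."""

    def __init__(self, size, rng=None):
        self.size = size
        self.items = []
        self.seen = 0  # how many segments have been offered in total
        self.rng = rng or random.Random()

    def offer(self, segment_id):
        """Consider one more segment for inclusion in the sample."""
        self.seen += 1
        if len(self.items) < self.size:
            # Reservoir not full yet: always keep the segment.
            self.items.append(segment_id)
            return
        # Replace a random slot with probability size / seen.
        j = self.rng.randrange(self.seen)
        if j < self.size:
            self.items[j] = segment_id

    def remove(self, segment_id):
        """Delete hook: drop the segment if present; return whether it was."""
        try:
            self.items.remove(segment_id)
            return True
        except ValueError:
            return False
```

The overhead in question is that every delete would have to call something like `remove` on the reservoir of each node holding a piece of that segment, even though most deleted segments will not be in any reservoir.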

nat · August 8, 2019, 4:11pm

We’re not exactly parsing the log to determine audit failure; we check the type of the returned error when downloading a share fails. Yes, the current approach will continue on to select a random stripe from a segment to audit. I’m also not sure if it would be worth the overhead to remove segments from the reservoirs during delete operations, but one of our first steps to implement this new auditing design is to create a simulation for this new selection strategy. This should help us determine how to handle deleted segments in a way that makes sense with the new system. We’ll keep you posted, thank you for the questions!

BrightSilence · August 8, 2019, 4:16pm

By parsing the logs, I was referring to us SNOs using scripts like the ones linked here.

Not something Storj should take into account when making choices, I’d say. I was just trying to determine if we should expect higher failure rates with the new approach.

nat · August 8, 2019, 4:31pm

Ah okay! I don’t think we should see higher failure rates, but we’ll keep an eye on it in testing for sure!