Design Draft - Refactor pending audits to allow for increasing the number of audit workers

This is a working draft, but we’d like your feedback: https://review.dev.storj.io/c/storj/storj/+/8041

A node must successfully complete a certain number of audits to pass the vetting process. As more nodes join the network, the vetting process for each node may take longer because the satellite is limited by how many total audits it can perform. We need to be able to scale auditing depending on how many new nodes recently joined.

We can’t safely scale the number of audit workers with the current implementation of the containment system. The core issue is that there can only be one pending piece audit per node at a time. We would like to refactor the pending audits system to solve this issue and improve code clarity.
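
To give a rough idea of the direction (the names below are purely illustrative, not the code in the linked change): the refactor keys each pending audit by node, segment, and piece instead of by node alone, so one node can have many outstanding reverifications and any number of workers can safely share the queue.

```go
// Illustrative sketch only; these names are not taken from the linked change.
// The key idea: a pending audit is identified by (node, segment, piece) rather
// than by node alone, so one node can have many outstanding reverifications and
// any number of workers can share the queue.
package audit

import (
	"context"
	"time"

	"storj.io/common/storj"
	"storj.io/common/uuid"
)

// PendingAudit identifies one piece on one node that still needs to be reverified.
type PendingAudit struct {
	NodeID        storj.NodeID
	StreamID      uuid.UUID
	Position      uint64
	PieceNum      int
	InsertedAt    time.Time
	LastAttempt   *time.Time // nil until a reverification has been attempted
	ReverifyCount int
}

// ReverifyQueue is the storage interface a pool of audit workers would share.
type ReverifyQueue interface {
	Insert(ctx context.Context, pending *PendingAudit) error
	// GetNextJobs returns up to limit pending audits that have not been
	// attempted within retryInterval, in random order across all nodes.
	GetNextJobs(ctx context.Context, limit int, retryInterval time.Duration) ([]*PendingAudit, error)
	Remove(ctx context.Context, nodeID storj.NodeID, streamID uuid.UUID, position uint64) error
}
```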

7 Likes

I’ve had this open in a tab ever since you posted it, but never got around to doing an in-depth read until now. I guess the upside is that there is now quite a bit of feedback from other Storj people on the page as well.

See this redash graph of how long it takes nodes to get vetted https://redash.datasci.storj.io/queries/2941/source#4683

Is this something you could share? The link isn’t public. From my own experience, the three customer satellites take about a month and a half, and the other three all take over 6 months currently. So I see the urgency of speeding up that process.

  • If each item results in a successful response (not offline or timeout), continue. Else, skip until the next iteration of the chore?
    This may give the node too much leeway in terms of getting away with missing data by timing out.
  • If a pending audit was attempted and reverification count is increased, don’t try it again for x amount of time. Add field for last_attempted timestamp

Depending on the timeframe, this would spread out failures evenly. Currently the audit scores have a short memory and rely on the bad luck of consecutive failures to drop all the way below the 60% threshold. Spreading out failures is currently an excellent way to avoid disqualification, even if the node has suffered significant data loss. For context, currently even nodes with 15% loss would survive audits when they are randomly ordered. If the errors are spread out, it wouldn’t surprise me if that nearly doubles.
I think the solution to this has already been discussed here: Tuning audit scoring - #72 by thepaul

The suggested changes we arrived at there put the threshold much closer to the percentage of failures that is actually acceptable and make the memory of the score longer. With that set up, the order of failures doesn’t really matter anymore, as the memory of the score is long enough for spread-out failures to still be remembered.
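
To make that concrete, here is a toy simulation of that kind of score model as I understand it (a beta-style score with a forgetting factor); the concrete numbers (the two lambda values, 15% loss, 1000 audits, initial alpha/beta and a weight of 1) are just for illustration, not production settings:

```go
// Toy simulation of the audit score model discussed in the linked thread:
// a beta-style reputation with a forgetting factor lambda. The update rule is
// how I understand that model; the concrete numbers (lambda values, 15% loss,
// 1000 audits, initial alpha=1/beta=0, weight 1) are illustrative only.
package main

import (
	"fmt"
	"math/rand"
)

// minScore runs n audits against a node that has lost lossRate of its data and
// returns the lowest score reached. If clustered is true, all failures happen
// in one consecutive run at the start; otherwise each audit fails independently
// with probability lossRate (failures spread out at random).
func minScore(n int, lossRate, lambda float64, clustered bool, rng *rand.Rand) float64 {
	alpha, beta := 1.0, 0.0
	failures := int(float64(n) * lossRate)
	lowest := 1.0
	for i := 0; i < n; i++ {
		fail := rng.Float64() < lossRate
		if clustered {
			fail = i < failures
		}
		v := 1.0 // +1 for a successful audit, -1 for a failed one
		if fail {
			v = -1.0
		}
		alpha = lambda*alpha + (1+v)/2
		beta = lambda*beta + (1-v)/2
		if score := alpha / (alpha + beta); score < lowest {
			lowest = score
		}
	}
	return lowest
}

func main() {
	rng := rand.New(rand.NewSource(1))
	for _, lambda := range []float64{0.95, 0.999} { // short memory vs long memory
		fmt.Printf("lambda=%.3f  spread min=%.3f  clustered min=%.3f\n",
			lambda,
			minScore(1000, 0.15, lambda, false, rng),
			minScore(1000, 0.15, lambda, true, rng))
	}
}
```

Under the short memory, the minimum score depends almost entirely on the ordering; with a longer memory and the threshold raised toward the actual acceptable loss rate, both orderings should end up on the same side of the threshold.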

I guess what I’m saying is that implementing those changes is probably a requirement before implementing the pending audit refactor. Otherwise you open yourself up to node operators abusing this change by simply timing out audits when they don’t have the file and using the spreading out to survive much worse data loss than otherwise possible.

That said, some spreading is required. If you collect stalled audits for a while and then try them all at once, failing them all at the same time, you get artificial clustering, which would cause the opposite problem: failures bunched together and the node disqualified too soon.

A possible solution is to not group the audits by node id at all and just tackle them randomly. That is still not ideal though, as the different processes would still cause artificial spreading or clustering, depending on how much work there is in both processes and how fast regular audits and reverifies are performed relative to each other. Implementing the audit score changes would help in any scenario.
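
To sketch what I mean (table and column names made up for illustration, not the actual schema), the reverify queue could be drained in random order across all nodes, and the last_attempted field from the quoted suggestion above keeps recently tried entries out of the batch:

```go
// Illustrative only; the table and column names are made up, not the real schema.
// Pending audits are picked in random order regardless of node, and anything
// attempted within the retry interval is skipped until a later pass.
const selectPendingAudits = `
	SELECT node_id, stream_id, position, piece_num
	FROM reverification_audits
	WHERE last_attempted IS NULL
	   OR last_attempted < now() - $1::interval
	ORDER BY random()
	LIMIT $2
`
```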

paul cannon
Jul 29
I’m going to suggest that we add AND NOT contained to our default node selection query. With this new change, nodes should only be in containment for about as long as they are offline. It’s in everybody’s interest not to send new data to a node which we think is offline.

I think this is a good suggestion. It was my understanding that containment isn’t necessarily for offline nodes, but for unresponsive online nodes. These nodes will almost certainly not respond to customers either, though. Not sending them new data doesn’t just prevent uplinks from encountering more failures; it also gives a stalled node a break to possibly recover from being overloaded.
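
For illustration, the change being suggested would look roughly like this in the node selection query (apart from the contained flag itself, the columns and conditions here are placeholders; the real query is more involved):

```go
// Rough illustration of the suggested change. Apart from "contained", the
// columns and conditions are placeholders; the real selection query is more involved.
const selectNodesForUpload = `
	SELECT id, address, last_net
	FROM nodes
	WHERE disqualified IS NULL
	  AND last_contact_success > now() - $1::interval
	  AND NOT contained  -- do not send new data to nodes with unresolved reverifications
	ORDER BY random()
	LIMIT $2
`
```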

If I remember correctly, auditing new unvetted nodes is prioritized by separate processes. Scaling of those should probably depend on the number of unvetted nodes and the average time in vetting. However, scaling for normal audits should depend on the number of nodes and the amount of data on them. This is probably out of scope for this blueprint, but I was curious what the ideas around that scaling were. Is this going to be a manual process or automated in some way? And will it differentiate between audits for unvetted nodes and normal audits?

Sorry for the slow response on this. But I hope this can still be helpful.

3 Likes

Thanks for reviewing, @BrightSilence! I’ll read through all your suggestions and questions and get back to you as soon as I can. In the meantime, here is a screenshot of the redash graph, since redash is only available internally. It shows the percentiles of (node vetted date - creation date), grouped by the month a node was created (on us1). E.g. the nodes created in July had a minimum of 25 days between creation and vetting. Since there may still be nodes created in July that haven’t been vetted, we won’t see those on the graph yet.

2 Likes

Days from node creation to vetted.

2 Likes

Thanks for posting these images. It seems I was pretty close with my own estimations, but it’s good to have hard numbers. I’ll use that to adjust my earnings estimator a little. In that last one, I think the numbers for 2022 aren’t entirely reliable yet because lots of nodes that started in 2022 haven’t been vetted yet. Correct?

It might be useful to plot based on vetted date, rather than start date. That would take out the drop-off near the end, which happens because only quickly vetted nodes are counted while the slow ones aren’t done yet. Not sure if that date is readily available though.

Yes, you’re right that the 2022 values aren’t as reliable yet.
I do have that data! Here are the days from node creation to vetted based on the vetted date.

Eu-n1, SLC, and US2 aren’t getting new data uploaded to new nodes except for repair data. With less data, those nodes are selected for audits less frequently. There isn’t a separate process to audit new nodes. We select new nodes for 5% of uploads, so the proportion of audits going to new nodes will be about the same.

1 Like

In response to your original message, it’s a good idea to implement the audit score calculation changes before this one. Also, good thinking to not group the audits by node id and just select them randomly.

In terms of how we’re going to scale the number of audit workers (both for regular audits and reverify audits), we were planning to adjust them manually based on metrics. Perhaps in the future we will implement a more automated process. Differentiating between audits for unvetted and vetted nodes might be a consideration too.
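
To make that knob concrete, it would look roughly like the sketch below, building on the illustrative queue interface from the opening post (the config and function names here are made up, not the actual satellite wiring): the reverify worker count is just a config value we can raise or lower, and every worker independently drains the same shared queue.

```go
// Illustrative sketch, not the actual satellite code. The worker count is a
// plain config value that operators adjust by hand based on metrics for now.
package audit

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

// WorkerConfig holds the manually tuned knobs.
type WorkerConfig struct {
	ReverifyWorkerCount int           // how many reverify workers to run
	RetryInterval       time.Duration // skip pending audits attempted more recently than this
}

// RunReverifyWorkers starts cfg.ReverifyWorkerCount identical workers that each
// pull pending audits from the shared queue and reverify them until ctx is done.
func RunReverifyWorkers(ctx context.Context, cfg WorkerConfig, queue ReverifyQueue,
	reverify func(context.Context, *PendingAudit) error) error {
	group, ctx := errgroup.WithContext(ctx)
	for i := 0; i < cfg.ReverifyWorkerCount; i++ {
		group.Go(func() error {
			for ctx.Err() == nil {
				jobs, err := queue.GetNextJobs(ctx, 1, cfg.RetryInterval)
				if err != nil || len(jobs) == 0 {
					// queue empty or momentarily unavailable; wait before polling again
					select {
					case <-ctx.Done():
						return nil
					case <-time.After(time.Second):
					}
					continue
				}
				_ = reverify(ctx, jobs[0]) // success/failure/containment handling elided
			}
			return nil
		})
	}
	return group.Wait()
}
```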

1 Like

I’ve been looking for where I thought I saw that. I think I have a false memory of reading a blueprint on this. Searching the forum I found only this.

I think I may have initially thought about how this would help nodes in vetting by ensuring they would get at least a minimum number of audits, and that later translated in my brain to this false belief. Gotta love how our brains take shortcuts.

The first two also hold a lot of data, so random audits are less likely to hit new nodes with very little data.

I really appreciate you sharing these numbers. They are closer than I was expecting, but look slightly more in line with what I’ve seen on my nodes recently. Though the customer satellites seem to vet a little faster now, probably because of increased ingress.

Yeah, that makes sense not to overdesign it right away. With larger scale, that might change though.

And retroactively making my false memory come true. That’s one way to fix that. :slight_smile:

2 Likes

No problem! I can see how that could have been interpreted to mean new nodes have a separate audit process.

And yeah, I’ll make sure to add your suggestions for future improvements to the design document so we can come back to that when it’s needed. Thanks for all your comments and suggestions!

2 Likes