Hi everyone, we are working on designing a plan for how to handle what we call “unknown” audit errors. In the past, we treated them the same way as normal audit failures, which caused a lot of nodes to get disqualified for simple configuration issues, rather than because they were trying to cheat.
Looks good to me, although the terminology is all a bit awkward. “unknown audit reputation” and “under inspection” are a bit vague and don’t really cover what’s actually going on. I can’t really think of anything better to describe it though. If this is really usually caused by node misconfiguration, perhaps you could use something like node health or reliability. It’s not much clearer, but at least it would differentiate it a bit more from existing audit reputation. I’ll let you know if I think of something better.
I did have one other question that I don’t think was clearly covered. Would nodes under inspection count towards the RS repair threshold? Say 36 pieces of a segment remain, but 2 of them are on nodes under inspection: would the repair be triggered? If not, disallowing GET_REPAIR on nodes under inspection could, in rare situations, cause the number of pieces available for repair to drop below the minimum threshold.
I agree that the terminology is awkward. Another thought instead of “under inspection” would be “soft disqualification”, but I don’t know that it makes things any clearer. I would love to change it to something else, because the current term has made writing the document awkward.
Yes, these nodes will affect the RS health. They are treated just like disqualified nodes as far as the repair service is concerned. A big reason for this change is actually to preemptively repair segments. For example, if a segment has repairThreshold+1 pieces on nodes that are online and not disqualified, but two of those nodes return errors every time a piece is requested (like an unknown error during an audit), we want to repair that segment; our perception that it is healthy (above the repair threshold) would otherwise be inaccurate. Does that make sense? The most significant part of this is that we are not accidentally claiming that segments are healthier than they actually are.
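To make the piece counting concrete, here is a rough Go sketch of the idea. It is purely illustrative: the Piece type, the countHealthy helper, and the repair threshold of 35 are assumptions for this example, not the actual satellite code.

```go
package main

import "fmt"

// Piece describes where one erasure-coded piece of a segment lives.
// These fields are illustrative only, not the satellite's real types.
type Piece struct {
	NodeOnline   bool
	Disqualified bool
	Suspended    bool // "under inspection" because of unknown audit errors
}

// countHealthy counts the pieces the repair checker would consider reliable.
// Suspended nodes are treated the same as disqualified ones, so a segment
// can fall to the repair threshold even if those nodes are technically online.
func countHealthy(pieces []Piece) int {
	healthy := 0
	for _, p := range pieces {
		if p.NodeOnline && !p.Disqualified && !p.Suspended {
			healthy++
		}
	}
	return healthy
}

func main() {
	const repairThreshold = 35 // example value only

	// The example from the question: 36 pieces remain, 2 on suspended nodes.
	pieces := make([]Piece, 36)
	for i := range pieces {
		pieces[i].NodeOnline = true
	}
	pieces[0].Suspended = true
	pieces[1].Suspended = true

	healthy := countHealthy(pieces)
	fmt.Printf("healthy pieces: %d (repair threshold: %d)\n", healthy, repairThreshold)
	if healthy <= repairThreshold {
		fmt.Println("repair would be triggered")
	}
}
```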
I agree with “suspended” or “paused”; it suggests there is something wrong, but that it is only temporary as long as you do something about it.
I believe the word “paused” is currently used for disqualification, which doesn’t make sense because disqualification is permanent and “paused” suggests otherwise.
I had some other thoughts for consideration. The design doc mentions that there will be a timeout, and if you don’t get your score back above the threshold before the timeout expires, your node would be disqualified. This seems reasonable enough, but it might cause a node that has been “fixed” by the operator to get DQ’ed before it has time to recover. Might I suggest a small tweak that only disqualifies nodes on new audit failures or unknown audit failures after the timeout? That way, if the underlying issue is resolved, the node still gets a chance to recover as long as it succeeds on all audits until it’s back above the threshold.
I would also like to recommend using email notifications. Ideally, running a node would eventually be a set-it-and-forget-it type of thing, and email notifications could alert the operator to issues before they become a problem. They could also be used for normal audit failures, for example by informing SNOs when the audit score drops below 0.9 or 0.8.
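Even something as simple as this rough Go sketch would cover what I have in mind. The 0.9 and 0.8 thresholds are the examples above; the notifyIfDropped helper and the way it would send email are just assumptions for illustration.

```go
package main

import "fmt"

// Example warning thresholds; real values would be configurable.
var warnThresholds = []float64{0.9, 0.8}

// notifyIfDropped sends one message per threshold the audit score crossed
// downward since the last check, so the operator hears about a degrading
// node before it gets anywhere near disqualification.
func notifyIfDropped(previous, current float64, send func(msg string)) {
	for _, t := range warnThresholds {
		if previous >= t && current < t {
			send(fmt.Sprintf("audit score dropped below %.1f (now %.2f)", t, current))
		}
	}
}

func main() {
	// Stand-in for an email sender.
	sendEmail := func(msg string) { fmt.Println("email:", msg) }
	notifyIfDropped(0.95, 0.85, sendEmail)
}
```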
I like the concept. I agree that “under_inspection” is a bit awkward. Maybe something like “under_review” status or “action_required” status? Or “unknown_err_hold”?
What about a system that rewards additional recovery time if the node shows signs of working properly again? Let’s say the time-out period is 48 hours. As soon as the node is placed in the “under_inspection” state, the clock starts counting down. If the node starts returning successful audits for pieces that previously returned “unknown”, time gets added to the clock. This way, progress towards fixing the node adds time, preventing the node from being disqualified if it is fixed just prior to the time limit expiring. The sat could also try one last time when the clock hits zero, so every opportunity is given to the SNO to fix the problem. The amount of time added could be a multiple of the average time between audit requests if the time interval is short. Let’s say the average is 5 minutes and each successful audit of a previously “unknown” piece adds 10 minutes. When the clock gets back to 48 hours’ worth of time (or whatever threshold), the node can leave “under_inspection” mode. This could work in concert with other criteria.
To prevent the node from extending this indefinitely by keeping the clock within the 0-48 hr range, a hard limit could be placed that disqualifies the node if the problem is not fixed by then, even if the node is making progress. In this example it should be at least as long as it would take for the clock to count down from 48 hrs to 0 and then potentially recover from 0 back to 48 hrs (at a net gain of 5 minutes per 5-minute audit interval, roughly another 48 hrs), plus a reasonable amount of leeway.
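Here is a rough Go sketch of what I mean, using the example numbers above (48-hour window, 5-minute average audit interval, 10 minutes of credit per recovered audit, and a hard limit of roughly 96 hours). All of these names and values are illustrative, not anything that exists in the satellite today.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values from the example above; real ones would be tuned,
// and the hard limit would get some extra leeway on top.
const (
	inspectionWindow = 48 * time.Hour   // initial countdown when the node is suspended
	creditPerSuccess = 10 * time.Minute // credit for each recovered "unknown" audit
	hardLimit        = 96 * time.Hour   // absolute cap so the clock cannot be extended forever
)

// inspectionClock tracks how long a suspended node has left to recover.
type inspectionClock struct {
	remaining time.Duration // time left on the countdown
	elapsed   time.Duration // total wall-clock time spent under inspection
}

// tick advances wall-clock time; it returns false once the node should be disqualified.
func (c *inspectionClock) tick(d time.Duration) bool {
	c.remaining -= d
	c.elapsed += d
	return c.remaining > 0 && c.elapsed < hardLimit
}

// recordSuccess adds credit for a successful audit of a piece that previously
// returned an "unknown" error, capped at the original window.
func (c *inspectionClock) recordSuccess() {
	c.remaining += creditPerSuccess
	if c.remaining > inspectionWindow {
		c.remaining = inspectionWindow
	}
}

func main() {
	c := &inspectionClock{remaining: inspectionWindow}

	// Example run: audits arrive every 5 minutes and the operator fixes the
	// node after 40 hours, so each later audit nets +5 minutes on the clock.
	for c.tick(5 * time.Minute) {
		if c.elapsed > 40*time.Hour {
			c.recordSuccess()
		}
		if c.remaining >= inspectionWindow {
			fmt.Printf("clock fully recovered after %v under inspection\n", c.elapsed)
			return
		}
	}
	fmt.Printf("disqualified after %v\n", c.elapsed)
}
```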
Of course all of these values would need to be adjusted to fit the network’s needs for balancing node churn and durability. I think for this to be a fair process, the SNO needs to be notified in some way, email being the most logical choice.
While I like the concept of giving more time every time the node succeeds with an audit, I don’t think such a complicated solution is necessary. Good nodes should never fail audits. Let me clarify: they should never have outcome 2 (failure) or 5 (unknown). So if a node is indeed fixed, those should not appear anymore. Furthermore, outcome 4 (containment) should not go so far as to result in repeated issues that end in a failed audit. If the grace period for “under inspection” is over and any of those three outcomes appears, the node apparently wasn’t fixed and could be disqualified immediately. If none of them come up, the score will only go up, and the node should be allowed to recover even though the time has expired. In this case, the node is most likely healthy now and can be allowed the time it needs to recover, under a zero-tolerance stance on any kind of failure until the score is above the threshold. Because the node will still be audited, it can’t remain in this state for long anyway. Either the score will go back up or it will fail an audit and be disqualified. Additional time limits shouldn’t be necessary.
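To spell the rule out, here is a rough Go sketch of the decision I have in mind. The outcome numbers 2, 4 and 5 are the ones referenced above; everything else (the names, the remaining outcome values, the shouldDisqualify helper) is an assumption for illustration, not the actual design or code.

```go
package main

import "fmt"

// AuditOutcome mirrors the outcome numbering used in this discussion.
type AuditOutcome int

const (
	OutcomeSuccess     AuditOutcome = iota + 1 // 1 (assumed)
	OutcomeFailure                             // 2 (failure, per the thread)
	OutcomeOffline                             // 3 (assumed)
	OutcomeContainment                         // 4 (containment, per the thread)
	OutcomeUnknown                             // 5 (unknown error, per the thread)
)

// shouldDisqualify applies the rule described above: once the grace period has
// expired and the node is still below the unknown-audit threshold, a failure,
// an escalated containment, or another unknown error disqualifies it right away,
// while clean audits simply let the score keep recovering with no extra deadline.
func shouldDisqualify(gracePeriodExpired, belowThreshold bool, outcome AuditOutcome) bool {
	if !gracePeriodExpired || !belowThreshold {
		return false
	}
	switch outcome {
	case OutcomeFailure, OutcomeContainment, OutcomeUnknown:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldDisqualify(true, true, OutcomeUnknown)) // true: the node clearly wasn't fixed
	fmt.Println(shouldDisqualify(true, true, OutcomeSuccess)) // false: allowed to keep recovering
}
```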
I agree that good nodes should never fail audits, but isn’t the point of this state to give a chance to nodes that have been returning failed audits for “unknown” reasons that may be beyond the SNO’s control?
Unless I am misinterpreting, in this scenario the node will have a fixed amount of time to remedy its issue, and once it returns a successful audit for a piece that previously returned an “unknown” result, it would then be given an unlimited amount of time for its inspection score to recover. Then, if an additional audit comes back as “unknown” before the inspection score recovers, the node should be immediately disqualified? Or does the clock start again if another “unknown” audit is returned? Otherwise this assumes a one-and-done type fix for the unknown audits.
In a scenario where we don’t pause the clock after a single successful audit of a previously “unknown” piece, there could be a situation where the time allowed to be “under_inspection” expires before the inspection score recovers, since the score will take time to get back above the threshold. This is what I was trying to get around with the time-added idea. I don’t think it is particularly complicated outside of selecting the appropriate values (my intention was for fixed time values to be used, not necessarily dynamic ones). The sat would only need the ability to add time to a countdown timer that is already ticking down anyway.
As I think the intention of this functionality is to give SNOs a reasonable chance to fix a problem in good faith, the criteria shouldn’t be so strict as to DQ a node immediately for continued “unknown” audit results. Of course, since the regular audit score would still be continuously calculated, a node that starts returning failed (but not “unknown”) audits during this time would be DQ’d as normal.
The problem (in my opinion) with just giving a fixed amount of time for a SNO to remedy this specific “unknown” audit issue is that if you give 48 hours and I discover the problem on hour 47, there may be no time left to fix it. And since the “unknown” audit condition is, by definition, “unknown”, there may be compounding issues which prevent immediate remedies.
It’s possible that I may be trying to fix a theoretical problem that can’t actually exist, or that may not come up often in practice, but then again this “unknown” audit problem should also not come up very often (or at all) in practice. I just feel like good faith on the SNO side should be rewarded with good faith from the Storj side.
I think we’re trying to get to the same thing by different methods. I would argue that it’s important to be clear to SNOs how long they have to remedy the situation, and extending time frames just makes that more confusing. I would say the time allowed in “under inspection” should be the amount of time you get to fix the problem. It should be enough time to diagnose the issue, get in contact with support, and ask for help on the forums. We can argue about how long that should be, but that’s simply a matter of configuration and doesn’t impact the technical implementation. If the issue is resolved within this time, your node should be fine. If not, it should be disqualified. Simple and clear.
If it was not fixed, the node will keep running into unknown audit failures. If these happen after the reasonable time frame has expired, that should be the end of your node. The reason I include more than just unknown failures here is that you do not want to give the node the option of failing hard instead of returning unknown errors to try to work around the disqualification.
If it was fixed, there are two options: either your node has already recovered to an unknown audit score above the threshold, or it hasn’t. If it has, you’re all good with the currently suggested design. My suggested addition was simply to give nodes that were fixed but are still below the threshold time to recover. We both agree that nodes in good health don’t show these failures, so we can let them recover until they fail an audit or get an unknown failure response, because if those happen, the issue clearly wasn’t resolved and DQ is justified. The only change you would have to make is to not disqualify the node when the time window runs out, but when it fails its next audit (while still being below the threshold).
Now, there may be cases where a bug in the software causes the issue and the SNO is waiting on a bug fix. I would suggest giving support the option to extend the “under inspection” window in those cases, instead of having it extended somewhat arbitrarily whenever there happens to be one successful audit.
I really appreciate all the discussion that has been going on! I have some thoughts on some of the points that have been discussed, but please let me know if I missed something important.
Name: I think of all the suggestions, “suspended mode” was the best. I plan to update the document accordingly today.
I think this is a great idea - so in order to go from “suspended” to “disqualified”, two conditions must be met: the suspension grace period has passed, and the node has had some sort of audit error since.
Regarding the discussion between @baker and @BrightSilence about rewarding nodes with more time for successful audits when they are suspended, I agree with @BrightSilence generally. If the node operator fixes the issue, they should not have to worry about unknown errors or audit failures. Additionally, I hope when this feature is initially implemented, we provide plenty of time to diagnose and fix the issue - 48 hours is probably on the very low end of what we’d want. I was thinking a week or so to start while we are figuring things out (but this is definitely something worth discussing). With 48 hours, I would understand having some leeway if the operator is working till hour 47, but with a week, I don’t imagine this is something we need to worry about as much.
I agree; with a week of time there would be no reason to add leeway. I think most would agree that a week is quite generous. I had made an assumption that the time limit for nodes in the suspended state would be rather short, since the allowable downtime (5 hours, but not enforced at time of writing) is already a very small time window.
I am not sure being generous is something we need to worry about too much, especially if we can avoid losing a node. The main reason I think this is that suspended nodes will be treated like disqualified nodes by the repairer, so we will already be working on replacing them with healthy nodes, even during the grace period. So as far as the repairer’s perception of segment durability goes, 48 hours vs. 1 week makes no difference. Suspended is just as “unhealthy” as disqualified.
Okay, I may have misinterpreted the intent here then. Since the suspended node is now being treated like a DQ’d node by the repairer, does this mean that once the node is back online and healthy it will retain what’s left of its reputation, vetting, and escrow?