So I’ve just learnt that my storage node has been paused due to an audit score below 0.6.
What I find interesting is that only a small number of failed audits on a host with 99.99% uptime gets it paused, which makes me question whether that score calculation is even correct.
Something doesn’t seem right and here’s an example:
I appreciate the link to the design spec for the score calculation, however since I am not a developer this is gobbledygook to me.
From a top-level overview, I have 11,891 successful audits and 15 failed audits. My observation is that this is good! However, these 15 failed audits out of 11,906 total seem to have pushed my score below the 0.6 level and my node was paused (naughty naughty poor node). That doesn’t quite make sense to me.
Hopefully your team will be able to get to the bottom of this and clarify why this has occurred. It will also give other node operators a good indication of what to expect.
The way the system works, failing several audits in a row is much worse than failing them spread out. My guess is that those 15 all happened in a row, making the score drop below the threshold.
The score is not a percentage of failures but rather a trustworthiness score. There is something to be said for acting quickly when audit failures happen in a row, since that can signify data loss, but it may need a little tweaking.
Failing audits is severely penalized because a storage node (SN) mostly fails an audit when the data has been deleted or changed, and audits are how we ensure that SNs are keeping the data.
Unfortunately the system can’t distinguish whether that happened intentionally or unintentionally.
That part works as designed. We expect that even with 1M storage nodes we can send each storage node a few audits per day. It is critical for file durability that we detect bad nodes as quickly as possible in order to trigger repair in time. You can be a healthy node for years, but once your storage node fails 15 audits in a row you will get paused. Failing 15 audits spread over time is fine: each successful audit increases the reputation, and over time you can get back to a perfect score. If you keep collecting more and more failed audits during that time, you will get paused at some point. Failed audits have a higher impact than successful audits!
Yes and no. If you file a support ticket we can unpause you. There is one known bug that lets a storage node fail audits even if it has never seen the requested pieceID. This happens on my storage node as well. It is close to impossible to get 15 of those in a row; it’s more like a constant stream of a few failed audits over time. We are not able to tell you whether that happened on your storage node, but there is an easy way to find out. If we unpause you, you will get audits for old data (let’s say data from July), and if those fail you will get paused again in a very short time. At that point we can be sure the issue is on the storage node side.
Thanks, the node was unpaused last night and I have been watching the number of audits. However, I may not have the right syntax to capture failed audits. Do you have a sample I can use? I want to monitor this for the next few days and see if we can capture the issue.
Also, feel free to post any additional info I can use to capture more logging on the storage node for troubleshooting purposes.
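While waiting for an official sample, a quick way to count audit outcomes is to scan the node log for audit lines. The substrings below ("GET_AUDIT", "failed", "downloaded") are assumptions about what the log lines contain; adjust them to match whatever your own log actually shows:

```python
# Rough log scan for audit results. The matched substrings are
# assumptions about the storagenode log format -- check a few real
# log lines first and adjust the patterns accordingly.
import sys

def count_audits(lines):
    ok = fail = 0
    for line in lines:
        if "GET_AUDIT" not in line:
            continue  # not an audit-related line
        if "failed" in line:
            fail += 1
        elif "downloaded" in line:
            ok += 1
    return ok, fail

if __name__ == "__main__":
    # Usage: python count_audits.py /path/to/storagenode.log
    with open(sys.argv[1], errors="replace") as f:
        ok, fail = count_audits(f)
    print(f"successful audits: {ok}, failed audits: {fail}")
```

Running it daily over the next few days would show whether the failures arrive as a steady trickle or in bursts.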
Also, just a hypothetical: if the storage node were shut down for, say, 24 hours and offline, would there still be failed audit reports for that node?
Those variables may be clearer to you now that the data science documents are accessible.
Those values are important for knowing how fast you can be disqualified, but also how frequently your storage node gets audited, which depends on the amount of data it is storing.
Sorry, I should have been more explicit: I didn’t mean that the documents contain the values actually used, only that they explain the math behind the mechanism.
The values are configurable.
I’m not sure which ones are used in production; maybe some are the defaults and others are not. The default values can be found in the source.
Yes, but it also depends on configuration parameters that determine how frequently the audit process runs, and on other factors like the speed of each connection, CPU load, etc.