I appreciate the link to the design spec for the score calculation; however, since I am not a developer, this is gobbledygook to me.
From a top-level overview, I have 11,891 successful audits and 15 failed audits. My observation is that this is good! However, these 15 failed audits out of 11,906 total audits seem to have pushed my score below the 0.6 level, and my node was paused (naughty naughty poor node). That doesn't quite make sense to me.
Hopefully your team will be able to get to the bottom of this and clarify why this has occurred. This will also give other node operators a good indication of what to expect.
The way the system works, failing several audits in a row is much worse than failing them spread out. My guess is that those 15 happened all in a row, making the score drop below the threshold.
The score is not a percentage of failures but rather represents a trustworthiness score. There is something to be said for acting quickly when audit failures happen in a row, as that can signify data loss, but it may need a little tweaking.
That part works as designed. We expect that even with 1M storage nodes we can send each storage node a few audits per day. It is critical for file durability that we detect bad nodes as quickly as possible in order to trigger repair in time. You can be a healthy node for years, but once your storage node fails 15 audits in a row you will get paused. Failing 15 audits spread out over time is fine. Each successful audit will increase the reputation, and over time you can get back to a perfect score. If you collect more and more failed audits during that time, you will get paused at some point. Failed audits have a higher impact than successful audits!
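To make the "in a row" effect concrete, here is a minimal sketch of a beta-style reputation score of the kind the design spec describes. The update rule, the forgetting factor of 0.95, and the 0.6 pause threshold are illustrative assumptions, not the satellite's actual configuration:

```python
# Illustrative beta-reputation model (parameters are assumptions, not the
# satellite's real config): each audit shifts mass between alpha (successes)
# and beta (failures); the score is alpha / (alpha + beta).
FORGET = 0.95     # forgetting factor: how fast old audits fade (assumption)
THRESHOLD = 0.6   # pause threshold mentioned in the thread

def update(alpha, beta, success):
    v = 1.0 if success else 0.0
    return FORGET * alpha + v, FORGET * beta + (1.0 - v)

def score(alpha, beta):
    return alpha / (alpha + beta)

# A long history of successful audits converges to a perfect score:
a, b = 1.0, 0.0
for _ in range(2000):
    a, b = update(a, b, True)
print(round(score(a, b), 3))  # 1.0

# 15 failures in a row crater the score below the pause threshold:
for _ in range(15):
    a, b = update(a, b, False)
print(round(score(a, b), 3), score(a, b) < THRESHOLD)  # 0.463 True

# ...but a few failures scattered among thousands of successes barely
# register, because successes in between rebuild the score:
a, b = 1.0, 0.0
for i in range(2000):
    a, b = update(a, b, i % 500 != 0)  # one failure every 500 audits
print(round(score(a, b), 3))  # 1.0
```

Note how the raw failure count (15 out of 11,906) says little on its own; in a model like this, what matters is how recently and how consecutively the failures happened.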
Yes and no. If you file a support ticket we can unpause you. There is one known bug that lets a storage node fail audits even if it has never seen the requested piece ID. This happens on my storage node as well. It is close to impossible to get 15 of them in a row; it's more like a constant stream of a few failed audits over time. We are not able to tell you if that happened on your storage node, but there is an easy way to find out. If we unpause you, you will get audits for old data (let's say data from July), and if those fail you will get paused again in a very short time. At that point we can be sure the issue is on the storage node side.
Thanks, the node was unpaused last night and I was watching the number of audits. However, I may not have the right syntax to capture failed audits. Do you have a sample I can use? I want to monitor this for the next few days and see if we can capture the issue.
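Not an official answer, but here is a rough sketch of the kind of filter you could run over the node's logs while you monitor. The `GET_AUDIT` and `failed` markers are assumptions based on the default storagenode log format; adjust them to whatever your logs actually contain:

```python
# Hypothetical log filter: counts audit lines and collects the failed ones.
# The "GET_AUDIT" / "failed" markers are assumptions about the default
# storagenode log format -- adjust them to match your own logs.

def scan_audits(lines):
    """Return (total_audits, failed_audits, failed_lines) from log lines."""
    total, failed, failed_lines = 0, 0, []
    for line in lines:
        if "GET_AUDIT" not in line:
            continue  # not an audit transfer, skip
        total += 1
        if "failed" in line.lower():
            failed += 1
            failed_lines.append(line.rstrip())
    return total, failed, failed_lines

# Example with made-up log lines in the assumed format:
sample = [
    'INFO piecestore downloaded {"Piece ID": "abc", "Action": "GET_AUDIT"}',
    'ERROR piecestore download failed {"Piece ID": "def", "Action": "GET_AUDIT"}',
    'INFO piecestore downloaded {"Piece ID": "ghi", "Action": "GET"}',
]
total, failed, bad = scan_audits(sample)
print(total, failed)  # 2 1
```

On a typical Docker deployment you could dump the output of `docker logs storagenode 2>&1` to a file and feed its lines to `scan_audits`, then check the failed lines it collects each day.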
Also feel free to post any additional info I can use to capture more logging on the storage node for troubleshooting purposes.
Also, just a hypothetical: if the storage node was shut down for, say, 24 hours and is not online, would there still be failed audits reported for that node?