Score calculation correct?

So I’ve just learnt that my storage node has been paused because its audit score dropped below 0.6.

What I find interesting is that a host with 99.99% uptime gets paused over only a small number of failed audits, which makes me question: is that score calculation even correct?

Something doesn’t seem right and here’s an example:

curl -s http://localhost:14002/api/satellite/118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW | jq .data.audit
{
  "totalCount": 11906,
  "successCount": 11891,
  "alpha": 11.97473878476754,
  "beta": 8.02526121523242,
  "score": 0.5987369392383782
}

curl -s http://localhost:14002/api/satellite/118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW | jq .data.uptime

{
  "totalCount": 91665,
  "successCount": 90226,
  "alpha": 99.9999999999992,
  "beta": 2.108334792345044e-21,
  "score": 1
}

15 failed audits out of 11,906.
That’s a failure rate of 0.13%, not 40%.
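A quick sanity check (assuming, per the node-selection design document, that score = alpha / (alpha + beta)) shows the API output is at least internally consistent — the score is derived from the alpha/beta pair, not from the raw success percentage:

```python
# Values copied from the .data.audit API output above.
alpha = 11.97473878476754
beta = 8.02526121523242

score = alpha / (alpha + beta)
print(round(score, 10))        # 0.5987369392 -- matches the reported "score"

# The raw success percentage is a different quantity entirely:
success_rate = 11891 / 11906
print(round(success_rate, 4))  # 0.9987
```

So the question is not whether the division is wrong, but why alpha and beta ended up where they did.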

Hello @Sasha,
Welcome to the forum!

You can read the design document for the score calculation:

Thanks @Alexey,

I appreciate the link to the design spec for the score calculation, however since I am not a developer this is gobbledygook to me.

From a top-level overview, I have 11,891 successful audits and 15 failed audits. My observation is: this is good! However, those 15 failed audits out of 11,906 total seem to have pushed my score below the 0.6 level, and my node was paused (naughty, naughty, poor node). That doesn’t quite make sense to me.

Hopefully your team will be able to get to the bottom of this and clarify why it has occurred. This will also give other node operators a good indication of what to expect.

PS. On that page: https://github.com/storj/storj/blob/64602c3007800995adc5da3a2180e2f4fe960933/docs/design/node-selection.md

In the “Rationale” section, both of these links return a 404 error:

  • Reputation Scoring
  • Extending Audit/Uptime Success Ratios

The way the system works, it’s much worse to fail several audits in a row. My guess is that those 15 failures happened consecutively, making the score drop below the threshold.
The score is not a percentage of failures but rather a trustworthiness score. There is something to be said for acting quickly when audit failures happen in a row, as it can signify data loss, but it may need a little tweaking.

Where can I get the script that allows me to check reputation?

Failing audits is severely penalized because a storage node mostly fails an audit when the data has been deleted or changed, and audits are used to ensure that storage nodes are keeping the data.

Unfortunately, the system doesn’t distinguish whether that was intentional or unintentional.

There is no script, the reputation will appear in the new Storage Node dashboard once it’s released.

For now, the only possibility is to query the storage node dashboard API, as @Sasha did in the initial post.

2 Likes

I think you mean severely penalized. Hardly means the opposite (almost not) :wink:

@naxbc You can find info on the dashboard api here

Is the team able to investigate and confirm whether this was due to the end-of-July / start-of-August data reset that was part of the new node version release?

That is the only explanation that makes sense to me.

However, I am still concerned that only 15 failed audits out of several thousand triggered the node to be paused.

I suspect that during the node version change and reset, something didn’t quite reset and I was left with ~1.2 TB of data.

The last reset was at the end of June. There was no reset at the end of July! https://github.com/storj/storj/blob/0ccae6b061d84759216715c645f448476a4fe16a/storagenode/storagenodedb/database.go#L427

That part works as designed. We expect that even with 1M storage nodes we can send each storage node a few audits per day. It is critical for file durability that we detect bad nodes as quickly as possible in order to trigger repair in time. -> You can be a healthy node for years, but once your storage node fails 15 audits in a row, you will get paused. Failing 15 audits over time is fine. Each successful audit will increase the reputation, and over time you can get back to a perfect score. If you collect more and more failed audits during that time, you will get paused at some point. Failed audits have a higher impact than successful audits!
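The “in a row” effect can be sketched with a small simulation of the beta-reputation update from the design document (alpha and beta are decayed by lambda on every audit, and the winner of the audit gets the weight added). The lambda = 0.95 and weight = 1.0 values here are assumed defaults, not confirmed production settings:

```python
LAMBDA, WEIGHT = 0.95, 1.0  # assumed defaults; the real values live in the storj source

def update(alpha, beta, success):
    """Fold one audit result into the (alpha, beta) reputation pair."""
    v = 1.0 if success else -1.0
    return (LAMBDA * alpha + WEIGHT * (1 + v) / 2,
            LAMBDA * beta + WEIGHT * (1 - v) / 2)

def score(alpha, beta):
    return alpha / (alpha + beta)

# Warm up a node with a long run of successful audits.
a, b = 1.0, 0.0
for _ in range(2000):
    a, b = update(a, b, True)

# Case 1: 15 failures in a row -> score collapses below the 0.6 threshold.
a1, b1 = a, b
for _ in range(15):
    a1, b1 = update(a1, b1, False)
print(score(a1, b1))  # ~0.46, paused

# Case 2: the same 15 failures spread out (one per ~800 audits) ->
# the score dips only slightly and recovers each time.
a2, b2 = a, b
worst = 1.0
for _ in range(15):
    for _ in range(800):
        a2, b2 = update(a2, b2, True)
    a2, b2 = update(a2, b2, False)
    worst = min(worst, score(a2, b2))
print(worst)  # ~0.95, never in danger
```

Under these assumptions, the same 15 failures either pause the node or barely register, depending entirely on how tightly they cluster.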

Yes and no. If you file a support ticket, we can unpause you. There is one known bug that lets a storage node fail audits even if it has never seen the requested pieceID. This happens on my storage node as well. It is close to impossible to get 15 of them in a row; it’s more like a constant stream of a few failed audits over time. We are not able to tell you if that happened on your storage node, but there is an easy way to find out. If we unpause you, you will get audits for old data (let’s say data from July), and if they fail, you will get paused again in a very short time. At that point we can be sure the issue is on the storage node side.

1 Like

Yes, you’re right.
Bad word.
I edited my post to correct it.

Thanks.

Thanks, the node was unpaused last night and I was watching the number of audits. However, I may not have the right syntax to capture failed audits. Do you have a sample I can use? I want to monitor this for the next few days and see if we can capture the issue.

Also feel free to post any additional info I can use to capture more logging on the storage node for troubleshooting purposes.

Also, just a hypothetical: if the storage node was shut down for, say, 24 hours and is not online, would there still be failed audits reported against that node?

You don’t have to monitor it. You can get your score from the satellite: Storage node dashboard API

No. You have to be online to fail audits.

1 Like

@littleskunk
What are the current values of the constants AuditLambda and AuditWeight? And v, if it is not simply -1 or 1?
I am trying to imagine how fast a node can be DQed.

Those links are accessible now.
Thanks for reporting it.

1 Like

You may be clearer about what those variables mean now that the design documents are accessible.

Those values are important for knowing how fast you can be disqualified, but so is the frequency at which your storage node gets audited, which depends on the amount of data it is storing.

The chosen values aren’t part of those design documents. I’m sure they are in the source code somewhere, but that would require a bit more digging.
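As a rough back-of-the-envelope sketch: if the defaults really are AuditLambda = 0.95 and AuditWeight = 1.0 (an assumption — verify against the source), then a long-lived node sits at steady state with alpha + beta = 1/(1 - lambda) = 20, and after k consecutive failed audits its score works out to exactly lambda^k. The number of consecutive failures needed to cross the 0.6 pause threshold is then:

```python
import math

AUDIT_LAMBDA = 0.95   # assumed default, not confirmed from the source
PAUSE_THRESHOLD = 0.6

# Score after k consecutive failures from steady state is AUDIT_LAMBDA ** k,
# so find the smallest integer k with AUDIT_LAMBDA ** k < PAUSE_THRESHOLD.
k = math.ceil(math.log(PAUSE_THRESHOLD) / math.log(AUDIT_LAMBDA))
print(k)  # 10
```

Notably, this is consistent with the numbers in the first post: alpha + beta ≈ 20 = 1/(1 - 0.95), and the reported score of ~0.5987 is almost exactly 0.95^10.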

…and on the number of nodes in the Storj network. Anyway, I can get the audit rate from the logs.

Sorry, I should have been more explicit. I didn’t mean that the documents contain the values in use; they explain the math behind the mechanism.

The values are configurable.
I’m not sure which ones are used in production; maybe some are the defaults and others are not. The default values can be found in the source.

Yes, but it also depends on configuration parameters which determine how frequently the audit process runs, and on other factors like the speed of each connection, CPU load, etc.