So I’ve just learnt that my storage node has been paused due to an audit score below 0.6.
What I find interesting is that only a small number of failed audits on a host with 99.99% uptime gets it paused, which makes me question whether that score calculation is even correct.
Something doesn’t seem right and here’s an example:
I appreciate the link to the design spec for the score calculation, however since I am not a developer this is gobbledygook to me.
From a top-level overview, I have 11,891 successful audits and 15 failed audits. My observation is that this is good! However, these 15 failed audits out of 11,906 total seem to have pushed my score below the 0.6 level and my node was paused (naughty naughty poor node). That doesn’t quite make sense to me.
Hopefully your team will be able to get to the bottom of this and clarify why this has occurred. It will also give other node operators a good indication of what to expect.
The way the system works, failing several audits in a row is much worse than failing them spread out. My guess is that those 15 all happened in a row, making the score drop below the threshold.
The score is not a percentage of failures but rather a trustworthiness score. There is something to be said for acting quickly when audit failures happen in a row, since that can signify data loss, but it may need a little tweaking.
Failing audits is severely penalized because a storage node (SN) mostly fails an audit when the data has been deleted or changed, and audits are how we ensure that SNs are keeping the data.
Unfortunately the system can’t distinguish whether that happened intentionally or unintentionally.
That part works as designed. We expect that even with 1M storage nodes we can send each storage node a few audits per day. It is critical for file durability that we detect bad nodes as quickly as possible in order to trigger repair in time. You can be a healthy node for years, but once your storage node fails 15 audits in a row you will get paused. Failing 15 audits spread over time is fine: each successful audit increases the reputation, and over time you can get back to a perfect score. If you keep collecting more and more failed audits during that time, you will get paused at some point. Failed audits have a higher impact than successful audits!
Yes and no. If you file a support ticket we can unpause you. There is one known bug that lets a storage node fail audits even if it has never seen the requested pieceID. This happens on my storage node as well. It is close to impossible to get 15 of those in a row; it’s more like a constant stream of a few failed audits over time. We are not able to tell you whether that happened on your storage node, but there is an easy way to find out. If we unpause you, you will get audits for old data (let’s say data from July), and if those fail you will get paused again in a very short time. At that point we can be sure the issue is on the storage node side.
Thanks, the node was unpaused last night and I have been watching the number of audits. However, I may not have the right syntax to capture failed audits. Do you have a sample I can use? I want to monitor this for the next few days and see if we can capture the issue.
Also, feel free to post any additional info I can use to capture more logging on the storage node for troubleshooting purposes.
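While waiting for an official sample, a quick way to count audit outcomes is to scan the node log for audit lines. The substrings below ("GET_AUDIT", "failed", "downloaded") are assumptions about what the log lines contain; adjust them to match whatever your own log actually shows:

```python
# Rough log scan for audit results. The matched substrings are
# assumptions about the storagenode log format -- check a few real
# log lines first and adjust the patterns accordingly.
import sys

def count_audits(lines):
    ok = fail = 0
    for line in lines:
        if "GET_AUDIT" not in line:
            continue  # not an audit-related line
        if "failed" in line:
            fail += 1
        elif "downloaded" in line:
            ok += 1
    return ok, fail

if __name__ == "__main__":
    # Usage: python count_audits.py /path/to/storagenode.log
    with open(sys.argv[1], errors="replace") as f:
        ok, fail = count_audits(f)
    print(f"successful audits: {ok}, failed audits: {fail}")
```

Running it daily over the next few days would show whether the failures arrive as a steady trickle or in bursts.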
Also, just a hypothetical: if the storage node were shut down for, say, 24 hours and offline, would there still be failed audit reports for that node?
Those variables may be clearer to you now that the data science documents are accessible.
Those values are important for knowing how fast you can be disqualified, but also how frequently your storage node gets audited, which depends on the amount of data it is storing.
Sorry, I should have been more explicit: I didn’t mean that the documents contain the values actually used, only that they explain the math behind the mechanism.
The values are configurable.
I’m not sure which ones are used in production; maybe some are the defaults and others are not. The default values can be found in the source.
Yes, but it also depends on configuration parameters that determine how frequently the audit process runs, and on other factors like the speed of each connection, CPU load, etc.