Design draft: New way to measure SN uptimes

Hi everyone,

Our initial attempt to calculate SN reputation based on audits and uptime didn’t work out as we expected. For that reason, we disabled disqualification due to uptime failures, since it was disqualifying several nodes without it being clear how long they had been offline.

We need to re-enable the uptime reputation calculation for SNs and disqualify those that aren’t online long enough, in order to keep our network healthy. For that reason, we are working on designing a new way to assess SN uptime reputation.

The first draft has been sent (https://github.com/storj/storj/pull/2733). We are going to discuss it, but we would also love to have your feedback.

Many thanks!


Thank you for such an awesome project and all the hard work everyone has put in thus far. Based on my understanding of the network and the available data, I have provided my feedback below.

One of the greatest drawbacks of the current method of measuring SN uptime, and of the disqualification related to it, is that it does not take into account that not all nodes are equal. Factors that differentiate nodes from each other:

  1. Age of node
  2. Amount of data currently stored on the node
  3. Audit history reputation
  4. Network quality of service (is this a highly used node due to geographic location and response time?)
  5. Node escrow balance

While the above factors should not all be weighted the same, they need to be taken into consideration when calculating the disqualification threshold. Setting the threshold too high makes the availability of stored data less certain, while setting it too low could cause extremely high node churn and could become costly for the network.

As an example, let us use “Node A” and “Node B” with configurations as follows:

-Node A-
Age: 6 months
Disk used: 20TB
Egress: 1TB

-Node B-
Age: 2 months
Disk used: 500GB
Egress: 30GB

Node A has much more at stake than Node B as they have a much greater vested interest to keep their node up “at all costs” in order to avoid becoming disqualified since Node A’s escrow balance is fuller; plus avoiding going through the full process of setting up and building ‘used disk’ on another node. There are some circumstances where availability of the node cannot be guaranteed by the operator, such as a regional network outage. Additionally, instantly disqualifying both nodes in such an instance creates a monetary loss for the network as it now needs to pay other nodes for 20.5TB of repair traffic; $205 in this situation. Unless enough time has gone by for Node A’s escrow balance to hold $200, this has now become a loss to the network.
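
The arithmetic behind the $205 figure can be sketched as follows. This is only an illustration: the $10 per TB repair rate is an assumption inferred from "$205 for 20.5TB" above, and the function names are hypothetical.

```python
# Assumed repair payout rate, inferred from "$205 for 20.5TB" in the example.
REPAIR_RATE_USD_PER_TB = 10.0


def repair_cost_usd(disk_used_tb, rate=REPAIR_RATE_USD_PER_TB):
    """Cost for the network to repair all data held by a disqualified node."""
    return disk_used_tb * rate


def network_loss_usd(disk_used_tb, escrow_usd):
    """Part of the repair cost that the node's escrow does not cover."""
    return max(0.0, repair_cost_usd(disk_used_tb) - escrow_usd)


node_a_tb = 20.0  # Node A: 20TB used
node_b_tb = 0.5   # Node B: 500GB used
total = repair_cost_usd(node_a_tb) + repair_cost_usd(node_b_tb)
print(total)  # 205.0
```

With these assumptions, Node A alone accounts for $200 of repair traffic, so the network only breaks even on it once its escrow holds at least that much.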

With the above in mind, in the current model, since ingress traffic is ‘free’ (and there will be situations where free storage is provided to encourage network adoption), this sets a precedent for an attack on the network: nodes are set up while ‘free data’ is uploaded to the network and are then dropped, so that the remaining nodes obtain monetary benefit through network repair traffic. The only workaround I can think of for this problem is to prioritize ‘free storage’ traffic to older vetted nodes.

Clearly, the benefits of preserving nodes with high ingress/egress throughput rates and capacities do not need to be spelled out here. This needs to be calculated into the node disqualification threshold. (Note: throughput rate needs to be validated from the satellite side. Historical transfer information may be the best measure for total capacity. These two need to be mutually exclusive measurements.)

Since we see that there are several reasons to encourage ‘veteran’ nodes from both a network health and monetary protection standpoint I propose an uptime measurement system which has a dynamic threshold based on the above five factors. The downtime threshold should not reset from month to month, like the current system, but should have a trailing 45-day window. I would argue that, in the most extreme cases, it is reasonable to allow up to 48 hours of downtime for the most veteran nodes that have the best ‘stats’ before complete network disqualification. The following are the recommended SN uptime calculation parameters:

i) As the age of a node increases, so does the probability that the node will come back online. Greater age should increase the disqualification threshold linearly.
ii) The greater the amount of used disk on a node, the more it is in the network’s best interest, from a monetary standpoint, to try to retain that node. This should increase the disqualification threshold linearly.
iii) ‘i’ should be factored against ‘ii’ to determine the cost to perform a network repair.
iv) Audit history reputation should reduce the threshold exponentially since data integrity is in question.
v) A “desired” baseline for node Mbps up/down and for total bandwidth capacity should be determined. If a node’s values equal the established baseline, no change factor is applied. If they are below the baseline, the disqualification threshold is reduced based on how far below the baseline they are; if above, the threshold is increased based on how far above the baseline they are.
vi) Taking all of the above into account, each node should have an established monetary “downtime cost per minute” attached to it.
vii) Once a node is “down” it starts losing time from its disqualification time threshold.
viii) The remaining disqualification time is calculated on a trailing 45-day window.
ix) Once a node’s disqualification time pool has been depleted, the value from ‘vi’ should be deducted from the node’s escrow account and absorbed by the network until the escrow account balance becomes 0.
x) Once a node’s remaining disqualification time and escrow balance both become zero, the node is permanently disqualified.
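
To make the proposal more concrete, here is a minimal sketch of items i, ii, iv, v and vii to x. Every constant, field name and scale in it is a placeholder assumption rather than a recommended value. The repair-cost weighting from item iii is not modeled, and the "downtime cost per minute" from item vi is simply passed in.

```python
import math
from dataclasses import dataclass

MAX_ALLOWANCE_MIN = 48 * 60  # 48-hour cap suggested for the best nodes


@dataclass
class Node:
    age_months: float
    disk_used_tb: float
    audit_score: float      # hypothetical scale: 1.0 = perfect, 0.0 = worst
    bandwidth_ratio: float  # measured capacity / desired baseline (item v)
    escrow_usd: float


def downtime_allowance_min(node, base_min=60.0, per_month=30.0,
                           per_tb=60.0, audit_k=5.0):
    """Minutes of downtime allowed within the trailing 45-day window."""
    # Items i and ii: age and used disk raise the allowance linearly.
    linear = base_min + per_month * node.age_months + per_tb * node.disk_used_tb
    # Item iv: a poor audit history shrinks the allowance exponentially.
    audit_factor = math.exp(-audit_k * (1.0 - node.audit_score))
    # Item v: scale by distance from the bandwidth baseline (1.0 = no change).
    return min(linear * audit_factor * node.bandwidth_ratio, MAX_ALLOWANCE_MIN)


def settle(node, downtime_min_in_window, cost_per_min_usd):
    """Return (remaining_escrow_usd, disqualified) following items vii to x."""
    overage = downtime_min_in_window - downtime_allowance_min(node)
    if overage <= 0:
        return node.escrow_usd, False  # time pool not yet depleted
    # Item ix: downtime beyond the pool is charged against escrow.
    charge = min(node.escrow_usd, overage * cost_per_min_usd)
    remaining = node.escrow_usd - charge
    return remaining, remaining <= 0   # item x: both pools are empty


node_a = Node(age_months=6, disk_used_tb=20.0, audit_score=1.0,
              bandwidth_ratio=1.0, escrow_usd=100.0)
print(downtime_allowance_min(node_a))  # 1440.0
print(settle(node_a, 1500, 0.5))       # (70.0, False)
print(settle(node_a, 2000, 0.5))       # (0.0, True)
```

The caller is expected to sum downtime over the trailing 45 days before calling `settle`, so the allowance never resets at a month boundary.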

Best regards,
Nathan


@ndragun welcome to our forum.

Many thanks for your detailed feedback.

At first glance, it looks like an interesting approach to how we should consider several factors when disqualifying Storage Nodes; the same factors could also be considered when calculating their reputation.

We are going to discuss your feedback and will indeed consider what you have proposed; however, I’m afraid that we won’t be able to tackle all of it at once.

From our side, we’ve already discussed that it’s hard to determine whether we should disqualify a Storage Node just for being offline longer than a maximum allowed time, and we have started to consider whether, for now, we should only gather such information and do a post-analysis.

We are still discussing this, and I’m going to bring your feedback to our next discussions in order to assess it and see how and when we could plan a more intelligent disqualification system.


At the risk of suggesting what you are already considering, may I suggest using the mentioned variables, as well as all other variables a satellite collects about storagenodes, in a statistical model to predict the chance of the node returning or being gone for good. You could then set a threshold on that chance and disqualify the node, after a set minimum time, when the chance drops below the threshold.

One step beyond that you could multiply that chance by the cost to repair all data on the node to calculate the opportunity cost of disqualifying a node and set a threshold on that. When the opportunity costs drop below the threshold, and the minimum timeframe has passed, DQ.

If you want to go even further you could train another statistical model to predict the average time it will take for a node to return and the cost associated with the repairs needed to be done while it is offline. This cost could be subtracted from the opportunity cost.
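
A rough sketch of that decision rule, assuming the model outputs are already available as plain numbers. The predicted probability and the expected interim repair cost would come from the trained models; here they are just inputs, and all names and thresholds are hypothetical.

```python
def opportunity_cost_usd(p_return, full_repair_cost_usd,
                         expected_interim_repair_usd=0.0):
    """Expected repair cost avoided by NOT disqualifying the node yet.

    p_return: predicted chance (0..1) that the node comes back online.
    expected_interim_repair_usd: predicted cost of repairs needed while
    waiting for the node to return (the optional second model).
    """
    return p_return * full_repair_cost_usd - expected_interim_repair_usd


def should_disqualify(p_return, full_repair_cost_usd, hours_offline,
                      min_hours_offline, cost_threshold_usd,
                      expected_interim_repair_usd=0.0):
    """DQ only after the minimum timeframe and a low opportunity cost."""
    if hours_offline < min_hours_offline:
        return False
    cost = opportunity_cost_usd(p_return, full_repair_cost_usd,
                                expected_interim_repair_usd)
    return cost < cost_threshold_usd


# A node that will very likely return and holds $200 worth of repair traffic.
print(should_disqualify(0.9, 200.0, hours_offline=10,
                        min_hours_offline=4, cost_threshold_usd=20.0))  # False
# The same node once the model considers a return very unlikely.
print(should_disqualify(0.05, 200.0, hours_offline=10,
                        min_hours_offline=4, cost_threshold_usd=20.0))  # True
```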

Scoring these models should be fast and straightforward, and could be done every hour a node is offline after the minimum time for disqualification is reached.

This approach will have a lot of the effects as outlined by @ndragun without having to program specific rules around all relevant variables. Furthermore, it would be less based on assumptions but based on what the data has actually shown to be true.


Beautiful, a lot of respect for a well thought-out reply. This is something I would think of as a perfect system!

After some internal meetings in which your feedback was considered and very much appreciated, we concluded that it is not yet clear to us how SN offline time should affect node reputation.

We decided that we first have to collect the downtime data; our data science team will then have more insight into how downtime should affect SN reputation and when nodes should be disqualified for being offline too much.

Because of that, the initial design document has been dropped and a new design document, which only covers how to track offline time, is now in draft. You can see it at https://github.com/storj/storj/pull/2857

Thanks again for your awesome feedback!


According to the new draft, the downtime is recorded in two places - number of seconds offline in one table and count of success/failures in another. Any particular reason for that?

I hope that the rechecking of offline nodes is done quite frequently, so a 5min downtime is not recorded as an hour. However, if offline and online nodes are checked with different intervals, then the “uptime success/total count” would not really mean anything, since an offline node would rack up the count faster/slower than an online one.

According to the new draft, the downtime is recorded in two places - number of seconds offline in one table and count of success/failures in another. Any particular reason for that?

  1. We want to keep some historical data about offline detections rather than a single number that sums up the total offline seconds of each SN.
  2. The nodes table already exists in the Satellite and is already accessed too frequently. We aim to reduce the load on the database, or at least spread it out so that we can scale better.

I hope that the rechecking of offline nodes is done quite frequently, so a 5min downtime is not recorded as an hour.

That won’t happen with the current design. It tracks the exact offline time or less, but never more than the time we are certain of.
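
To illustrate what "the exact offline time or less" can mean, here is a small sketch. It is not the algorithm from the design document; it assumes, purely for illustration, that offline time is only counted for intervals bounded by two consecutive failed contacts, so the result can never exceed the real downtime. The function name and input format are hypothetical.

```python
def certain_offline_seconds(checks):
    """Sum only the offline time we can be certain about.

    checks: list of (unix_seconds, contact_ok) tuples sorted by time.
    An interval counts as offline only when the checks at both of its
    ends failed; anything else is treated as unknown and not counted.
    """
    total = 0
    for (t0, ok0), (t1, ok1) in zip(checks, checks[1:]):
        if not ok0 and not ok1:
            total += t1 - t0
    return total


checks = [
    (0, True),       # node seen online
    (3600, False),   # first failed contact
    (3900, False),   # recheck 5 minutes later, still offline
    (4200, True),    # node is back
]
print(certain_offline_seconds(checks))  # 300
```

In this example the node may really have been offline for up to about 70 minutes, but only the 5 minutes between the two failed contacts are recorded. Rechecking offline nodes more often narrows that gap.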

However, if offline and online nodes are checked with different intervals, then the “uptime success/total count” would not really mean anything, since an offline node would rack up the count faster/slower than an online one.

Between them, yes, that’s correct. That is already happening because we are not checking the SNs homogeneously; however, those numbers will vary more.

We assume that SNs behave correctly. If we detect that one is offline, it isn’t behaving correctly, so we recheck it constantly.

OK, I read it multiple times and finally got it. Seems to work OK except for one case where a node comes online and then goes down again just before the next check.
Though if the offline nodes are rechecked frequently enough this would not be a problem I guess.

Seems to work OK except for one case where a node comes online and then goes down again just before the next check.

We know that an SN can be offline for the first 59 minutes 59 seconds without us tracking any offline time for it.
This is because we assume that SNs behave correctly and each SN has to contact the satellite every 60 minutes (this is a consequence of the kademlia removal).

An SN could cheat in that way and stay offline forever; however, I don’t see why an SN would want to be offline without being detected, considering the incentive it has for serving data.

Apart from that, any SN doing so runs the risk of being caught by the audit service, which then updates its last_contact_failure; at that point, some offline time will be registered until the SN makes its next planned satellite contact.

Though if the offline nodes are rechecked frequently enough this would not be a problem I guess.

Yes, if we increase the frequency, then we’ll have more accurate offline time.

I was just thinking, there are two downsides to nodes being offline for a while: availability of data and cost of repair. I understand that you need to disqualify nodes that have proven unreliable because they could harm availability. But this seems to be more of a risk with repeat offenders. Say a normally good node is offline for 2 days. If that has happened only once, there is not much risk for data availability, but it could cause significant cost for repair. For normally reliable nodes, it could be possible to allow for longer downtime if the costs could be compensated.

So you could work with 2 thresholds, one to determine whether compensation should be taken from escrow (escrow would need to be rebuilt by withholding a % of future payouts). The other would be to determine whether the node can be considered reliable going forward and would lead to disqualification if the answer is no. Optionally you could put the nodes in vetting phase after this to limit future risk and take some time to reassess that the node can be trusted again.

This could give generally reliable nodes much more time to recover and would prevent having to repair lots more data because a returning node was disqualified too soon. And I think SNOs will understand the penalty in payouts, which would take over the role of incentive to stay online all the time.
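
A minimal sketch of the two thresholds, assuming the first one is based on the length of a single outage and the second on how often outages recur. Both limits and all names are made-up placeholders.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    CHARGE_ESCROW = "charge_escrow"  # compensate repair cost from escrow
    DISQUALIFY = "disqualify"        # or, optionally, send back to vetting


def assess_outage(outage_hours, outages_in_window,
                  compensation_after_hours=4.0, max_outages=3):
    """Apply the reliability threshold first, then the compensation one."""
    # Threshold 2: repeat offenders are no longer considered reliable.
    if outages_in_window > max_outages:
        return Action.DISQUALIFY
    # Threshold 1: a long outage costs escrow but does not disqualify.
    if outage_hours > compensation_after_hours:
        return Action.CHARGE_ESCROW
    return Action.NONE


# A normally good node that was offline for 2 days, once.
print(assess_outage(outage_hours=48, outages_in_window=1))  # Action.CHARGE_ESCROW
```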


Let me add my two cents on the handling of planned downtime: this month, my ISP has already announced a second emergency downtime of their infrastructure. An emergency one, yet planned, required to replace some backbone equipment I guess. In both cases they reserved 3-hour windows, but the actual connectivity outage can be much shorter than that. It would really be beneficial to be able to specify planned maintenance windows. This does not necessarily mean the node would be offline for the whole time, but the network should expect legitimate unavailability of that node and, e.g., postpone any writes to it.
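
As a sketch of the idea, a satellite could check whether the current time falls inside a window the operator announced in advance and, if so, skip the node for new writes. No such feature exists as far as this thread shows; the function name and the window format are purely hypothetical.

```python
from datetime import datetime, timedelta, timezone


def in_planned_window(now, windows):
    """True if `now` falls inside any announced (start, end) window."""
    return any(start <= now < end for start, end in windows)


# A 3-hour window like the ones reserved by the ISP.
start = datetime(2019, 9, 10, 1, 0, tzinfo=timezone.utc)
windows = [(start, start + timedelta(hours=3))]

now = datetime(2019, 9, 10, 2, 30, tzinfo=timezone.utc)
if in_planned_window(now, windows):
    print("postpone writes to this node")
```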