Uptime check in Dashboard

Please read the design document I have linked. There is no clock ticking!

Uptime Check SN Selection

This design doc doesn’t change the SN selection process for uptime checks; therefore, the currently implemented mechanism is kept, with some minor modifications to classify the nodes which fail the uptime check so that they can be rechecked by the second part of the algorithm.

Uptime Recheck Loop

This process should run independently from any current satellite process (i.e. as a service), and it should run at a configurable time interval.

The algorithm for each time interval iteration is the following:

  1. Select the first row from the failed_uptime_checks table which has the back_online and disqualified columns set to NULL, sorted by last_check in ascending order. If there is no such row, the iteration ends (the process will run again at the next interval).
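As a rough illustration of the selection step quoted above, here is a minimal sketch using an in-memory SQLite database. The table and the back_online, disqualified and last_check columns come from the design doc excerpt; the node_id column, the timestamp format and the sample rows are assumptions made up for this example, and the real satellite database may look quite different.

```python
import sqlite3


def next_node_to_recheck(conn):
    """Return the oldest unresolved failed uptime check, or None if there is none."""
    return conn.execute(
        """
        SELECT node_id, last_check
        FROM failed_uptime_checks
        WHERE back_online IS NULL AND disqualified IS NULL
        ORDER BY last_check ASC
        LIMIT 1
        """
    ).fetchone()


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE failed_uptime_checks ("
    "node_id TEXT, last_check TEXT, back_online TEXT, disqualified TEXT)"
)
conn.executemany(
    "INSERT INTO failed_uptime_checks VALUES (?, ?, ?, ?)",
    [
        ("node-a", "2019-10-01T12:00:00Z", None, None),
        ("node-b", "2019-10-01T08:00:00Z", None, None),
        # node-c already came back online, so it must be skipped.
        ("node-c", "2019-10-01T06:00:00Z", "2019-10-01T07:00:00Z", None),
    ],
)

print(next_node_to_recheck(conn))
```

With these sample rows the query picks node-b: it is the unresolved row with the oldest last_check, while node-c is ignored because back_online is set.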

It seems to me that by asserting 1 in the initial INSERT statement for an SN, a node would begin with at least one failed_uptime_check if that node was inserted before it was actually started by the SNO.

You are quoting a design document that is not implemented yet. There is some information in that document about the current situation:

The problem with the current uptime SN reputation score calculation is that it isn’t based on the actual time that a node is offline.

That is not possible. To trigger that insert statement the storage node operator has to start the node. There is no connection between the certificate authority server and all satellites. Only the storage node can trigger the insert statement.

OK…

As far as a clock ticking goes, the unimplemented design doc clearly shows that time is measured between increments of the uptime_count column per node row in the satellite database. I would define this behavior as a ticking clock.

As far as the unimplemented design doc goes, the assertion of an initial 1 in the uptime_count column is most unfortunate in the implemented released source code.

A clock has a fixed interval. The counter has no interval at all. It can be one success per hour vs 100 failures per minute. There is no clock. It is just a counter.

Yes, that is correct, and that is how it should be. Besides the fact that the counter is useless, of course. The idea here is that the first check-in counts as an uptime check.

Real time clocks are counters. There is no difference between a clock counting seconds and a clock counting an incremented loop variable.

If true, then I am correct in my statement above that the “clock is ticking” on nodes whether those nodes are actually started or not, if the node is inserted into the satellite database as per the current source code quoted above.

“should be” … I would strongly disagree with that value judgement, especially considering that Uptime is a listed requirement for running a node and is listed as a future metric used to disqualify a node’s identity.

I would argue that that’s not true. A clock can be stalled, halted or even skip “ticks”, and it would still be considered a clock, just not a reliable one. And I think that’s what we have here…

OK, so let’s call it a clock with a random increment, then. At any time it may or may not increase by a random number, and at no point can it give you a useful value or meaning.

Nevertheless, SNOs see the uptime score fall during a satellite update, even if the SNO does not update their node.

I guess the same will occur if the connectivity from the satellite to the SNOs were interrupted. Of course the reverse is true as well, and of course this only takes into consideration connectivity between a single SNO and a satellite.

I am not talking about a single node, nor about a single SNO.

I’m sorry - I don’t understand

Things are going crazy with my nodes. Why is the uptime check not accurate? All my nodes are below 98% and 99.0%.

My uptimes are really bad. 2 are 50% and 2 are 78% … but that is 100% not true. Is there a way these will ever reset? Or what can we do to improve them? It seems that mine are stuck there. It’s been 3 days since the update and I don’t see any change in any of the numbers, not even 0.1% …

The uptime calculation is based on the number of attempts at pinging the storage node from the satellite. So, if the satellite pings less often, the uptime for a given node will change more slowly as well as less precisely…

There is nothing you can do except ensure that your node is online 24/7 … If your node goes offline and a satellite attempts to ping, the lost ping will count against your uptime calculation.

However, low uptime is not currently considered as a disqualifying factor.
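To make the point about ping counts concrete, here is a small sketch of how much a single missed ping moves a purely count-based percentage. It assumes the score is simply successes divided by total attempts, as described above; the function names and the sample totals are made up for illustration.

```python
def uptime_percent(success_count, total_count):
    """Count-based uptime: share of satellite pings that succeeded."""
    return 100.0 * success_count / total_count


def drop_from_one_missed_ping(success_count, total_count):
    """How many percentage points one additional failed ping costs."""
    before = uptime_percent(success_count, total_count)
    after = uptime_percent(success_count, total_count + 1)
    return before - after


# A node with a perfect record so far, at different (hypothetical) ping totals.
for total in (10, 100, 10_000):
    print(total, round(drop_from_one_missed_ping(total, total), 4))
```

The fewer pings a satellite has recorded, the larger the swing from each individual result, which is why scores on rarely pinged or new nodes look coarse and jumpy, and why a long history changes only slowly.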

If you’ve made copies of the databases to use with the payment estimator, you can perform the calculation yourself using the following query:

sqlite3 --header --column reputation.db "select hex(satellite_id),(100.00*uptime_success_count/uptime_total_count) as uptime from reputation;"

Here are my percentages:

hex(satellite_id)                                                 uptime          
----------------------------------------------------------------  ----------------
A28B4F04E10BAE85D67F4C6CB82BF8D4C0F0F47A8EA72627524DEB6EC0000000  99.5445561570852
004AE89E970E703DF42BA4AB1416A3B30B7E1D8E14AA0E558F7EE26800000000  99.5972426612966
84A74C2CD43C5BA76535E1F42F5DF7C287ED68D33522782F4AFABFDB40000000  99.5077355836849
AF2C42003EFC826AB4361F73F9D890942146FE0EBE806786F8E7190800000000  99.6753811373254

To display the raw counts:

sqlite3 --header --column reputation.db "select hex(satellite_id),uptime_success_count,uptime_total_count from reputation;"

And here are my raw counts:

hex(satellite_id)                                                 uptime_success_count  uptime_total_count
----------------------------------------------------------------  --------------------  ------------------
A28B4F04E10BAE85D67F4C6CB82BF8D4C0F0F47A8EA72627524DEB6EC0000000  11584                 11637             
004AE89E970E703DF42BA4AB1416A3B30B7E1D8E14AA0E558F7EE26800000000  12859                 12911             
84A74C2CD43C5BA76535E1F42F5DF7C287ED68D33522782F4AFABFDB40000000  9905                  9954              
AF2C42003EFC826AB4361F73F9D890942146FE0EBE806786F8E7190800000000  17195                 17251             

You’ll notice that the total count column varies widely. This is because the satellites don’t attempt to measure uptime at uniform intervals. The uptime displayed in the Web Dashboard is rounded to one decimal place. So, on my Dashboard, I have the following four uptime measurements:

  1. 99.5%
  2. 99.6%
  3. 99.5%
  4. 99.7%
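The four values listed above can be reproduced from the raw counts shown earlier. This sketch assumes the Dashboard applies ordinary rounding to one decimal place; the exact rule it uses is not shown in this thread, so treat that as an assumption.

```python
# (uptime_success_count, uptime_total_count) per satellite, from the output above.
raw_counts = [
    (11584, 11637),
    (12859, 12911),
    (9905, 9954),
    (17195, 17251),
]


def displayed_uptime(success_count, total_count):
    """Uptime percentage rounded to one decimal place (assumed Dashboard rule)."""
    return round(100.0 * success_count / total_count, 1)


displayed = [displayed_uptime(s, t) for s, t in raw_counts]
print(displayed)  # [99.5, 99.6, 99.5, 99.7]
```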

My node was down for a single 10-hour stretch at the beginning of September because of a telephone pole fire which melted the fiber optic connections. Other than that, my node has been running 100% of the time with zero network failures on my WAN connection.

So, my actual downtime should be listed as 0.368% … but two of the four satellites don’t agree with that actual realtime measurement and one is a little bit more generous than reality.
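For comparison, the time-based figure can be set against the count-based ones. The sketch below only uses numbers from this post (10 hours of downtime, 0.368% downtime, and the four satellite percentages); the node's total running time is not stated, so it is derived here from those two figures rather than known.

```python
downtime_hours = 10.0
downtime_percent = 0.368

# Total observation period implied by the two figures above (not stated in the post).
implied_total_hours = downtime_hours / (downtime_percent / 100.0)
print(round(implied_total_hours), "hours, about", round(implied_total_hours / 24), "days")

# Time-based uptime versus what each satellite's ping counts report.
time_based_uptime = 100.0 - downtime_percent
satellite_uptimes = [99.5445561570852, 99.5972426612966, 99.5077355836849, 99.6753811373254]
differences = [round(s - time_based_uptime, 3) for s in satellite_uptimes]
print(differences)
```

A negative difference means the satellite's count-based score is lower than the time-based uptime; a positive one means it is higher.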

So I only just got one node working; it started at 70% uptime checks. I have another one that I finally got running, after learning a bit more about port forwarding, at 4% uptime checks. Should I just delete the identity and get a new one for these nodes? They have only been up a few days.

Your node’s identity and data are tied together. If you delete the identity, your data becomes irrelevant and you won’t get your held amount.

Uptime disqualification is currently turned off, and only recent uptime measures would be counted anyway. If you’re convinced you can keep these nodes up and running, there is no problem with continuing to use them. Unless you lost data at some point, your node will recover and run like a charm.

Of course, if you only just started them both, you can choose to remove the data and start over with clean stats. But you would have to start from scratch, and I don’t really see a good reason to do that.

OK, I’ll just keep them running and see what happens. I have a second machine running on a different port, and I changed the public address to listen on 28968. Will that still work?

That will depend on how you have your port forwards set up. If the traffic is incoming on 28968 on the node machine, that will work. This means your router’s port forward also needs to forward 28968 from the outside world to 28968 on your node machine. If you could show how you set up the forwards at each step, I can give you a better answer.
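One way to sanity-check a forward like this is a plain TCP connection test. The sketch below is generic Python, not a Storj tool, and the helper name port_is_open is made up for illustration. The demonstration uses a throwaway local listener so it runs anywhere; to test a real forward you would call the function with your public address and 28968 from a machine outside your network, since a check from inside the LAN does not prove the router forwards anything.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Local demonstration only: open a temporary listener and probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
local_port = listener.getsockname()[1]
print(port_is_open("127.0.0.1", local_port))
listener.close()
```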