Uptime check in Dashboard

This is a new feature of the 0.27.1 dashboard.

There is no information anywhere about HOW it actually works. My node has been running for 6 days and it has not been offline for a single second since its first “online” status. Before that it was obviously down, as I needed to inspect the inbound traffic before allowing it through the firewall (that is also missing an explanation: who will connect, why, etc.).

I got an uptime of 98.1% - of course that’s really low for a 24/7 storage node.

Well, your node may have been unavailable due to your Internet provider having an issue that you were unaware of. There can also be regional outages between one region and another that you wouldn’t necessarily notice. I wouldn’t be concerned about 98.1%, as there is always going to be some margin for error there.

I’ve got two ISPs and I monitor my connectivity so I can switch which ISP is preferred; it’s done in my Juniper SRX firewall using ping. I’m also connected to AWS and others with IPsec tunnels and would notice any downtime, so no, that has not happened.
98.1% is about 13 hours of downtime per month, or nearly 7 days a year, so yeah, it’s quite a lot.
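
For reference, here is the arithmetic behind that figure (my own back-of-the-envelope numbers in Go, assuming the percentage really measured elapsed time, which later replies dispute):

    package main

    import "fmt"

    func main() {
        // Rough arithmetic only: read 98.1% uptime as a fraction of elapsed time.
        const uptime = 0.981
        const hoursPerMonth = 30 * 24.0 // ~720 h
        const hoursPerYear = 365 * 24.0 // 8760 h

        fmt.Printf("downtime per month: %.1f hours\n", (1-uptime)*hoursPerMonth)  // ~13.7 hours
        fmt.Printf("downtime per year:  %.1f days\n", (1-uptime)*hoursPerYear/24) // ~6.9 days
    }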

Well, it’s not so much calculating whether your node is up as it is whether your node is reachable.

I think the root cause is that counting starts when the identity is signed (from that moment the timer is ticking).
Then you start the storage node (not immediately after the identity was signed).

So, downtime is counted from when the identity was signed, not from the first node start.
(The time from signing the identity until the first node start is treated as downtime.)
I think that’s wrong. This can be easily reproduced.

I’m thinking that would mean when your node starts, you’d have 100% downtime.

Thanks, that is most likely the right answer, as it will obviously take some time between when the identity has been created and when the node is operational. That means, of course, that the uptime figure is incorrect, and it will always be incorrect and useless, as it is never reset, say monthly or even yearly. It will just add confusion.

No, I tested it about 10 days ago. I recommended Storj to a colleague and he set up a storage node, but it started at about 80% uptime, which surprised me. Then I tried to investigate how that was possible… I got a new invite and signed a new identity, then immediately ran my new storage node and saw that the uptime was 100% from the start. So my colleague had a big time gap between signing the identity and the first node run (he confirmed it).

I confirm: my storagenode STARTED with 50% uptime,
first launch was the day after signing,
and it now has 99.5%.

It’s interesting. Like the OP mentioned, it might be helpful to have some idea of how this is calculated overall and what is used to determine uptime. Like, does the node determine that the other node is down, or does the satellite? Where’s an admin? Oh wait, I’m one. I’ll ask myself and come up with an answer. Seriously though, I’ll ask one of the techs and let you know. Also, I expect the humor in this message to be removed by the global editor in 3, 2, 1…

I second what was stated above; it would be good to get some more insight.
From the satellites I get:
Uptime Check: 99.1% // 98.5% // 98.4% // 97.6%
Audit Check: 96.4% // 99.0% // 90.1% // 97.4%

Couple of questions:

  • First of all, the dashboard is monthly; does this also apply to the checks?
  • In my log files I never see audit failures; is an audit check something else?
  • If the items mentioned above are right and it is measured SINCE THE START of the node, that could work out (I had some faulty weeks on my Raspberry Pi 3 before I moved to the Synology), but that would account for audit check issues… the downtime was minor, so for me the uptime checks seem ‘low’.
  • Which leads me to the question of disqualification: is that measured overall, or is there a ‘new check’ starting today rather than from the beginning?

The only counter used is satellite-to-node communication. So if a satellite goes down, all nodes get docked uptime. This behavior has been confirmed in prior releases.

I guess when an identity is authorized, the node’s identity hash is distributed to the satellites…

Let me draw your attention to the design draft: New way to measure SN uptimes

In short: we are not tracking time. We are only tracking a count. As long as a storage node is online it will check in once per hour. If a storage node goes offline we might try to reach it 10 times per second. Please keep that in mind when you try to give the uptime percentage any context.
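
To make the “counter, not clock” point concrete, here is a minimal Go sketch assuming the score is simply successful check-ins divided by total check attempts (the counter names mirror the insert statement quoted further down; the numbers are made up for illustration):

    package main

    import "fmt"

    // uptimeScore sketches a purely count-based score: successful checks divided
    // by total checks, with no notion of how much wall-clock time they cover.
    func uptimeScore(successCount, totalCount int) float64 {
        if totalCount == 0 {
            return 1.0
        }
        return float64(successCount) / float64(totalCount)
    }

    func main() {
        // An online node checks in roughly once per hour: 24 successes in a day.
        // A brief restart, retried aggressively by the satellite, can add many
        // failed checks within a few minutes (hypothetical figures).
        successes, failuresDuringRestart := 24, 20
        fmt.Printf("score: %.1f%%\n", 100*uptimeScore(successes, successes+failuresDuringRestart)) // ~54.5%
    }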

For a few releases now the satellites have had zero downtime. The satellite is not just one process; it is a collection of services. The satellite has at least 2 API endpoints. We take one API endpoint offline and deploy the new version; the satellite is still fully operational with the second API endpoint, and as soon as the new API endpoint is running it takes over all the requests. Then we update the second API endpoint and that’s it: a deployment with zero downtime.

However, the storage nodes do have downtime. Again, we are not tracking time. You will get a random number of failed uptime checks even if you are just restarting your storage node really quickly.

No, please don’t start guessing. That will only confuse everyone. The code is open source. You can find the insert statement here: storj/satellite/satellitedb/overlaycache.go at 7abad3c6bb56a33e05f849c85048c5830cc48d7b · storj/storj · GitHub

The function name should already explain when it gets called the first time. It will also get called when a storage node is sitting behind a firewall and pings the satellite. The pingback will fail but the insert statement will be executed.

        INSERT INTO nodes
		(
			id, address, last_net, protocol, type,
			email, wallet, free_bandwidth, free_disk,
			uptime_success_count, total_uptime_count, 
            
            ...

			$10::bool::int, 1,
			CASE WHEN $10::bool IS TRUE THEN $24::timestamptz
				ELSE '0001-01-01 00:00:00+00'::timestamptz
			END,
			CASE WHEN $10::bool IS FALSE THEN $24::timestamptz
				ELSE '0001-01-01 00:00:00+00'::timestamptz

If uptime_success_count is set from a false value on the initial node insert, then that node’s clock is ticking the moment it is inserted. I don’t have time to search through all of the source code… but it looks like the INSERT statement counts a first uptime check, by asserting an initial ‘1’ in the total_uptime_count column, before it is known whether the SN has actually been started.
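
To put hypothetical numbers on that concern (this is my reading of the quoted INSERT, not a claim about the deployed behavior): if the very first pingback fails because the port is still closed, the row starts at 0 successes out of 1 check, and later hourly successes only slowly pull the ratio back up.

    package main

    import "fmt"

    func main() {
        // Hypothetical first contact: the node pings the satellite, but the
        // pingback fails because the firewall port is not open yet. Per the
        // quoted INSERT, uptime_success_count starts at $10::bool::int = 0
        // and total_uptime_count at 1.
        successes, total := 0, 1

        // Once the port is opened, assume one successful check per hour for a day.
        for i := 0; i < 24; i++ {
            successes++
            total++
        }
        fmt.Printf("%d/%d = %.1f%%\n", successes, total,
            100*float64(successes)/float64(total)) // 24/25 = 96.0%
    }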

Please correct my understanding if I have it wrong.

EDIT and Addendum:

After thinking about this problem for a few minutes, it seems fairly clear that a really good idea would be to add a software switch an SNO can use to signal that a node is now ready to be used… as in:

OK, now I’ve opened the ports in my firewall, and I see “Test Audits” showing… and now I can hit the “Production Ready” button or command-line option.

Once an SNO sends the OK-to-go signal, the satellite starts the uptime clock.
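
Something along these lines is what I have in mind (a purely hypothetical Go sketch; neither the flag nor the function exists in the current code):

    package sketch

    // Hypothetical only: "ProductionReady" is not a real storagenode option today.
    type nodeConfig struct {
        ProductionReady bool // the SNO flips this once ports are open and test audits appear
    }

    // checkIn would only announce the node to the satellite, and thereby start
    // the uptime counters, after the operator has signalled readiness.
    func checkIn(cfg nodeConfig, announce func() error) error {
        if !cfg.ProductionReady {
            return nil // stay silent; no uptime checks are recorded yet
        }
        return announce()
    }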

Please read the design document I have linked. There is no clock ticking!

Uptime Check SN Selection

This design doc doesn’t change the SN selection process for uptime checks; it keeps the currently implemented mechanism, with some minor modifications to classify the nodes which fail the uptime check so they can be rechecked by the second part of the algorithm.

Uptime Recheck Loop

This process should run independently of any current satellite process (i.e. as a service), and it should run at a configurable time interval.

The algorithm for each time interval iteration is the following:

  1. Select the first row from the failed_uptime_checks table which has the back_online and disqualified columns set to NULL, sorted by last_check in ascending order. If there is no such row, end (the process will be executed again in the next interval).
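
A rough Go sketch of that first step as a query (my reading of the draft; the table and column names come from the draft itself, not from implemented code):

    package sketch

    // selectNextRecheck picks the oldest failed check that has neither come back
    // online nor been disqualified, matching step 1 of the draft's recheck loop.
    const selectNextRecheck = `
        SELECT *
        FROM failed_uptime_checks
        WHERE back_online IS NULL
          AND disqualified IS NULL
        ORDER BY last_check ASC
        LIMIT 1
    `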

It seems to me that, by asserting 1 in the initial INSERT statement for an SN, a node would begin with at least one failed uptime check if it was inserted before it was actually started by the SNO.

You are quoting a design document that is not implemented yet. There is some information in that document about the current situation.

The problem with the current uptime SN reputation score calculation is that it isn’t based on the actual time that a node has been offline.

That is not possible. To trigger that insert statement the storage node operator has to start the node. There is no connection between the certificate authority server and all satellites. Only the storage node can trigger the insert statement.

OK…

As far as a clock ticking goes, the unimplemented design doc clearly shows that time is measured between increments of the uptime_count column per node row in the satellite database. I would define that behavior as a ticking clock.

Design doc aside, the assertion of an initial 1 in the total_uptime_count column in the implemented, released source code is most unfortunate.

A clock has a fixed interval. The counter has no interval at all. It can be one success per hour vs 100 failures per minute. There is no clock. It is just a counter.

Yes, that is correct, and that is how it should be. Beside the fact that the counter is useless, of course. The idea here is that the first check-in counts as an uptime check.