Satellite downtime problem

Today I saw that one of the satellites (118) was down (I think it was being upgraded):

Downtime start: 2019-10-30 09:42:53 UTC+2
Downtime end: 2019-10-30 10:17:57 UTC+2
Duration: 35 minutes.

I took a screenshot of my nodes during the downtime:

And another one after the satellite came back up:

As you can see, satellite downtime affects storage node uptime!
I think this is an issue.

I can confirm that.
Let’s hope it’ll be fixed when the new system for uptime checks is implemented.
However, these are still test satellites, and in production a satellite is not allowed to have any downtime.

Thanks for the confirmation @kevink!
I just want to make sure this issue doesn’t go live to production… so I reported it now, and I also hope it will be fixed in the new uptime check system.

There should be a third-party entity validating the uptime of a satellite node independently of a storage node. The current method of counting time between pings is never going to give a correct value.

Perhaps a solution to the conundrum is to use some sort of DNS validator … or something like a STUN server which would be able to validate that a given host is up and the appropriate port is open to the Internet.
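
To illustrate, the external validator wouldn’t need much more than the ability to open a TCP connection to the node’s public port from outside. Here is a minimal sketch in Go; the node address is a hypothetical example, not anything from the actual Storj design:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// reachable reports whether a TCP connection to addr can be opened
// within the given timeout.
func reachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Hypothetical storage node address; a real validator would take this
	// from the satellite's node table.
	fmt.Println(reachable("node.example.com:28967", 5*time.Second))
}
```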

They are already working on a new system and have a post somewhere here (please search for yourself). So we just need to wait for that new system and we all know the current system has a lot of flaws.

There’s actually no need for an SN uptime metric. The network already has a timed factor based on piece store expiration. Measuring SN uptime is redundant, wastes scarce bandwidth, and is error-prone.

If a given node drops off the network for a long time, the data stored on the node will be old and therefore the node will be less likely to have data relevant to the requests. If a node that dropped off the network suddenly springs back to life, then there’s a vetting process in place to slowly bring that node back on…

So, there’s really no need to count downtime as a node metric… it’s already built into the network. And there’s no need to disqualify nodes that pop in and out of the network… that’s what quarantine does.

I’m not sure about that but we’ll see what they’ll come up with.

Pieces don’t have an expiration date by default on v3, so your argument doesn’t stand. Nodes that are frequently offline will also cause lots of repair to be triggered and eventually be a big cost to the network as all that repair traffic needs to be paid for. Uptime is very important. I would argue for loosening the requirements a bit, but never for removing them altogether, that would be very bad for the network.

There is still no method that will correctly measure uptime of an SN via in-channel communications. It just is not possible to do. If both sides of the communication channel can be down for an unknown length of time, then there is no method which will be able to correctly measure the downtime of either side.

A third entity is required in order to measure the relative uptime of an SN vs. a satellite. The third entity could be running on a satellite node’s infrastructure… but it can’t be measuring SN downtime via in-band signaling while also dropping out of the network at random moments for unknown lengths of time. Such a policy will never result in a correct measurement.

One could do some creative process using cryptographically tied clocks running on both the SN and the satellites. However, such a process will only show the view from inside the host… the network connection to the host cannot be measured without a third entity in the mix.

I don’t know what the numbers look like for bandwidth usage of repair traffic vs. ping traffic, but it’s possible that removing the ping traffic for measuring uptime would create enough room for the relatively small amount of repair traffic. Any way you look at the situation, if SNs are going to be disqualified for some performance metric, eventually the number of SNs will shrink and those SNs will become de facto centralized nodes in the network.

Your assumption is wrong. The satellite can NEVER be down in production!

If it is, you have other problems than SNO uptime falling a bit. Like angry customers, inaccessible data, etc. Would be like AWS being down.

I’m simply stating the reality on the ground at the moment. If at some future time all satellites have guaranteed 100% uptime, then -of course- pinging SNs is fine. But then it must be remembered that the measured uptime of an SN can never be resolved more finely than twice the ping interval. So, if the goal is to disqualify SNs with more than 5 hours of downtime in a given month… then the pings need to be frequent enough that the measured downtime comes close to the actual value.

Measuring every 15 minutes yields a 30-minute resolution. So, an SN might actually be unavailable for only a small amount of time while the network will record a much larger time. There’s no way to fix this problem, since it is basic sampling theory that applies to all areas of signal measurement and is known as the Nyquist frequency.
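
To make the resolution point concrete, here is a toy Go calculation, assuming downtime is naively attributed as the span between the last successful check and the next successful one; the 15-minute interval and 2-minute outage are made-up numbers:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	interval := 15 * time.Minute
	outage := 2 * time.Minute // actual downtime, straddling exactly one check

	// One failed check only tells you the node was down somewhere between
	// the previous successful check and the next successful one, so the
	// naive attribution window is twice the check interval.
	worstCaseCharged := 2 * interval

	fmt.Println("actual downtime: ", outage)
	fmt.Println("charged downtime: up to", worstCaseCharged)
}
```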

If, however, a given satellite node goes down --for whatever reason-- then all SNs that were active and available when the satellite goes offline should remain as if those SNs were online the entire time the satellite was down. If a satellite node is guaranteed to be available 100%, then it is inconsistent to account for satellite downtime in the calculation of SN uptime.

With TCP keepalives you could just set the interval to 5 minutes and watch the socket. But that is just one way (and maybe not the safest?). However, we are not getting paid to find the best way to do keepalives :smiley: Those who are will come up with a good way.
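
For what it’s worth, here is a rough Go sketch of that keepalive idea, assuming the satellite keeps a long-lived TCP connection to the node; the address and intervals are invented, and whether the network actually works this way is exactly what’s being discussed:

```go
package main

import (
	"log"
	"net"
	"time"
)

// dialWithKeepalive opens a TCP connection and asks the kernel to probe the
// peer every 5 minutes; a dead peer then surfaces as a read/write error.
func dialWithKeepalive(addr string) (*net.TCPConn, error) {
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err
	}
	tcp := conn.(*net.TCPConn)
	if err := tcp.SetKeepAlive(true); err != nil {
		tcp.Close()
		return nil, err
	}
	if err := tcp.SetKeepAlivePeriod(5 * time.Minute); err != nil {
		tcp.Close()
		return nil, err
	}
	return tcp, nil
}

func main() {
	// Hypothetical node address, for illustration only.
	if _, err := dialWithKeepalive("node.example.com:28967"); err != nil {
		log.Println("node unreachable:", err)
	}
}
```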

This is out-of-band signalling. However, it will only indicate that a host is reachable… it won’t indicate that the host is available to serve data within the network.

Let me just cut this discussion short by posting the link to the new design doc.

There’s no need to speculate when we know there is more information available already.

There is also no need to complicate things. You don’t need the exact downtime, just the downtime you’re certain about. As soon as an occasional uptime check detects that the node is down, it will simply assume the outage started at that instant. Increased-frequency checking will then start until the node is back up or disqualified. This should work just fine to get a good estimate that favors the storagenode if it had already been offline for a while when the first uptime check hit it.
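
Roughly, the logic I mean looks like this; it’s a sketch only, and the probe, intervals, and function names are my own illustrations, not the actual satellite code:

```go
package main

import (
	"fmt"
	"time"
)

// checkNode stands in for whatever probe the satellite actually uses; it
// simply alternates by the second here so the example terminates.
func checkNode() bool { return time.Now().Unix()%2 == 0 }

// trackDowntime waits for the first failed check, assumes the outage began
// at that instant (the assumption most favorable to the node), then checks
// more frequently until the node answers again.
func trackDowntime(normal, fast time.Duration) time.Duration {
	for {
		time.Sleep(normal)
		if checkNode() {
			continue
		}
		start := time.Now()
		for !checkNode() {
			time.Sleep(fast)
		}
		return time.Since(start)
	}
}

func main() {
	// Intervals shortened so the demo finishes quickly; a satellite would
	// use something like 15 minutes and 1 minute instead.
	fmt.Println("certain downtime:", trackDowntime(2*time.Second, 500*time.Millisecond))
}
```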

When a satellite is down (or disconnected from the network) uptime checks should not take place. This seems fairly simple to implement, but as mentioned likely no longer needed in a production implementation.

It seems to me that this is a case where good enough is just fine, as long as you err on the side of giving the storagenode the benefit of the doubt.

PS. I’m pretty sure the Nyquist frequency only applies to sinusoidal functions. Like sound waves. It doesn’t really apply here.

I’m not so sure that any calculation based on node uptime using in-band pings is going to work well for measuring a relatively small 5-hour-per-month window. Any reasonably precise measurement isn’t going to scale well across thousands of SNs.

But… it’s not my network… I just run an SN until I don’t.

Applies to any and all temporal measurements.

There’s even a decent illustration in the old movie “Dressed to Kill” … taking pictures of people walking in and out of the doctor’s office… The kid measures the time it takes to walk to the door and sets his camera timer to half that interval… Nyquist in the movies.

Please, read the document :slight_smile:
This is not an in-band ping, it’s a request to the storagenode API.

Nobody claimed it would be :smiley:
Anyway, a lot of unimportant messages in this thread.

The storage node communicates with the satellite via the single open port. When 118 went down at the beginning of October… other ports were still open on the satellite’s IP address, but my storage node’s uptime clock still ticked downwards.

This is an unacceptable condition if such a downward-ticking clock is a determining factor in storage node disqualification.

I’ve already pointed out in a prior post that the satellite could run a third-entity service which connects to a storage node via a guaranteed channel… i.e. when a satellite node goes offline for maintenance, there’s still a ping port open.
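
As a sketch of what I mean, assuming an invented node list and interval: a tiny checker that runs next to the satellite but in its own process, so reachability records keep accumulating even while the satellite binary itself is down for an upgrade:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Invented node addresses; a real checker would read the satellite's
	// node table instead.
	nodes := []string{"node-a.example.com:28967", "node-b.example.com:28967"}
	for {
		for _, addr := range nodes {
			conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
			up := err == nil
			if up {
				conn.Close()
			}
			// A real service would persist these results; logging stands in here.
			log.Printf("%s up=%v", addr, up)
		}
		time.Sleep(15 * time.Minute)
	}
}
```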

Stop whining about something that is caused by a soon-to-be-replaced system, and which also won’t happen in a production environment, unlike the current beta.
Your post is irrelevant.

Better to voice an opposing opinion before the cement dries.

It’s possible … once I thought I was wrong, but I was mistaken