Test new downtime tracking system

That is a good one with some downtime:

tracked_at seconds
01/26/20 07:03 1,284
02/04/20 00:18 1,067

The satellite has not noticed it.


Did the nodes need to be updated to a certain version for this tracking to be effective?

Thank you everyone. I think we can close this test round. I have the results I was looking for and will talk with the developer. I will share my observations later.


While Storj is testing the downtime tracking system, I’ve made this particular node go “offline” by having its DNS resolve only to an IPv6 address for two hours (the satellite only uses IPv4). When the two hours have passed, the node should come back online for one hour, because its DNS resolves to an IPv4 address again. The cycle then repeats.

root@server030:/disk002/logs/storagenode# tac server029-v1.1.1-30029.log |  grep -m1 "Node "  
2020-04-11T03:01:08.052+0200    INFO    Node 1gtSCfPByaf3ijepjJr7fE2iK8CJvKKz5W3dCEThrzqVz71XLn started

root@ns1:# grep "storj" /etc/crontab  
0       2,5,8,11,14,17,20,23  *  *  *  root  /etc/bind/scripts/storj-ipv4.sh
0       0,3,6,9,12,15,18,21   *  *  *  root  /etc/bind/scripts/storj-ipv6.sh 
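The crontab above gives a 3-hour cycle. A tiny Python helper (purely illustrative, ignoring DNS TTL and caching) showing when the node is reachable over IPv4:

```python
def node_online(hour):
    """True when the crontab above has the node on IPv4 (reachable).

    storj-ipv6.sh runs at hours 0, 3, 6, ... (offline to the
    IPv4-only satellite for two hours); storj-ipv4.sh runs at
    hours 2, 5, 8, ... (online for one hour).
    """
    return hour % 3 == 2
```

So within each 3-hour cycle the node is offline for two hours and online for one, matching the 2-hours-down / 1-hour-up pattern described above.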

EDIT: ok, I missed the test :wink:

The satellite will not notice it. At the moment the satellite is busy pinging a lot of nodes that have left the network. The idea was to DQ these nodes, but that step is not implemented yet, so we keep pinging them all.

The good news is that this also means the downtime tracking system is accurate. In early February we had a bug that has since been fixed: the satellite was counting the time between the last contact success and the first failed ping as downtime. @deathlessdd, that is what happened in your case.
I was wondering what would happen on the other side. Do we add the time of the last successful ping? It doesn’t look like it, because otherwise we should see at least some results for the 4-hour downtime nodes. For the moment I have to believe the system is calculating correctly and is just unable to ping frequently enough. We need at least 2 failed pings in order to count the time between those pings as downtime.
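The counting rule described above can be sketched in a few lines of Python (a hypothetical illustration of the rule, not the satellite’s actual code):

```python
from datetime import datetime, timedelta

def recorded_downtime(pings):
    """Sum only the gaps between consecutive failed pings.

    pings: chronological list of (timestamp, success) tuples.
    The gap between the last success and the first failure is NOT
    counted (counting it was the early-February bug), so a single
    failed ping inside an outage records zero downtime.
    """
    total = timedelta()
    last_failure = None
    for ts, ok in pings:
        if ok:
            last_failure = None
        elif last_failure is not None:
            total += ts - last_failure
            last_failure = ts
        else:
            last_failure = ts
    return total
```

For example, a node that is offline for 4 hours but gets pinged only once during the outage shows 0 recorded downtime; pinged at hour 1 and hour 3 of the outage, it shows 2 hours.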

In the next release we already have a few throughput improvements. I also hope we can simply exclude the offline nodes and stop pinging them, at least until DQ is ready to take over.


I’m sure. Sorry for my late reply.

Do you mean there was a contiguous 4 hour block of downtime at some point, or do you just mean 4 hours total? The satellites might reasonably not have noticed the latter case.

4 hours in one block.

I have checked the satellite database. At the moment the satellite is pinging 5000 offline storage nodes. Only 500 of them have been seen in the last 7 days and are currently offline. Pinging all 5000 storage nodes takes the satellite 2 days. That means the satellite will record 0 downtime for any node that gets back online before we send the second ping 2 days later.
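The arithmetic behind those figures (a worked illustration using the numbers above, not satellite code):

```python
offline_nodes = 5000
sweep = 2 * 86400                 # one full ping cycle: 2 days, in seconds
per_node = sweep / offline_nodes  # 34.56 s spent on each node

# Downtime is only counted between two failed pings, so an outage
# shorter than one sweep interval is hit by at most one ping and
# therefore records zero downtime.
print(per_node)
```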

On the master branch we have some improvements. I expect that we will get down to 4 hours with that. It would help if we could simply exclude any node that has not been seen in the last 7 days. That should bring us down to a 30-minute interval. This fix can be temporary until downtime DQ gets implemented and activated.


The downtime suspension system would mean it’ll take a while until nodes are disqualified, so you may still have quite a few offline nodes to ping.
I’m just curious, but why does it take 2 full days to ping 5000 nodes?


I must say I’d like to know too :thinking:

Very late reply, but this was basically because no effort had been expended to make “ping all nodes” fast yet. The easiest thing to do was to try connecting to each node one at a time: establishing a TLS connection and verifying the remote identity, querying the node’s status, updating it in the db, then moving on to the next one. I don’t remember exactly what version of the code was running at the time the earlier replies here took place, but apparently there was a timeout of around 30 s that had to elapse before we counted a node as offline and moved on (assuming most of those 5000 nodes were really offline, 2 * 86400 / 5000 ≈ 35 s per node, which lines up with that timeout plus some overhead).
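One common way to make such a sweep fast is to check nodes concurrently instead of one at a time. A generic sketch in Python (not Storj’s actual Go implementation; `ping_all` and its arguments are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def ping_all(nodes, ping, workers=100):
    """Check many nodes in parallel rather than one at a time.

    `ping` is any callable returning True (online) / False (offline).
    Sequentially, 5000 offline nodes x a ~30 s timeout is ~42 hours;
    with 100 concurrent workers the same sweep takes on the order of
    minutes instead, since timeouts overlap rather than accumulate.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(nodes, pool.map(ping, nodes)))
```

Usage: `ping_all(node_ids, check_node)` returns a node-to-status mapping, which could then be written back to the db in one pass.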

Once that approach started causing problems, we changed it.
