New storage node downtime tracking feature

Thanks for the suggestion, but I already have NTP sync on all servers.

@littleskunk Please let me know if you need any additional tests; I will be glad to help.

Hey Odmin! I am one of the developers who worked on the new downtime tracking feature, and I am curious to know more about your tests. If our downtime tracking underestimates downtime slightly (which it will), that is okay; what worries me is any situation where it overestimates. I am referring to this test you ran:

I want to describe what I think is happening, then maybe you can verify whether or not you agree, and whether we can say this is an issue in production (@littleskunk please let me know your opinion as well):

When you start a new storj-sim instance, the storagenode contact chore (configured at a 30s interval) starts up at almost the same time as the satellite downtime detection chore (also configured at a 30s interval). This is important to note because it explains why you always get 29s as your downtime regardless of when you take your node down - all 10 nodes ping the satellite to check in, and almost at the same time, the detection chore runs. When you take a node offline, the detection chore won't notice until the node has been offline for at least one detection chore interval (storj/satellite/downtime/detection_chore.go at efa0f6d443b5fcb2f950a933262e6f70438d59af · storj/storj · GitHub). Then, if the satellite pings a node that has been offline (for more than 30s in this case), it calculates offline time as now - lastSeenTime - 30s (storj/satellite/downtime/detection_chore.go at efa0f6d443b5fcb2f950a933262e6f70438d59af · storj/storj · GitHub).
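
To make the arithmetic concrete, here is a minimal sketch of that calculation as I understand it - this is not the actual code in detection_chore.go, and the names (`recordedOffline`, `lastSeen`, `choreInterval`) are just for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// choreInterval mirrors the 30s detection chore interval described above.
const choreInterval = 30 * time.Second

// recordedOffline returns the offline duration the detection chore would
// record when a ping at `now` fails for a node last seen at `lastSeen`.
// The chore only pings nodes whose last contact is older than the chore
// interval; otherwise nothing is recorded.
func recordedOffline(now, lastSeen time.Time) (time.Duration, bool) {
	if now.Sub(lastSeen) <= choreInterval {
		return 0, false // node was seen recently; the chore skips it
	}
	// offline time = now - lastSeenTime - 30s, as linked above
	return now.Sub(lastSeen) - choreInterval, true
}

func main() {
	start := time.Now()
	// Check-in at t=0, detection chore pings (and fails) at t=45:
	offline, recorded := recordedOffline(start.Add(45*time.Second), start)
	fmt.Println(offline, recorded) // prints "15s true"
}
```

Run as-is, this reproduces the 15-second arithmetic in the example below.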

The main point I am trying to make is that the satellite doesn’t know about or pay attention to the time of your signal for termination. It only cares about the last time you checked in or were pinged by the satellite. So if at t=0, you check in, then at t=44 you take your node offline, then at t=45 the offline detection chore runs (and is on a 30s interval), your node will be marked as having 45 - 0 - 30 = 15 seconds of offline time, even though the node was actually only taken down for 1s.

EDIT: the above example is bad, because the node would also check in at t=30, meaning the 15s offline time would be accurate. And so my original suggested change is irrelevant and I removed it.

I would still like to understand more about the conditions that resulted in @Odmin’s test, which I quoted above. If possible, it would be nice to add some logging while the test is run.

I am happy to clarify anything, and am looking forward to more discussion on this.

@Odmin, in the test I quoted above, do you know the timestamp of the previous check-in ping (before the signal for termination)? The node should only be pinged by the detection chore if more than 30s had already passed since it last checked in - so 4s of downtime should never, under any circumstances, result in the detection chore pinging.

@moby this was the test setup.

@littleskunk When I follow those steps, I do not get any entries inside the nodes_offline_times table. I only see an entry when I leave the node offline until after 0:01:00 (enough time for two detection chore intervals). The first detection chore doesn't pick up the offline node because 30s have not yet passed since the last successful ping. If the node checks in again at 0:00:50, the second detection chore doesn't pick it up either.

So bringing the node back online at 0:00:50 -> 0 seconds downtime, but bringing the node back online at 0:01:20 -> 29 seconds downtime.
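
To spell out one timeline consistent with both outcomes (the exact timestamps here are assumptions for illustration): say the node's last successful check-in is at t=20s, and the detection chore runs near t=49s and t=79s. At t=49s only 29s have passed since the check-in, so the chore skips the node. If the node came back online around t=50s, the t=79s run again sees a recent contact and records nothing: 0 seconds. If the node is still down at t=79s, the ping fails and the chore records 79 - 20 - 30 = 29 seconds of downtime.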

Hi @moby!
Glad to hear from you!

All my reproduction steps are described here; I followed all the recommendations that @littleskunk provided for me.
I repeat them for every batch of tests, including this one.

You can look into the "Summary" section; for each batch of tests I attach a full log with timestamps, starting from the "signal for termination".

The last test was hard to reproduce; here is what I did:

  1. Run storj-sim and make sure nodes_offline_times is empty; delete any data that might be in there.
  2. Wait for the next check from the satellite side and start a stopwatch timer on my phone.
  3. Wait until the stopwatch shows a few seconds before the next check, then terminate the first storagenode.
  4. Wait for the "pingErrorMessage" for the killed node, then start it again immediately after this message.
  5. Wait for "checking in" for this node in the log.
  6. Wait for the next satellite check of the nodes.
  7. Finally, look into the nodes_offline_times table and calculate the real storagenode downtime between the "kill" and "checking in" times (see the sketch after this list).
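
For step 7, one way to inspect the table could be something like this - the driver, connection string, and column names (`node_id`, `tracked_at`, `seconds`) are my assumptions, so please adjust them to your storj-sim satellite database:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // assumes a Postgres satellite DB
)

func main() {
	// Connection string and schema are assumptions; adjust them to match
	// your storj-sim satellite database.
	db, err := sql.Open("postgres", "postgres://localhost/satellite?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(
		`SELECT node_id, tracked_at, seconds FROM nodes_offline_times ORDER BY tracked_at`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var nodeID []byte
		var trackedAt time.Time
		var seconds int
		if err := rows.Scan(&nodeID, &trackedAt, &seconds); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("node=%x tracked_at=%s offline=%ds\n", nodeID, trackedAt, seconds)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```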

Here is a log of my last test, you can check it.

Also, you have to be a really fast shooter to reproduce it :smiley:

I managed it only after 15-20 minutes of practicing "kill" and "up", but it is reproducible.

The following commit modifies the downtime tracking feature so that we never overestimate storagenode downtime: https://github.com/storj/storj/commit/c4a9a5d48b9ab7f3f4362c57eb3cf979087fe933
Commit message:

satellite/downtime: update detection and estimation downtime chores for more trustworthy downtime tracking

Detection chore: Do not update downtime at all from the detection chore.
We only want to include downtime between two explicitly failed ping attempts
(the duration between last contact success and the first failed ping is no longer
included in downtime calculation)

Estimation chore: If the satellite started after the last failed ping for a node,
do not include offline time since the last failed ping time - only
estimate based on two failed pings with no satellite downtime in
between.
This protects us from including satellite downtime in our storagenode downtime calculations.
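
In other words, my reading of the new rule fits in a few lines - a minimal sketch with hypothetical names (`prevFailedPing`, `currentFailedPing`, `satelliteStart`), not the actual chore code:

```go
package main

import (
	"fmt"
	"time"
)

// estimatedOffline sketches the estimation rule from the commit above:
// downtime is only the span between two explicitly failed pings, and an
// interval is skipped entirely if the satellite (re)started inside it,
// since the gap could include satellite downtime rather than node downtime.
func estimatedOffline(prevFailedPing, currentFailedPing, satelliteStart time.Time) time.Duration {
	if satelliteStart.After(prevFailedPing) {
		return 0 // the satellite was down for part of the gap; count nothing
	}
	return currentFailedPing.Sub(prevFailedPing)
}

func main() {
	satelliteStart := time.Now()
	prev := satelliteStart.Add(10 * time.Second)
	curr := prev.Add(30 * time.Second)
	fmt.Println(estimatedOffline(prev, curr, satelliteStart)) // prints "30s"
}
```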

Thanks a lot @moby!
It is really important for the production launch.
