New storage node downtime tracking feature

You need a storj-sim instance for this test. When you connect to the postgres database you will find a new table called nodes_offline_times.
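To check the table you can point psql (or any Postgres client) at whatever database you configured for storj-sim; the connection details depend on your setup, and this is just a quick inspection sketch:

```sql
-- Show everything the satellite has recorded so far
SELECT * FROM nodes_offline_times;

-- Clear the table before starting a test run
DELETE FROM nodes_offline_times;
```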

I would recommend the following satellite setting:
audit.queue-interval: 24h0m0s, so that the satellite doesn't contact the storage nodes through any other service.
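If you prefer editing the config file over passing a flag, the setting goes into the satellite's config.yaml; the path below assumes the default storj-sim layout:

```yaml
# .local/share/storj/local-network/satellite/0/config.yaml (assumed storj-sim path)
# Push the audit queue far out so only the downtime-tracking pings reach the nodes
audit.queue-interval: 24h0m0s
```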

I also recommend the following storage node setting:
contact.interval: 24h0m0s, so that the storage node doesn't ping the satellite on its own.
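Same idea on the node side, again assuming the default storj-sim layout:

```yaml
# .local/share/storj/local-network/storagenode/0/config.yaml (assumed storj-sim path)
# Stop the node from checking in on its own so only the satellite's pings drive the test
contact.interval: 24h0m0s
```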

Test steps:

  1. Run storj-sim and make sure nodes_offline_times is empty; delete any data that might be in there (the SQL sketch above works for that).
  2. Stop one storage node. Every 30 seconds the satellite should ping all storage nodes, and one of them will now fail to respond (see the shell sketch after this list for one way to do this).
  3. What is getting inserted into the table?
  4. After one failed ping, run the storage node as a standalone process: storagenode run --config-dir .local/share/storj/local-network/storagenode/0
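One possible end-to-end flow for steps 2 and 4; the pkill pattern is just one way to stop a single node and assumes the default storj-sim process layout:

```sh
# Terminal 1: run the local network (assumes `storj-sim network setup` was already done)
storj-sim network run

# Terminal 2: stop storage node 0 by killing its process (one way to do it)
pkill -f 'storagenode.*local-network/storagenode/0'

# Wait for at least one failed ping, then bring the node back as a standalone process
storagenode run --config-dir .local/share/storj/local-network/storagenode/0
```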

My expectation:
Let's say the storage node fails one ping and the next ping is successful. That would look like this:

  1. Successful ping at 0:00:00
  2. Failed ping at 0:00:30
  3. Storage node startup CheckIn ping at 0:00:50

I bet the current implementation takes the full time between both successful pings. That wouldn't be correct. In production it would mean the satellite applies a full hour of downtime even if the storage node was offline for only 5 minutes. Correct would be to exclude the time between the last successful ping and the first failed ping: in this example I would expect 20 seconds (0:00:30 to 0:00:50), but the current implementation might return 50 seconds (0:00:00 to 0:00:50).
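To make the difference concrete, here is a small Go sketch of the two calculations; it is not the satellite's actual code, just the arithmetic behind the numbers above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	base := time.Date(2019, 10, 1, 0, 0, 0, 0, time.UTC)

	lastSuccessfulPing := base                    // 0:00:00
	firstFailedPing := base.Add(30 * time.Second) // 0:00:30
	checkIn := base.Add(50 * time.Second)         // 0:00:50

	// Suspected current behavior: full span between the two successful contacts
	naive := checkIn.Sub(lastSuccessfulPing)

	// Expected behavior: only count from the first failed ping onward
	expected := checkIn.Sub(firstFailedPing)

	fmt.Println("naive:", naive)       // 50s
	fmt.Println("expected:", expected) // 20s
}
```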

Let me know if you have any problems or need additional information to verify this theory. If you have problems with the timing you can also increase the interval on the satellite side: run the same test with a 5 minute interval and it should become a bit more obvious what is going on.
