New storage node downtime tracking feature

Can I ask where I can find information about the storage node downtime tracking feature?

I see that it is already implemented on the 118 satellite:

[screenshot]

And a second question: how can an SNO monitor uptime status via the API at the moment?

4 Likes

I predict some earnings calculator breakage in my future :wink:

I would also like to know a little more, specifically about changes in the node DBs and APIs, if possible.

2 Likes

Ditto. It would be great if these changes could be communicated clearly to SNOs in advance. We use these metrics for tracking our nodes and it’s frustrating to wake up to this from my monitoring server when nothing is actually wrong.

[screenshot]

5 Likes

I am sorry but I am unable to tell you more. The new uptime tracking is not working. I was told that I should not waste any time on it. As long as we don’t disqualify storage nodes I can follow this advice and ignore the topic.

My storj-sim instance should have 0 downtime for all storage nodes. I can see that this is not the case, and I have an idea why that is happening. Does anyone want to volunteer?

1 Like

Of course, I would like to help!

A Proof of Authority blockchain clock running independently on each Satellite… each SN checks in every 10 minutes. The bandwidth requirement is tiny, since no data needs to be exchanged. There’s no way to cheat… It does not require an SQL query… The Satellite can query the local blockchain every hour to update the SQL DB.

A blockchain is not needed for that; it’s just excess work and coordination. Each satellite measures uptime independently, and the probability of changing that is very low at the moment.

Blockchain is necessary for assurance reasons. Both sides of the communication channel need to be able to say “I believe the result” … blockchain does that. The advantage of a blockchain ledger system is that the code can be a very simple smart contract and run independently from the rest of the Satellite software.

1 Like

I know how blockchains work. Who will pay for the smart contract to be executed?
Developing our own blockchain is a waste of resources; there are plenty of them already. A blockchain is not needed in the Storj network at all. It’s slow, it requires coordination, and it’s expensive (in resources and/or money).
Especially since the storage node must trust a satellite anyway, otherwise it will not join the network.

4 Likes

I agree with Alexey. Monitoring software that pings the node from outside will solve the problem. The dashboard also gives uptime info. It would be perfect if the dashboard also showed online status.

PoA is neither slow nor expensive…

PoA Description on Changelly

PoA is centralized, but that’s fine since Storj reputation is centralized. Implementing a PoA blockchain for tracking uptime reputation would work quite well. It doesn’t need to cost anything except development time for a private blockchain. The node’s data reputation could also be added to any PoA blockchain solution.

If such a blockchain were implemented, the entire discussion of missing downtime counts would be overtaken by events (OBE). The reputation equations are already written, and implementing those equations in a smart contract is trivial.

As you point out, SNs must already trust the Satellites… so why not let the blockchain do the hard work for practically nothing.

Colleagues, let’s keep the thread clean.
The topic is: “New storage node downtime tracking feature”

You need a storj-sim instance for this test. When you connect to the postgres database you will find a new table called nodes_offline_times.
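
If you want a quick look at what ends up in that table, here is a minimal Go sketch (not part of storj, just a throwaway helper) that dumps it. The connection string is an assumption, so use whatever --postgres value you started storj-sim with; the columns are read generically so nothing about the schema is hard-coded:

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq" // postgres driver, assumed to be installed
    )

    func main() {
        // Assumption: adjust the DSN to match your storj-sim postgres instance.
        db, err := sql.Open("postgres", "postgres://postgres@localhost:5432/teststorj?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        rows, err := db.Query("SELECT * FROM nodes_offline_times")
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        cols, err := rows.Columns()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(cols)

        // Scan every column into a generic value so the exact schema doesn't matter.
        vals := make([]interface{}, len(cols))
        ptrs := make([]interface{}, len(cols))
        for i := range vals {
            ptrs[i] = &vals[i]
        }
        for rows.Next() {
            if err := rows.Scan(ptrs...); err != nil {
                log.Fatal(err)
            }
            fmt.Println(vals)
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
    }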

I would recommend the following satellite setting:
audit.queue-interval: 24h0m0s (to make sure the satellite doesn’t contact the storage nodes with any other service)

I also recommend the following storage node setting:
contact.interval: 24h0m0s (to make sure the storage node is not pinging the satellite)

Test steps:

  1. Run storj-sim, make sure nodes_offline_times is empty, delete any data that might be in there.
  2. Stop one storage node. Every 30 seconds the satellite should ping all storage nodes, and the stopped one will fail to respond.
  3. What is getting inserted into the table?
  4. After one failed ping, run the storage node as a standalone process: storagenode run --config-dir .local/share/storj/local-network/storagenode/0

My expectation:
Let’s say the storage node fails one ping and the next one is successful. That will look like this:

  1. Successful ping at 0:00:00
  2. Failed ping at 0:00:30
  3. Storage node startup CheckIn ping at 0:00:50

I bet the current implementation takes the full time between the two successful pings. That wouldn’t be correct. In production it would mean the satellite applies a full 1h of downtime even if the storage node was offline for only 5 minutes. The correct behaviour would be to exclude the time between the last successful ping and the failed ping, since the node may still have been online then, and count downtime only from the failed ping onward. In this example I would expect 20 seconds, but the current implementation might return 50 seconds.
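
To make the difference concrete, here is a small Go sketch of just the arithmetic (not the satellite code) using the timestamps from the example above, comparing the suspected behaviour of charging the whole window between the two successful contacts with the expected behaviour of charging only the time after the failed ping:

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        base := time.Date(2020, 1, 19, 0, 0, 0, 0, time.UTC)

        successfulPing := base                   // 0:00:00 successful ping
        failedPing := base.Add(30 * time.Second) // 0:00:30 failed ping
        checkIn := base.Add(50 * time.Second)    // 0:00:50 storage node startup CheckIn ping

        // Suspected behaviour: charge the whole window between the two successful
        // contacts, which includes time the node was provably still online.
        suspected := checkIn.Sub(successfulPing)

        // Expected behaviour: only the time after the first failed ping can be
        // counted as downtime.
        expected := checkIn.Sub(failedPing)

        fmt.Println("suspected downtime:", suspected) // 50s
        fmt.Println("expected downtime: ", expected)  // 20s
    }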

Let me know if you have any problems or need additional information to verify this theory. If you have problems with the timing, you can also increase the interval on the satellite side. Run the same test with a 5 minute interval and it should become a bit more obvious what is going on.

1 Like

I understood the problem and will hunt for it; preparing the test environment now.

1 Like

The test environment is ready, I will follow the reproduction steps and get back to you with the results.

@littleskunk I have a problem with starting storage nodes in storj-sim:

2020-01-19T11:42:43.883+0200 FATAL storagenode/peer.go:455 failed preflight check {"error": "system clock is out of sync: system clock is out of sync with all trusted satellites", "errorVerbose": "system clock is out of sync: system clock is out of sync with all trusted satellites\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check:96\n\tstorj.io/storj/storagenode.(*Peer).Run:454\n\tmain.cmdRun:206\n\tstorj.io/storj/pkg/process.cleanup.func1.2:299\n\tstorj.io/storj/pkg/process.cleanup.func1:317\n\tgithub.com/spf13/cobra.(*Command).execute:826\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:914\n\tgithub.com/spf13/cobra.(*Command).Execute:864\n\tstorj.io/storj/pkg/process.ExecWithCustomConfig:79\n\tstorj.io/storj/pkg/process.Exec:61\n\tmain.main:326\n\truntime.main:203"}

I tried to enable and extend:
# allows for small differences in the satellite and storagenode clocks
retain.max-time-skew: 256h0m0s

but no result, the error is the same.
I prepared and tested the storj-sim environment yesterday, then went to sleep; today I woke up to this issue.

You can disable that with preflight.local-time-check: false on the storage node side.

Thanks! Already solved it by reverting the test environment to yesterday’s state. Now collecting information…

  1. nodes_offline_times is empty.

  2. Stopped one storage node and waited for the ping to fail:
    pkill -f 'storagenode --metrics.app-suffix sim --log.level debug --config-dir /root/.local/share/storj/local-network/storagenode/0'

storagenode/0 12Evc4gbFx 13:07:15.468 | INFO process/exec_conf.go:106 Got a signal from the OS: "terminated"

storagenode/9         12paQUpC4j 13:07:36.431 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:40882"}
storagenode/6         12JwFLkQCX 13:07:36.446 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:33558"}
storagenode/8         1McZDMowvh 13:07:36.459 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:47832"}
storagenode/4         1jBKU9byaE 13:07:36.473 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:46000"}
storagenode/2         1CykvQkEg1 13:07:36.489 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:38178"}
storagenode/3         1379dcd1tH 13:07:36.504 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:47586"}
storagenode/5         12WgANaHTM 13:07:36.518 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:59336"}
satellite-core/0                 13:07:36.524 | INFO    contact:service contact/service.go:78   pingBack failed to dial storage node    {"Node ID": "12Evc4gbFxjGPYtEeZgWP8sW3yHDq41Wrb5TsngrYxyxS93gchy", "node address": "127.0.0.1:13000", "pingErrorMessage": "failed to dial storage node (ID: 12Evc4gbFxjGPYtEeZgWP8sW3yHDq41Wrb5TsngrYxyxS93gchy) at address 127.0.0.1:13000: rpccompat: dial tcp 127.0.0.1:13000: connect: connection refused", "error": "rpccompat: dial tcp 127.0.0.1:13000: connect: connection refused", "errorVerbose": "rpccompat: dial tcp 127.0.0.1:13000: connect: connection refused\n\tstorj.io/common/rpc.Dialer.dialTransport:242\n\tstorj.io/common/rpc.Dialer.dial:219\n\tstorj.io/common/rpc.Dialer.DialAddressID:138\n\tstorj.io/storj/satellite/contact.dialNode:21\n\tstorj.io/storj/satellite/contact.(*Service).PingBack:70\n\tstorj.io/storj/satellite/downtime.(*Service).CheckAndUpdateNodeAvailability:39\n\tstorj.io/storj/satellite/downtime.(*DetectionChore).Run.func1:59\n\tstorj.io/common/sync2.(*Cycle).Run:147\n\tstorj.io/storj/satellite/downtime.(*DetectionChore).Run:43\n\tstorj.io/storj/satellite.(*Core).Run.func15:437\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
storagenode/7         124wGtyPhS 13:07:36.538 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:55954"}
storagenode/1         1sXFDyUX4P 13:07:36.552 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:36356"}
  3. The table nodes_offline_times now has an entry.

  4. storagenode run --config-dir .local/share/storj/local-network/storagenode/0

    satellite/0 129TWQz16B 13:07:54.377 | DEBUG contact:endpoint contact/endpoint.go:112 get system current time {"timestamp": "2020-01-19 11:07:54.376993364 +0000 UTC", "node id": "12Evc4gbFxjGPYtEeZgWP8sW3yHDq41Wrb5TsngrYxyxS93gchy"}
    satellite/0 129TWQz16B 13:07:54.410 | DEBUG contact:endpoint contact/endpoint.go:94 checking in {"node addr": "127.0.0.1:13000", "ping node success": true, "ping node err msg": ""}

    satellite-core/0 13:08:06.420 | DEBUG downtime:detection downtime/detection_chore.go:46 checking for nodes that have not had a successful check-in within the interval. {"interval": "30s"}
    satellite-core/0 13:08:06.421 | DEBUG downtime:estimation downtime/estimation_chore.go:47 checking uptime of failed nodes {"interval": "30s"}
    satellite-core/0 13:08:06.422 | DEBUG downtime:detection downtime/detection_chore.go:54 nodes that have had not had a successful check-in with the interval. {"interval": "30s", "count": 0}

The table nodes_offline_times now has the entry.


I posted the full log here.

Summary:
13:07:15 - signal for termination
13:07:36 - ping fail
13:07:54 - checkin ping

Total node downtime - 39s.
Table nodes_offline_times - 29s.

I did another test with the same reproduction steps (but with a different storage node termination time offset):
16:18:16 - signal for termination
16:18:54 - ping fail
16:18:57 - checkin ping

Total node downtime - 41s.
Table nodes_offline_times - 29s.


Full log is here

And one final test:
16:34:46 - signal for termination
16:35:18 - ping fail
16:35:19 - checkin ping

Total node downtime - 33s.
Table nodes_offline_times - 29s.


Full log is here

@littleskunk I can confirm your theory! The calculation is wrong: if the storage node’s termination time is close to the ping check, and the storage node’s check-in time is also close to the ping check, we will have a big problem in production.

I added one more interesting test, with the termination and check-in much closer to the satellite’s ping check:

09:03:18 - signal for termination
09:03:22 - ping fail
09:03:24 - checkin ping

Total node downtime - 6s.
Table nodes_offline_times - 29s.


Full log is here

3 Likes

I’m not sure if you will need it, but I use the following free and easy tool to synchronize system time - http://www.timesynctool.com/
It works very well…
I hope it will be useful to someone.