New storage node downtime tracking feature

  1. Table nodes_offline_times is empty.

  2. Stopped the node and waited for the ping to fail:
    pkill -f 'storagenode --metrics.app-suffix sim --log.level debug --config-dir /root/.local/share/storj/local-network/storagenode/0'

storagenode/0 12Evc4gbFx 13:07:15.468 | INFO process/exec_conf.go:106 Got a signal from the OS: "terminated"

storagenode/9         12paQUpC4j 13:07:36.431 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:40882"}
storagenode/6         12JwFLkQCX 13:07:36.446 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:33558"}
storagenode/8         1McZDMowvh 13:07:36.459 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:47832"}
storagenode/4         1jBKU9byaE 13:07:36.473 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:46000"}
storagenode/2         1CykvQkEg1 13:07:36.489 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:38178"}
storagenode/3         1379dcd1tH 13:07:36.504 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:47586"}
storagenode/5         12WgANaHTM 13:07:36.518 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:59336"}
satellite-core/0                 13:07:36.524 | INFO    contact:service contact/service.go:78   pingBack failed to dial storage node    {"Node ID": "12Evc4gbFxjGPYtEeZgWP8sW3yHDq41Wrb5TsngrYxyxS93gchy", "node address": "127.0.0.1:13000", "pingErrorMessage": "failed to dial storage node (ID: 12Evc4gbFxjGPYtEeZgWP8sW3yHDq41Wrb5TsngrYxyxS93gchy) at address 127.0.0.1:13000: rpccompat: dial tcp 127.0.0.1:13000: connect: connection refused", "error": "rpccompat: dial tcp 127.0.0.1:13000: connect: connection refused", "errorVerbose": "rpccompat: dial tcp 127.0.0.1:13000: connect: connection refused\n\tstorj.io/common/rpc.Dialer.dialTransport:242\n\tstorj.io/common/rpc.Dialer.dial:219\n\tstorj.io/common/rpc.Dialer.DialAddressID:138\n\tstorj.io/storj/satellite/contact.dialNode:21\n\tstorj.io/storj/satellite/contact.(*Service).PingBack:70\n\tstorj.io/storj/satellite/downtime.(*Service).CheckAndUpdateNodeAvailability:39\n\tstorj.io/storj/satellite/downtime.(*DetectionChore).Run.func1:59\n\tstorj.io/common/sync2.(*Cycle).Run:147\n\tstorj.io/storj/satellite/downtime.(*DetectionChore).Run:43\n\tstorj.io/storj/satellite.(*Core).Run.func15:437\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
storagenode/7         124wGtyPhS 13:07:36.538 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:55954"}
storagenode/1         1sXFDyUX4P 13:07:36.552 | DEBUG   contact:endpoint        contact/endpoint.go:52  pinged  {"by": "129TWQz16BbmvR5Wra6kMjAZBX5z8qUVTqBctCUexgYCYzVtku5", "srcAddr": "127.0.0.1:36356"}
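For context, the errorVerbose field above shows the call path: the detection chore cycle calls CheckAndUpdateNodeAvailability, which calls the contact service's PingBack, and the dial fails because the node process is gone. Here is a minimal sketch of that pattern; only the function names come from the trace, the code itself is illustrative, not the actual satellite implementation:

```go
// Hypothetical sketch of the detection path named in the stack trace
// above (DetectionChore.Run -> CheckAndUpdateNodeAvailability ->
// PingBack). Not the real satellite code.
package main

import (
	"fmt"
	"net"
	"time"
)

// pingBack dials the node's contact address. A refused TCP dial is what
// produces the "rpccompat: dial tcp ... connection refused" error above.
func pingBack(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return fmt.Errorf("failed to dial storage node at %s: %w", addr, err)
	}
	return conn.Close()
}

func main() {
	// The chore wakes every interval (30s in the chore logs) and pings
	// nodes that have no recent successful check-in.
	nodes := []string{"127.0.0.1:13000"} // storagenode/0 in this test

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		for _, addr := range nodes {
			if err := pingBack(addr); err != nil {
				fmt.Println("pingBack failed:", err)
				// the real chore records a failure timestamp here,
				// which later feeds nodes_offline_times
			}
		}
	}
}
```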
  3. Table nodes_offline_times now has an entry.

  4. Restarted the node: storagenode run --config-dir .local/share/storj/local-network/storagenode/0

    satellite/0 129TWQz16B 13:07:54.377 | DEBUG contact:endpoint contact/endpoint.go:112 get system current time {"timestamp": "2020-01-19 11:07:54.376993364 +0000 UTC", "node id": "12Evc4gbFxjGPYtEeZgWP8sW3yHDq41Wrb5TsngrYxyxS93gchy"}
    satellite/0 129TWQz16B 13:07:54.410 | DEBUG contact:endpoint contact/endpoint.go:94 checking in {"node addr": "127.0.0.1:13000", "ping node success": true, "ping node err msg": ""}

    satellite-core/0 13:08:06.420 | DEBUG downtime:detection downtime/detection_chore.go:46 checking for nodes that have not had a successful check-in within the interval. {"interval": "30s"}
    satellite-core/0 13:08:06.421 | DEBUG downtime:estimation downtime/estimation_chore.go:47 checking uptime of failed nodes {"interval": "30s"}
    satellite-core/0 13:08:06.422 | DEBUG downtime:detection downtime/detection_chore.go:54 nodes that have had not had a successful check-in with the interval. {"interval": "30s", "count": 0}
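These lines show the other half of the cycle: on restart the node checks in, the satellite dials it back to verify the advertised address ("ping node success": true), and a fresh last-contact timestamp ends the offline window, so the next detection pass reports count 0. A rough sketch of that handshake, with hypothetical names and types (only the log messages above are real):

```go
// Hypothetical sketch of the check-in handshake logged by
// contact/endpoint.go above; names and types are illustrative.
package main

import (
	"fmt"
	"net"
	"time"
)

// lastContact models the per-node timestamps the satellite keeps.
type lastContact struct {
	Success time.Time
	Failure time.Time
}

// checkIn handles a node announcing itself after restart: the satellite
// pings the advertised address back before recording success.
func checkIn(nodes map[string]*lastContact, nodeID, addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		nodes[nodeID].Failure = time.Now()
		return false
	}
	conn.Close()
	nodes[nodeID].Success = time.Now() // ends the offline window
	return true
}

func main() {
	id := "12Evc4gbFxjGPYtEeZgWP8sW3yHDq41Wrb5TsngrYxyxS93gchy"
	nodes := map[string]*lastContact{id: {}}
	ok := checkIn(nodes, id, "127.0.0.1:13000")
	fmt.Println("ping node success:", ok)
}
```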

Table nodes_offline_times now has the downtime entry.


I posted the full log here.

Summary:
13:07:15 - signal for termination
13:07:36 - ping fail
13:07:54 - checkin ping

Total node downtime - 39s.
Table nodes_offline_times - 29s.
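The discrepancy is easier to see with the raw arithmetic: 29s matches neither the full outage (39s) nor the window between ping fail and check-in (18s), but it sits just under the 30s chore interval. A quick check, using the timestamps copied from the summary above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps copied from the summary above (time of day only).
	layout := "15:04:05"
	terminated, _ := time.Parse(layout, "13:07:15") // signal for termination
	pingFail, _ := time.Parse(layout, "13:07:36")   // ping fail
	checkin, _ := time.Parse(layout, "13:07:54")    // checkin ping

	fmt.Println("full outage:         ", checkin.Sub(terminated)) // 39s
	fmt.Println("ping fail to checkin:", checkin.Sub(pingFail))   // 18s
	// nodes_offline_times recorded 29s, matching neither window.
}
```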

I did another test with the same reproduction steps (but with a different storage node termination time offset):
16:18:16 - signal for termination
16:18:54 - ping fail
16:18:57 - checkin ping

Total node downtime - 41s.
Table nodes_offline_times - 29s.


Full log is here

And a final test:
16:34:46 - signal for termination
16:35:18 - ping fail
16:35:19 - checkin ping

Total node downtime - 33s.
Table nodes_offline_times - 29s.


Full log is here

@littleskunk I can confirm your theory! The calculation is wrong: if the storage node's termination time is close to the ping check, and its check-in time is also close to the ping check, we will have a big problem in production.
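A minimal sketch of the failure mode, under my assumption of what the wrong calculation looks like (not the actual satellite code): if the estimation records roughly one chore interval per cycle that found the node offline, instead of measuring the wall-clock window between termination and check-in, every outage near one interval collapses to ~29-30s. The third test's timestamps are used below:

```go
package main

import (
	"fmt"
	"time"
)

const choreInterval = 30 * time.Second

// intervalBased models the suspected wrong accounting: one interval per
// failed cycle, regardless of the real offline span (an assumption).
func intervalBased(failedCycles int) time.Duration {
	return time.Duration(failedCycles) * choreInterval
}

// wallClock is what the tests above measured by hand from the logs.
func wallClock(terminated, checkin time.Time) time.Duration {
	return checkin.Sub(terminated)
}

func main() {
	layout := "15:04:05"
	terminated, _ := time.Parse(layout, "16:34:46")
	checkin, _ := time.Parse(layout, "16:35:19")

	fmt.Println("wall clock:    ", wallClock(terminated, checkin)) // 33s
	fmt.Println("interval based:", intervalBased(1))               // 30s, ~29s in the table
}
```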

I added one more interesting test, with the termination timed closer to the satellite's ping check:

09:03:18 - signal for termination
09:03:22 - ping fail
09:03:24 - checkin ping

Total node downtime - 6s.
Table nodes_offline_times - 29s.
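Putting the four runs side by side makes the pattern clear: the recorded value is pinned at 29s whatever the real outage was, so longer outages are undercounted and short ones (like this 6s run) heavily overcounted:

```go
package main

import "fmt"

func main() {
	// actual downtime vs. nodes_offline_times value, from the four tests above
	tests := []struct {
		actual, recorded int // seconds
	}{
		{39, 29},
		{41, 29},
		{33, 29},
		{6, 29},
	}
	for i, t := range tests {
		fmt.Printf("test %d: actual %2ds, recorded %ds, error %+ds\n",
			i+1, t.actual, t.recorded, t.recorded-t.actual)
	}
}
```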


Full log is here
