macOS system clock is out of sync: system clock is out of sync with all trusted satellites

Bob · July 22, 2020, 4:01pm

I had a very bad day: after routine maintenance, neither of my 2 nodes, which had been running for months, passed the preflight check. I did not find the solution, but I wanted to share my investigation. First, I looked through all the previous posts for an answer, and none applied to my context.
I finally fixed the problem by moving my nodes to another Mac.
I was afraid of losing my oldest nodes and of being disqualified (DQ), after a few hours of going around in circles suspecting that the storage filesystem/DB was corrupted. It is still unexplained, but it might interest people with the same configuration.

Pattern
Infinite loop on a fatal error at startup; the node stays offline.
I made a few attempts to troubleshoot this issue:

restarted Docker; rebooted {no effect}
changed the NTP server in Date & Time preferences to time.google.com; time.apple.com; pool.ntp.org {no effect}
added a few lines to config.yaml: uncommented the timeout and preflight check variables {no effect}
finally moved the disks to the backup Mac mini and carefully cleaned up the containers to recreate them on the backup machine {success}
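
To confirm whether the host clock itself is off, independently of the storage node, the offset can be measured directly against one of the NTP servers listed above. The following is a minimal sketch in Python, not part of the Storj tooling: `parse_transmit_time` and `clock_offset` are hypothetical helper names, and the estimate ignores network asymmetry, so treat it as a rough check only.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_DELTA = 2208988800


def parse_transmit_time(packet: bytes) -> float:
    """Return the server transmit timestamp of an NTP reply as Unix time."""
    if len(packet) < 48:
        raise ValueError("NTP packet too short")
    # The transmit timestamp occupies bytes 40..47: 32-bit seconds + 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_DELTA + fraction / 2**32


def clock_offset(server: str = "pool.ntp.org", timeout: float = 5.0) -> float:
    """Rough estimate of (server time - local time) in seconds via one SNTP query."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent = time.time()
        sock.sendto(request, (server, 123))
        packet, _ = sock.recvfrom(512)
        received = time.time()
    # Compare the server timestamp with the midpoint of the local round trip.
    return parse_transmit_time(packet) - (sent + received) / 2


# Example (needs network access):
# print(f"offset: {clock_offset('time.apple.com'):+.3f} s")
```

A small offset here would suggest that the host clock is fine and that the preflight failure has another cause, such as the node being unable to reach the satellites at all.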

Question: Is there a way to check the machine without risking DQ by running the storage node container?

Configuration details
Node version 1.6.4
macOS 10 Catalina
2 full 1.9TB nodes running on a single machine
Logs are available on request at INFO level

Logs extract
2020-07-22T08:46:38.627Z INFO Telemetry enabled
2020-07-22T08:46:38.690Z INFO db.migration Database Version {"version": 42}
2020-07-22T08:46:48.918Z WARN trust Failed to fetch URLs from source; used cache {"source": "https://tardigrade.io/trusted-satellites", "error": "HTTP source: Get https://tardigrade.io/trusted-satellites: dial tcp: lookup tardigrade.io on 192.168.65.1:53: read udp 172.17.0.2:57347->192.168.65.1:53: i/o timeout", "errorVerbose": "HTTP source: Get https://tardigrade.io/trusted-satellites: dial tcp: lookup tardigrade.io on 192.168.65.1:53: read udp 172.17.0.2:57347->192.168.65.1:53: i/o timeout\n\tstorj.io/storj/storagenode/trust.(*HTTPSource).FetchEntries:63\n\tstorj.io/storj/storagenode/trust.(*List).fetchEntries:90\n\tstorj.io/storj/storagenode/trust.(*List).FetchURLs:49\n\tstorj.io/storj/storagenode/trust.(*Pool).fetchURLs:240\n\tstorj.io/storj/storagenode/trust.(*Pool).Refresh:177\n\tstorj.io/storj/storagenode.(*Peer).Run:688\n\tmain.cmdRun:200\n\tstorj.io/private/process.cleanup.func1.4:359\n\tstorj.io/private/process.cleanup.func1:377\n\tgithub.com/spf13/cobra.(*Command).execute:840\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:945\n\tgithub.com/spf13/cobra.(*Command).Execute:885\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:320\n\truntime.main:203"}
2020-07-22T08:46:48.924Z INFO preflight:localtime start checking local system clock with trusted satellites' system clock.
2020-07-22T08:46:58.931Z ERROR preflight:localtime unable to get satellite system time {"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "rpccompat: dial tcp: lookup europe-north-1.tardigrade.io on 192.168.65.1:53: read udp 172.17.0.2:44234->192.168.65.1:53: i/o timeout", "errorVerbose": "rpccompat: dial tcp: lookup europe-north-1.tardigrade.io on 192.168.65.1:53: read udp 172.17.0.2:44234->192.168.65.1:53: i/o timeout\n\tstorj.io/common/rpc.Dialer.dialTransport:290\n\tstorj.io/common/rpc.Dialer.dial:267\n\tstorj.io/common/rpc.Dialer.DialNodeURL:177\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).getSatelliteTime:110\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check.func1:67\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-07-22T08:47:33.033Z FATAL Failed preflight check. {"error": "system clock is out of sync: system clock is out of sync with all trusted satellites", "errorVerbose": "system clock is out of sync: system clock is out of sync with all trusted satellites\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check:96\n\tstorj.io/storj/storagenode.(*Peer).Run:692\n\tmain.cmdRun:200\n\tstorj.io/private/process.cleanup.func1.4:359\n\tstorj.io/private/process.cleanup.func1:377\n\tgithub.com/spf13/cobra.(*Command).execute:840\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:945\n\tgithub.com/spf13/cobra.(*Command).Execute:885\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:320\n\truntime.main:203"}

littleskunk · July 22, 2020, 4:13pm

That looks more like a problem with your connection to the satellite. Otherwise the message would print out the time offset; instead it says that it couldn’t reach the satellite.
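
One way to see this in the logs is to search for DNS lookup timeouts rather than a reported offset. The following is a small sketch in Python, assuming the error text has the same shape as in the log extract above; `dns_failures` is a hypothetical helper, not a Storj function.

```python
import re

# Matches Go-style resolver errors such as:
#   lookup <host> on <resolver>:53: read udp <src>-><dst>: i/o timeout
DNS_TIMEOUT = re.compile(
    r"lookup (\S+) on (\d+\.\d+\.\d+\.\d+):53: read udp \S+: i/o timeout"
)


def dns_failures(log_text: str) -> list[tuple[str, str]]:
    """Return sorted, de-duplicated (hostname, resolver IP) pairs that timed out."""
    return sorted(set(DNS_TIMEOUT.findall(log_text)))
```

Applied to the extract above, this should report `tardigrade.io` and `europe-north-1.tardigrade.io`, both failing against the resolver `192.168.65.1`, which points at name resolution and not at the clock.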

mike · July 22, 2020, 5:46pm

DNS to 192.168.65.1 seems to be non-functional?
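
This hypothesis can be tested without starting the storage node by resolving the satellite hostnames from the log directly. The sketch below assumes Python is available; `check_dns` is a hypothetical helper, and its `resolve` parameter exists only so the function can be exercised without network access. Run on the Mac itself, it exercises the host's resolver; to exercise the resolver that containers see, it would have to be run inside a throwaway container.

```python
import socket

# Hostnames taken from the log extract in the first post.
HOSTS = ["tardigrade.io", "europe-north-1.tardigrade.io"]


def check_dns(hosts, resolve=socket.getaddrinfo):
    """Map each hostname to its sorted resolved addresses, or to a failure message."""
    results = {}
    for host in hosts:
        try:
            infos = resolve(host, 443)
            # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
            results[host] = sorted({info[4][0] for info in infos})
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[host] = f"FAILED: {exc}"
    return results


# Example (needs network access):
# for host, outcome in check_dns(HOSTS).items():
#     print(host, "->", outcome)
```

If the names resolve on the host but not inside a container, that would be consistent with the problem sitting in Docker's internal DNS and not in the Mac's network settings.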

Bob · July 24, 2020, 9:57pm

Thanks for your post @mike; there does seem to be a local DNS server in the logs. I checked my network preferences and there is no reference to DNS 192.168.65.1. My gateway is my router, 192.168.1.1, and I have no local configuration. No idea where it comes from. Docker configuration?

Alexey · July 25, 2020, 6:29am

Looks like it. Try restarting Docker from the Docker Desktop application.

Bob · July 26, 2020, 7:19am

Already tried that too; restarting Docker Desktop and/or rebooting has no effect. I had a 6-hour outage trying to fix this. Moving to another Mac mini saved the day, but this could happen again.
Thanks @Alexey for your proposal.