Node fails to start on Linux

My nodes just faced a similar issue, but on Linux.

My ISP Box lost Internet connectivity for about 20 minutes and restarted for unknown reasons, leaving my nodes without Internet.

I thought they would reconnect automatically, but apparently something abnormal happened, as most of my nodes shut down instead of resuming operation (I noticed this but couldn’t restart them for a few hours, as I was on the road).

Here is an excerpt from one of my nodes’ logs, from the moment it stopped (at 2021-01-15T11:00:43Z) and from when I restarted it:

2021-01-15T11:08:54.769Z	INFO	piecestore	download started	{"Piece ID": "6RSLYXDV4BOAQB2AG6JIMLJKGFEOC35MJESWI4F3FJQU6ODLISMQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}
2021-01-15T11:08:55.715Z	INFO	piecestore	downloaded	{"Piece ID": "ZVU2TPYPZF6323HQIP7S2THOLQ5WRKCNMF3TZLOTBUWTUJEJUW7A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET"}
2021-01-15T11:08:56.664Z	INFO	piecestore	download started	{"Piece ID": "2LTR33IJ5C7J5EOU5UJIYOJIYOII7MKCRMAU6TB2Y3CY3PYPJZDA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}
2021-01-15T11:09:03.324Z	INFO	piecestore	downloaded	{"Piece ID": "6RSLYXDV4BOAQB2AG6JIMLJKGFEOC35MJESWI4F3FJQU6ODLISMQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}
2021-01-15T11:00:23.342Z	INFO	Configuration loaded	{"Location": "/app/config/config.yaml"}
2021-01-15T11:00:23.367Z	INFO	Operator email	{"Address": "[my@email.address]"}
2021-01-15T11:00:23.367Z	INFO	Operator wallet	{"Address": "0x45BB1F8F88B372AD67B37B32639A86870Fa7c3eB"}
2021-01-15T11:00:23.585Z	INFO	Got a signal from the OS: "terminated"
2021-01-15T11:00:43.388Z	ERROR	Invalid configuration.	{"error": "invalid contact.external-address: lookup \"[my.public.hostname]\" failed: lookup [my.public.hostname] on 192.168.1.1:53: read udp 172.17.0.8:43315->192.168.1.1:53: i/o timeout", "errorVerbose": "invalid contact.external-address: lookup \"[my.public.hostname]\" failed: lookup [my.public.hostname] on 192.168.1.1:53: read udp 172.17.0.8:43315->192.168.1.1:53: i/o timeout\n\tstorj.io/storj/storagenode.(*Config).Verify:154\n\tmain.cmdRun:155\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:385\n\truntime.main:204"}
2021-01-15T11:00:43.388Z	FATAL	Unrecoverable error	{"error": "invalid contact.external-address: lookup \"[my.public.hostname]\" failed: lookup [my.public.hostname] on 192.168.1.1:53: read udp 172.17.0.8:43315->192.168.1.1:53: i/o timeout", "errorVerbose": "invalid contact.external-address: lookup \"[my.public.hostname]\" failed: lookup [my.public.hostname] on 192.168.1.1:53: read udp 172.17.0.8:43315->192.168.1.1:53: i/o timeout\n\tstorj.io/storj/storagenode.(*Config).Verify:154\n\tmain.cmdRun:155\n\tstorj.io/private/process.cleanup.func1.4:363\n\tstorj.io/private/process.cleanup.func1:381\n\tgithub.com/spf13/cobra.(*Command).execute:842\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:950\n\tgithub.com/spf13/cobra.(*Command).Execute:887\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.ExecCustomDebug:70\n\tmain.main:385\n\truntime.main:204"}
2021-01-15T14:41:31.057Z	INFO	Configuration loaded	{"Location": "/app/config/config.yaml"}
2021-01-15T14:41:31.061Z	INFO	Operator email	{"Address": "[my@email.address]"}
2021-01-15T14:41:31.061Z	INFO	Operator wallet	{"Address": "0x45BB1F8F88B372AD67B37B32639A86870Fa7c3eB"}
2021-01-15T14:41:37.287Z	INFO	Telemetry enabled	{"instance ID": "12VHJRHeqGmnHsD3bcV2bkpfbv8jwER8NtX7h2wpgED8qDD6oHS"}
2021-01-15T14:41:38.150Z	INFO	db.migration	Database Version	{"version": 46}

(EDIT: Discourse seems to remove invalid UTF-8 characters, but there is a series of ? characters before the 2021-01-15T11:00:23.342Z line.)

I don’t get why an invalid contact.external-address lookup could cause the node to fail. Isn’t it supposed to wait until the address is resolvable again?

That said, it seems the host machine restarted or something before the Internet connection was lost… it’s so weird, I don’t understand what happened…

As the error says, you specified the address [my.public.hostname] in the contact.external-address: option of your config.
Since you use the docker version, you can remove this config.yaml and specify the actual external address (with port) via the -e ADDRESS option instead.
To do so, stop and remove the container:

docker stop -t 300 storagenode
docker rm storagenode

Then run it again with all your parameters, including the changed ones.

Something is wrong with your docker run command. Make sure the restart policy for this container is “unless-stopped” so that the container automatically restarts if it crashes.
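
One way to double-check which policy an existing container was created with (a quick sketch, using the container name from the commands above):

docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' storagenode
# should print: unless-stopped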

Well, I never touched any config.yaml file; I only use parameters passed directly to the docker command.

Here is what I use for one of my nodes:

docker run -d --restart unless-stopped -p XXXX:28967 -p 127.0.0.1:YYYY:14002 \
-e WALLET=0xZZZZ \
-e EMAIL=[my@email.address] \
-e ADDRESS=[my.public.hostname]:XXXX \
-e STORAGE=4TB \
--mount type=bind,source=/[...]/storj/identities/storj_node_4,destination=/app/identity \
--mount type=bind,source=/[...]/storj/mounts/disk_4/storj_node_4,destination=/app/config \
--name storj_node_4 \
--mount type=bind,source=/[...]/storj/mounts/disk_2/storj_logs/storj_node_4,destination=/var/log \
storjlabs/storagenode:latest \
--log.output=/var/log/node.log \
--storage2.max-concurrent-requests=8

@littleskunk Yep, I always use unless-stopped so I’m not sure what’s going on :confused:

This node started correctly but failed again a couple of hours later with this:

2021-01-15T17:44:03.010Z        INFO    piecestore      upload canceled {"Piece ID": "YMCXM26LKP4XMAYS4MT6R2NJBDVLI4WSMA6LMKVBXFG5QK2QJBRA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT_REPAIR"}
2021-01-15T17:44:03.151Z        INFO    piecestore      downloaded      {"Piece ID": "A7HN5PDKXLADWT7I2P6LLMEKZNL3IQK5RJSGWPZBYIRKKZHQARYA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET"}
2021-01-15T17:44:03.509Z        ERROR   servers unexpected shutdown of a runner      {"name": "debug", "error": "debug: http: Server closed", "errorVerbose": "debug: http: Server closed\n\tstorj.io/private/debug.(*Server).Run.func2:108\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-01-15T17:44:05.992Z        FATAL   Unrecoverable error     {"error": "debug: http: Server closed", "errorVerbose": "debug: http: Server closed\n\tstorj.io/private/debug.(*Server).Run.func2:108\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Now it doesn’t start anymore. It always fails, which stops the docker container (and it doesn’t restart).

Latest logs after a failed attempt at starting it again:

2021-01-15T19:23:42.307Z	INFO	piecestore	upload started	{"Piece ID": "HHQNRX7MIHJSOZCXJ37KTVKSKLIUBUW2G7UUJ5YSNFKEBNEGU2PQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT", "Available Space": 3297638093824}
2021-01-15T19:23:42.535Z	INFO	piecestore	uploaded	{"Piece ID": "HHQNRX7MIHJSOZCXJ37KTVKSKLIUBUW2G7UUJ5YSNFKEBNEGU2PQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2021-01-15T19:23:54.497Z	INFO	piecestore	upload started	{"Piece ID": "BZ3BJQO4M5U6OKAU2CFAMDTIEB6YE5YWKNHCTIFP27DK7TP4FFLA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT", "Available Space": 3297638092800}
2021-01-15T19:23:54.743Z	INFO	piecestore	uploaded	{"Piece ID": "BZ3BJQO4M5U6OKAU2CFAMDTIEB6YE5YWKNHCTIFP27DK7TP4FFLA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2021-01-15T19:24:01.841Z	INFO	Got a signal from the OS: "terminated"
2021-01-15T19:24:01.843Z	ERROR	piecestore:cache	error getting current used space: 	{"error": "context canceled; context canceled; context canceled; context canceled; context canceled; context canceled", "errorVerbose": "group:\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled"}
2021-01-15T19:24:02.342Z	ERROR	servers	unexpected shutdown of a runner	{"name": "debug", "error": "debug: http: Server closed", "errorVerbose": "debug: http: Server closed\n\tstorj.io/private/debug.(*Server).Run.func2:108\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-01-15T19:24:03.123Z	FATAL	Unrecoverable error	{"error": "debug: http: Server closed", "errorVerbose": "debug: http: Server closed\n\tstorj.io/private/debug.(*Server).Run.func2:108\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

I have no idea why there is a “terminated” signal from the host.
The RAM usage is way below the maximum, even when this container tries to start.
The most I could see with top was about 400 MB of total usage out of 4 GB.
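
(If it helps, a per-container snapshot of memory usage can also be taken like this, using the container name from my run command:)

docker stats --no-stream storj_node_4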

I checked the following DB files with PRAGMA integrity_check; and the result seems OK:

Checking 'storj_node_4'...
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/bandwidth.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/heldamount.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/info.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/notifications.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/orders.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/piece_expiration.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/pieceinfo.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/piece_spaced_used.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/pricing.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/reputation.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/satellites.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/secret.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/storage_usage.db':
ok
Checking file '/XXXX/storj/mounts/disk_4/storj_node_4/storage/used_serial.db':
ok
Done.
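
(For reference, checks like these can be scripted roughly as follows; a sketch assuming sqlite3 is installed on the host and the node is stopped, with paths as placeholders:)

# run an integrity check on every database file of the node
for db in /XXXX/storj/mounts/disk_4/storj_node_4/storage/*.db; do
    echo "Checking file '$db':"
    sqlite3 "$db" "PRAGMA integrity_check;"
done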

I’m… very confused ^^’

Have you removed the config.yaml file?
Please also check the filesystem on the disk holding the data.

INFO	Got a signal from the OS: "terminated"

This could be a reaction to a docker stop. Maybe your watchtower is trying to update the storagenode?
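
To see what might have stopped the container, the docker event log could help (a sketch; adjust the container name and time window to your setup):

# show stop/kill events for the container from the last 24 hours (Ctrl+C to exit)
docker events --since 24h --filter 'container=storj_node_4' --filter 'event=stop' --filter 'event=kill'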

So!

First of all, my humblest apologies… If the node kept failing, it was my fault: I had an old CRON job that would shut down my nodes if their disk wasn’t available (set up before this check was built into the node software), and today I made some changes to the filesystem which triggered this CRON job.
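
(For context, the job was doing roughly something like this; a simplified, hypothetical sketch rather than the exact script, with placeholder paths and names:)

# ran periodically from cron: stop the node if its data disk is not mounted
if ! mountpoint -q /[...]/storj/mounts/disk_4; then
    docker stop -t 300 storj_node_4
fi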

I now deleted this CRON job and… it works! :partying_face:

@Alexey You were on the right track: some software was shutting down the docker container. Thanks a lot, it helped me diagnose the issue!


As a side note though, @Alexey: before solving the issue, I did try to delete config.yaml (well, I actually renamed it to config.yaml.DELETE, just in case), but starting the node then logged a lot of errors:

2021-01-15T19:42:15.702Z	ERROR	contact:service	ping satellite failed 	{"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "attempts": 7, "error": "ping satellite error: failed to dial storage node (ID: 1XHgoWbk6E2wrmi44x9UWsDyUMdf3Tu4kiAnjJnX7wmR1wLivS) at address [my.public.hostname]:1467: rpc: dial tcp 90.76.151.46:1467: connect: connection refused", "errorVerbose": "ping satellite error: failed to dial storage node (ID: 1XHgoWbk6E2wrmi44x9UWsDyUMdf3Tu4kiAnjJnX7wmR1wLivS) at address [my.public.hostname]:1467: rpc: dial tcp 90.76.151.46:1467: connect: connection refused\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:141\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-01-15T19:42:16.546Z	ERROR	contact:service	ping satellite failed 	{"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "attempts": 7, "error": "ping satellite error: failed to dial storage node (ID: 1XHgoWbk6E2wrmi44x9UWsDyUMdf3Tu4kiAnjJnX7wmR1wLivS) at address [my.public.hostname]:1467: rpc: dial tcp 90.76.151.46:1467: connect: connection refused", "errorVerbose": "ping satellite error: failed to dial storage node (ID: 1XHgoWbk6E2wrmi44x9UWsDyUMdf3Tu4kiAnjJnX7wmR1wLivS) at address [my.public.hostname]:1467: rpc: dial tcp 90.76.151.46:1467: connect: connection refused\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:141\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-01-15T19:42:20.962Z	ERROR	contact:service	ping satellite failed 	{"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "attempts": 7, "error": "ping satellite error: failed to dial storage node (ID: 1XHgoWbk6E2wrmi44x9UWsDyUMdf3Tu4kiAnjJnX7wmR1wLivS) at address [my.public.hostname]:1467: rpc: dial tcp 90.76.151.46:1467: connect: connection refused", "errorVerbose": "ping satellite error: failed to dial storage node (ID: 1XHgoWbk6E2wrmi44x9UWsDyUMdf3Tu4kiAnjJnX7wmR1wLivS) at address [my.public.hostname]:1467: rpc: dial tcp 90.76.151.46:1467: connect: connection refused\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:141\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
[...]

Then, I restored the config.yaml and everything went back to normal.
So apparently deleting config.yaml is not cool ^^’

EDIT: Maybe you weren’t actually suggesting to delete this file; maybe I misunderstood ^^
Anyway, the node’s back to business now :wink:

Yes, you would need to run the container once with -e SETUP=true to re-create a default config.yaml, so simply deleting config.yaml doesn’t work anymore.
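
For example, something along these lines (a sketch reusing the identity and config mounts from the run command above; run it only once, with the old container stopped and removed, then start the node normally):

# one-off setup run: re-creates a default config.yaml in the mounted /app/config directory
docker run --rm -e SETUP=true \
--mount type=bind,source=/[...]/storj/identities/storj_node_4,destination=/app/identity \
--mount type=bind,source=/[...]/storj/mounts/disk_4/storj_node_4,destination=/app/config \
--name storj_node_4 storjlabs/storagenode:latest
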
Thanks!

Aaaah right, that explains what I experienced.
Many thanks :slight_smile:

Just to be sure, I took the time to check the database files of all my nodes (with PRAGMA integrity_check;), and there is one file that did not return OK:

Checking file '/XXXX/storj/mounts/disk_3/storj_node_3/storage/orders.db':
*** in database main ***
On tree page 77 cell 4: Rowid 3088 out of order
row 1 missing from index idx_order_archived_at
row 2 missing from index idx_order_archived_at
row 3 missing from index idx_order_archived_at
row 4 missing from index idx_order_archived_at
[...]
row 96 missing from index idx_order_archived_at
row 97 missing from index idx_order_archived_at
row 98 missing from index idx_order_archived_at
row 99 missing from index idx_order_archived_at

Should I be worried? Does it need fixing?

Yes, it needs to be fixed, otherwise you could get memory leaks and other weird issues, and in the worst case you might not receive the payout for bandwidth because of unsent orders.

I guess I just need to follow the regular guide:
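
(Roughly, the idea there is to dump whatever is still readable and rebuild the database from the dump; a sketch assuming sqlite3 is installed and the node is stopped, with the official guide covering the details:)

# keep a backup and work on the damaged file while the node is stopped
cp orders.db orders.db.bak
sqlite3 orders.db ".dump" > orders_dump.sql    # export the readable contents as SQL
# if the dump ends with a ROLLBACK line due to the corruption, remove that line
mv orders.db orders.db.corrupted
sqlite3 orders.db < orders_dump.sql            # rebuild a fresh database from the dump
sqlite3 orders.db "PRAGMA integrity_check;"    # should now report: ok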

Alright thx @Alexey, I’ll give it a try.

I went through the fixing process and my orders.db file is now OK when running a PRAGMA integrity_check;.

Thanks again @Alexey :+1:

I’m wondering why nodes don’t run an integrity check on their DBs on their own from time to time, and try to fix them. That would be really nice. And if a DB cannot be fixed, an alert could be raised somehow (on the dashboard at least, for instance).
