Node on v1.80.10 is restarting daily

I am running two nodes, one on v1.79.4 and one on v1.80.10. Both have been running without issue for months.

Both nodes use iSCSI over a dedicated network to a TrueNAS server. The node running v1.80.10 is now stopping once a day. I've checked the TrueNAS server, and there are no issues reported: all disks show healthy, and there is nothing in its logs to indicate a problem. Everything is on a UPS. This only started with the recent update to v1.80.10.

Is anyone else having this issue? Again, it only started with the update; before it, I had uptimes in the hundreds of hours.

Here are some log entries. Any advice is appreciated.

{"L":"ERROR","T":"2023-06-19T02:38:57.213-0400","N":"services","M":"unexpected shutdown of a runner","name":"piecestore:monitor","error":"piecestore monitor: timed out after 1m0s while verifying writability of storage directory","errorVerbose":"piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:163\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:155\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

{"L":"WARN","T":"2023-06-19T02:39:12.223-0400","N":"servers","M":"service takes long to shutdown","name":"server"}

{"L":"FATAL","T":"2023-06-19T02:39:21.462-0400","M":"Unrecoverable error","error":"piecestore monitor: timed out after 1m0s while verifying writability of storage directory","errorVerbose":"piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:163\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:155\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
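For context, the check that is timing out simply writes a small file into the storage directory and times it. Here is a rough sketch of the same idea in Go, useful for testing the iSCSI path by hand; this is not the node's actual code, and the directory path is a placeholder:

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Placeholder path: point this at the node's storage directory.
	dir := "./storage"

	start := time.Now()
	f, err := os.CreateTemp(dir, "write-check-*")
	if err != nil {
		fmt.Println("create failed:", err)
		return
	}
	defer os.Remove(f.Name())

	if _, err := f.Write([]byte("test")); err != nil {
		fmt.Println("write failed:", err)
	}
	// Sync forces the data through the OS cache and the iSCSI layer.
	if err := f.Sync(); err != nil {
		fmt.Println("sync failed:", err)
	}
	f.Close()
	fmt.Println("write round-trip took", time.Since(start))
}

If this regularly takes anywhere near a minute, the storage backend, not the node, is the bottleneck.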

Have a look at this thread

Update,
I've been checking everything else, and I noticed that the node's data drive was full. There were temp files outside of the node folder that seem unrelated to Storj; I removed them. Hopefully that was the issue. I'll report back in a few hours.
Thank you all

A network-attached drive and slow writability are nothing unusual; the two go hand in hand.
If the drive is formatted with NTFS, defragmenting it could help (a sample command is below).
If the node still stops, you will need to increase the writability timeout by 30s, as described in the linked topic, then save the config and restart the node (a config excerpt is shown at the end of this reply).
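If the node runs on Windows, one way is to trigger an optimization pass from an elevated prompt; a minimal example, assuming the data volume is D: (adjust the drive letter; /O picks the proper optimization for the media type, /U prints progress):

defrag D: /O /U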

Please note, it could also stop because of a readability timeout (reads over the network can be slow as well), so please read the error to understand which parameter needs to be updated.
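For example, a minimal config.yaml excerpt raising both checks from the default 1m0s to 1m30s (parameter names as commonly used for the storagenode; double-check them against the linked topic before applying):

# config.yaml: raise the storage directory check timeouts
storage2.monitor.verify-dir-writable-timeout: 1m30s
storage2.monitor.verify-dir-readable-timeout: 1m30s

Save the file and restart the node for the change to take effect.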

Update: This is NOT a Storj issue. The iSCSI drive had files outside of the Storj folder, and the drive was full; a separate server was saving logs there without my realizing it. The drive now has 1 TB of free space, and everything seems stable. I will be more thorough in the future before posting. Thank you all for the support anyway.
