Node on v1.80.10 is restarting daily

I am running two nodes, one on v1.79.4 and one on v1.80.10. Both have been running without issue for months.

Both nodes use iSCSI over a dedicated network to a TrueNAS server. The node running v1.80.10 is now stopping once a day. I've checked the TrueNAS server, and there are no issues reported: all disks show healthy, and there is nothing in its logs to indicate a problem. Everything is on a UPS. This only started with the recent update to v1.80.10.

Is anyone else having this issue? Again, it only started with the update; before it, I had uptimes in the hundreds of hours.

Here are some log entries. Any advice is appreciated.

{"L":"ERROR","T":"2023-06-19T02:38:57.213-0400","N":"services","M":"unexpected shutdown of a runner","name":"piecestore:monitor","error":"piecestore monitor: timed out after 1m0s while verifying writability of storage directory","errorVerbose":"piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:163\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:155\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

{"L":"WARN","T":"2023-06-19T02:39:12.223-0400","N":"servers","M":"service takes long to shutdown","name":"server"}

{"L":"FATAL","T":"2023-06-19T02:39:21.462-0400","M":"Unrecoverable error","error":"piecestore monitor: timed out after 1m0s while verifying writability of storage directory","errorVerbose":"piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:163\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:155\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
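For context, the check that is timing out simply writes a small file into the storage directory and times it. Here is a rough sketch of the same idea in Go, useful for testing the iSCSI path by hand; this is not the node's actual code, and the directory path is a placeholder:

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Placeholder path: point this at the node's storage directory.
	dir := "./storage"

	start := time.Now()
	f, err := os.CreateTemp(dir, "write-check-*")
	if err != nil {
		fmt.Println("create failed:", err)
		return
	}
	defer os.Remove(f.Name())

	if _, err := f.Write([]byte("test")); err != nil {
		fmt.Println("write failed:", err)
	}
	// Sync forces the data through the OS cache and the iSCSI layer.
	if err := f.Sync(); err != nil {
		fmt.Println("sync failed:", err)
	}
	f.Close()
	fmt.Println("write round-trip took", time.Since(start))
}

If this regularly takes anywhere near a minute, the storage backend, not the node, is the bottleneck.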

Have a look at this thread

Update,
I've been checking everything else, and I noticed that the node's data drive was full. There were temp files outside of the node folder that seem unrelated to Storj; I removed them. Hopefully that was the issue. I'll report back in a few hours.
Thank you all

A network-attached drive and slow writability are nothing unusual; the two go hand in hand.
If the drive is formatted with NTFS, defragmenting it could help (a sample command is below).
If the node still stops, you will need to increase the writability timeout by 30s, as described in the linked topic, then save the config and restart the node (a config excerpt is shown at the end of this reply).
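If the node runs on Windows, one way is to trigger an optimization pass from an elevated prompt; a minimal example, assuming the data volume is D: (adjust the drive letter; /O picks the proper optimization for the media type, /U prints progress):

defrag D: /O /U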

Please note, it could also stop because of a readability timeout (reads over the network can be slow as well), so please read the error to understand which parameter needs to be updated.
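For example, a minimal config.yaml excerpt raising both checks from the default 1m0s to 1m30s (parameter names as commonly used for the storagenode; double-check them against the linked topic before applying):

# config.yaml: raise the storage directory check timeouts
storage2.monitor.verify-dir-writable-timeout: 1m30s
storage2.monitor.verify-dir-readable-timeout: 1m30s

Save the file and restart the node for the change to take effect.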

Update: This is NOT a Storj issue. The iSCSI drive had files outside of the Storj folder, and the drive was full; a separate server was saving logs there without my realizing it. The drive now has 1 TB of free space, and everything seems stable. I will be more thorough in the future before posting. Thank you all for the support anyway.
