Fatal Error on my Node

Hello,
I've been running a new node from Italy since 12 October 2023.
Everything was going fine until 3-4 days ago.

Now every morning my node shuts down with this fatal error.

2023-11-22T08:14:33+01:00 ERROR piecestore upload failed {"Piece ID": "TMXS24UQQT2LX4G42G5QWC6C7BT2CMBKIBQRXSM6VJHKUCYKNHSQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:226", "Size": 262144, "Remote Address": "5.161.143.41:55470"}
2023-11-22T08:17:24+01:00 ERROR services unexpected shutdown of a runner {"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-11-22T08:17:35+01:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

I've set up the storagenode service to restart automatically, so I'm no longer going offline, because after a restart everything works fine.
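
For reference, one way to set that up with the built-in sc tool (just a sketch; it assumes the service is named storagenode, which is what the Windows installer creates):

sc failure storagenode reset= 86400 actions= restart/60000/restart/60000/restart/60000

That restarts the service 60 seconds after each crash and resets the failure counter every 24 hours.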

I have an IBM MicroServer Gen8 with 4x16TB in RAID 5.
OS: VMware ESXi 7 running only one VM, Windows Server 2022, which hosts Storj.
The node data is on a VMDK disk inside the RAID 5 array.

I'm hosting only 1.19 TB of data.

Any suggestions?
Thanks in advance to all.


Hi all. My node doesn't stay online. I constantly have to restart it. What can I change in the config file to have it stay online? Please help. My satellites are getting suspended.

The error message suggests that your storage directory is not writable; therefore the node stops working. Where is your storage directory located? Is it a local hard drive, connected via USB, or something else?
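
While you're at it, a quick manual check (a sketch; Z:\storagenode is a hypothetical data path, substitute your own) is to write and delete a test file in the storage directory while the problem is happening:

echo test > Z:\storagenode\writetest.txt
del Z:\storagenode\writetest.txt

If that hangs or errors, the problem is in the disk or filesystem layer, not in the node itself.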

Easy, read here

@Alexey I like to move it move-it :joy:

And here…

Hi all,

I have a ~7 TB node that's been running without issue, but it recently started to have the following entries in the log:

2023-12-19T07:44:40-05:00	ERROR	services	unexpected shutdown of a runner	{"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-12-19T07:44:40-05:00	ERROR	gracefulexit:chore	error retrieving satellites.	{"error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:192\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:58\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-12-19T07:45:07-05:00	ERROR	piecestore	error sending hash and order limit	{"error": "context canceled"}
2023-12-19T07:45:09-05:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

2023-12-19T08:27:11-05:00	ERROR	services	unexpected shutdown of a runner	{"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-12-19T08:27:59-05:00	ERROR	piecestore	error sending hash and order limit	{"error": "context canceled"}
2023-12-19T08:27:59-05:00	ERROR	piecestore	error sending hash and order limit	{"error": "context canceled"}
2023-12-19T08:28:01-05:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

It seems to keep working afterwards; one time I needed to restart my machine to fix it. Not sure what could be causing this?

P.S.
Here is the CrystalDiskInfo output for the drive:

If you still have free space… did the drive go read-only? Your Windows Event Log may show something.

It's a Windows 10 machine. Where/which log should I check? And yeah, there's still enough free space on it:
[screenshot: drive free space]

I'd open Event Viewer and look in Custom Views → Administrative Events, and Windows Logs → System.

Those logs are often full of crap that you don't really have to worry about, but maybe the System one will show errors related to your disk?
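
If you want to cut through that noise, here is a PowerShell sketch that pulls recent error-level System events and filters for common storage-related providers (the provider list is an assumption; yours may differ):

Get-WinEvent -FilterHashtable @{ LogName='System'; Level=2 } -MaxEvents 200 |
    Where-Object { $_.ProviderName -match 'disk|Ntfs|volmgr|storahci' } |
    Format-Table TimeCreated, ProviderName, Id, Message -AutoSize

Level 2 is the Error severity.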

Honestly I didn't see anything there that would indicate the disk being involved…

What does the disk IO look like in Resource Monitor? What is the disk response latency when this happens?

Looks like your disk can't keep up when IO slightly increases. Look into PrimoCache.

Also run chkdsk to verify filesystem integrity.

Which flags should I run chkdsk with? It's a standalone drive used just for the node, not the OS drive.

The IO load seems to be fine?

It's related to a slow disk subsystem or to fragmentation.

Did you ever defrag the node?
Also various solutions here:
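
Before committing to a long defrag run, an analysis-only pass is cheap (a sketch; z: is a hypothetical drive letter, substitute your node's drive):

defrag z: /A /V

/A only analyzes without defragmenting, and /V prints a verbose fragmentation report.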

Looks like you're defragging the D drive at the moment. :stuck_out_tongue_winking_eye:

Looks fine right now. But you are not seeing issues right now, right? You may want to monitor IOPS and latency on the disk long term (e.g. with Performance Monitor), and when the issue reproduces, review what the load was at the time it occurred.
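
One way to log that long term from the command line is the built-in typeperf tool; a sketch (the 5-second interval and the output file name are arbitrary choices):

typeperf "\PhysicalDisk(_Total)\Disk Transfers/sec" "\PhysicalDisk(_Total)\Avg. Disk sec/Transfer" -si 5 -o disk-load.csv

Disk Transfers/sec approximates IOPS, and Avg. Disk sec/Transfer is the per-operation latency; sustained high latency around the time of the failures would confirm the disk is the bottleneck.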

Because your disk can service only a very small number of IOPS (around 200), it will be absolutely fine under that limit and faceplant when the load goes over it. Storagenode load is variable.

You may want to move the databases to a separate drive, ideally an SSD, if you haven't done so yet. This will remove a massive amount of IOPS.
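
For reference, this is an option in the node's config.yaml; a sketch (I believe the key is storage2.database-dir, and D:\storj-db is a hypothetical SSD path):

storage2.database-dir: D:\storj-db

Stop the node, copy the existing .db files to the new location, set the option, then start the node again.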

Stop the node and run:

chkdsk z: /f
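
/f fixes filesystem errors. If you also want to scan for bad sectors (much slower on a large drive), the standard variant is:

chkdsk z: /f /r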

Defragmentation produces a very small improvement, if any, and is not worth wasting time on.

Any proof of that? And why do some people here recommend it?
NTFS volumes with nodes get messed up over time, pushing seek times up when walking the millions of files.

I am defragging a different drive at the moment, but they are completely independent. And my CPU usage is below 25%.

The databases are on a different SSD, so the drives only deal with the data itself. I'll run the chkdsk command now though.

OK, if it comes back without errors, maybe consider:

- Defragmentation (or just run an analysis first to see how bad it is)
- Increasing the timeouts (I had to do that too at around 7 TB); see the config sketch below
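
On the timeouts: these live in the node's config.yaml. A sketch (I believe these are the relevant keys; the 1m30s values are only examples, the defaults are 1m0s):

# give the storage-directory writability/readability checks more time
storage2.monitor.verify-dir-writable-timeout: 1m30s
storage2.monitor.verify-dir-readable-timeout: 1m30s

Restart the node after editing the config. Note that raising the timeout only hides slowness; it does not fix the underlying disk bottleneck.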