Fatal Error on my Node

Hello,
I've been running a new node from Italy since 12 October 2023.
Everything was going fine until 3-4 days ago.

Now every morning my node shuts down with this fatal error.

2023-11-22T08:14:33+01:00 ERROR piecestore upload failed {"Piece ID": "TMXS24UQQT2LX4G42G5QWC6C7BT2CMBKIBQRXSM6VJHKUCYKNHSQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:226", "Size": 262144, "Remote Address": "5.161.143.41:55470"}
2023-11-22T08:17:24+01:00 ERROR services unexpected shutdown of a runner {"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-11-22T08:17:35+01:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

I've set up the storagenode service to restart automatically, so I'm no longer going offline, because after a restart everything works fine.
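
For reference, one way to set that up with the built-in sc tool (just a sketch; it assumes the service is named storagenode, which is what the Windows installer creates):

sc failure storagenode reset= 86400 actions= restart/60000/restart/60000/restart/60000

That restarts the service 60 seconds after each crash and resets the failure counter every 24 hours.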

I have an IBM MicroServer Gen8 with 4x16TB in RAID 5.
OS: VMware ESXi 7 running only one VM, Windows Server 2022, which hosts Storj.
The node data is on a VMDK disk inside the RAID 5 array.

I'm hosting only 1.19 TB of data.

Any suggestions?
Thanks in advance to all.


Hi all. My node doesn't stay online. I constantly have to restart it. What can I change in the config file to have it stay online? Please help. My satellites are getting suspended.

The error message suggests that your storage directory is not writable; therefore the node stops working. Where is your storage directory located? Is it a local hard drive, connected via USB, or something else?
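
While you're at it, a quick manual check (a sketch; Z:\storagenode is a hypothetical data path, substitute your own) is to write and delete a test file in the storage directory while the problem is happening:

echo test > Z:\storagenode\writetest.txt
del Z:\storagenode\writetest.txt

If that hangs or errors, the problem is in the disk or filesystem layer, not in the node itself.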

Easy, read here

@Alexey I like to move it move-it :joy:

And here…

Hi all,

I have a ~7 TB node that's been running without issue, but it recently started to have the following entries in the log:

2023-12-19T07:44:40-05:00	ERROR	services	unexpected shutdown of a runner	{"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-12-19T07:44:40-05:00	ERROR	gracefulexit:chore	error retrieving satellites.	{"error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:192\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:58\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-12-19T07:45:07-05:00	ERROR	piecestore	error sending hash and order limit	{"error": "context canceled"}
2023-12-19T07:45:09-05:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

2023-12-19T08:27:11-05:00	ERROR	services	unexpected shutdown of a runner	{"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-12-19T08:27:59-05:00	ERROR	piecestore	error sending hash and order limit	{"error": "context canceled"}
2023-12-19T08:27:59-05:00	ERROR	piecestore	error sending hash and order limit	{"error": "context canceled"}
2023-12-19T08:28:01-05:00	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

It seems to keep working afterwards; one time I needed to restart my machine to fix it. Not sure what could be causing this?

P.S.
Here is the CrystalDiskInfo output for the drive:

If you still have free space… did the drive go read-only? Your Windows Event Log may show something.

It's a Windows 10 machine. Where/which log should I check? And yeah, there's still enough free space on it:
[screenshot: drive free space]

I'd open Event Viewer and look in Custom Views → Administrative Events, and Windows Logs → System.

Those logs are often full of crap that you don't really have to worry about, but maybe the System one will show errors related to your disk?
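
If you want to cut through that noise, here is a PowerShell sketch that pulls recent error-level System events and filters for common storage-related providers (the provider list is an assumption; yours may differ):

Get-WinEvent -FilterHashtable @{ LogName='System'; Level=2 } -MaxEvents 200 |
    Where-Object { $_.ProviderName -match 'disk|Ntfs|volmgr|storahci' } |
    Format-Table TimeCreated, ProviderName, Id, Message -AutoSize

Level 2 is the Error severity.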

Honestly I didn't see anything there that would indicate the disk being involved…

What does the disk IO look like in Resource Monitor? What is the disk response latency when this happens?

Looks like your disk can't keep up when IO slightly increases. Look into PrimoCache.

Also run chkdsk to verify filesystem integrity.

Which flags should I run chkdsk with? It's a standalone drive used just for the node, not the OS drive.

The IO load seems to be fine?

It's related to a slow disk subsystem or to fragmentation.

Did you ever defrag the node?
Also various solutions here:
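
Before committing to a long defrag run, an analysis-only pass is cheap (a sketch; z: is a hypothetical drive letter, substitute your node's drive):

defrag z: /A /V

/A only analyzes without defragmenting, and /V prints a verbose fragmentation report.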

Looks like you're defragging the D drive at the moment. :stuck_out_tongue_winking_eye:

Looks fine right now. But you are not seeing issues right now, right? You may want to monitor IOPS and latency on the disk long term (e.g. with Performance Monitor), and when the issue reproduces, review what the load was at the time it occurred.
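
One way to log that long term from the command line is the built-in typeperf tool; a sketch (the 5-second interval and the output file name are arbitrary choices):

typeperf "\PhysicalDisk(_Total)\Disk Transfers/sec" "\PhysicalDisk(_Total)\Avg. Disk sec/Transfer" -si 5 -o disk-load.csv

Disk Transfers/sec approximates IOPS, and Avg. Disk sec/Transfer is the per-operation latency; sustained high latency around the time of the failures would confirm the disk is the bottleneck.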

Because your disk can service only a very small number of IOPS (around 200), it will be absolutely fine under that limit and faceplant when the load goes over it. Storagenode load is variable.

You may want to move the databases to a separate drive, ideally an SSD, if you haven't done so yet. This will remove a massive amount of IOPS.
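
For reference, this is an option in the node's config.yaml; a sketch (I believe the key is storage2.database-dir, and D:\storj-db is a hypothetical SSD path):

storage2.database-dir: D:\storj-db

Stop the node, copy the existing .db files to the new location, set the option, then start the node again.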

Stop the node and run:

chkdsk z: /f
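
/f fixes filesystem errors. If you also want to scan for bad sectors (much slower on a large drive), the standard variant is:

chkdsk z: /f /r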

Defragmentation produces a very small improvement, if any, and is not worth wasting time on.

Any proof of that? And why do some people here recommend it?
NTFS volumes with nodes get messed up over time, pushing seek times up when walking the millions of files.

I am defragging a different drive at the moment, but they are completely independent. And my CPU usage is below 25%.

The databases are on a different SSD, so the drives only deal with the data itself. I'll run the chkdsk command now though.

OK, if it comes back without errors, maybe consider:

- Defragmentation (or just run an analysis first to see how bad it is)
- Increasing the timeouts (I had to do that too at around 7 TB); see the config sketch below
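
On the timeouts: these live in the node's config.yaml. A sketch (I believe these are the relevant keys; the 1m30s values are only examples, the defaults are 1m0s):

# give the storage-directory writability/readability checks more time
storage2.monitor.verify-dir-writable-timeout: 1m30s
storage2.monitor.verify-dir-readable-timeout: 1m30s

Restart the node after editing the config. Note that raising the timeout only hides slowness; it does not fix the underlying disk bottleneck.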