Fatal Error on my Node

It depends on the error and on which timeout you have changed.
Please post your error.

But basically, you need to explicitly save the config file (menu File - Save), then restart the storagenode service either from the Services applet or from an elevated PowerShell:

Restart-Service storagenode

And uncomment that line by deleting the # sign and the space from the beginning.

But if I uncomment it, the logs say something like an unknown configuration file key for y at line x.

You should uncomment only the option, not its description. The option should not have leading spaces.
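
As an illustration (using the writability-timeout option discussed further down this thread; treat the key and comment as a sketch and check your own config.yaml), the description line above stays commented, and the option itself starts at the first column with no leading spaces:

# how long to wait before the writability check times out
storage2.monitor.verify-dir-writable-timeout: 1m30s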

I have had this issue for a couple of weeks.

I thought it was something I had done. But the only change I have made to the box is getting rid of software from the server.

The Storj server is hosted on an ESXi VM running Server 2019 in a PowerEdge T430 on a RAID 5 array with SAS drives. I am hosting 4TB of data, and have been for over a year in this VM.

The software removed was BlueIris (NVR software), and I expected that to improve performance, not hurt it. My system is going offline daily between 3 and 4, with the timeout error.

Short of being bare metal, I can't imagine the hardware could get much better than this.

Hello @BartLanz,
Welcome to the forum!

The latency of the disk subsystem plays a role: if the disk cannot read or write a file even after one minute, that is not normal.
Perhaps you use NFS or SMB/CIFS to connect the disk to the VM; these have high latency and do not work well with storagenode, so it is better to attach the disk directly to the VM without any network filesystem.
RAID 5 is known to be slow on writes (as slow as the slowest disk in the array, plus RAID overhead), so it adds even more latency.
You likely have writeability timeouts. In that case, increase the writeable timeout by 30 seconds first, save the config, and restart the node. If that does not help, increase it further. If you have readability timeouts, increase the readable timeout and the check interval instead, as sketched below.
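
A minimal config.yaml sketch of those options (assuming the storage2.monitor key names; verify the exact keys and current defaults in your own config.yaml):

# writability check: raise the timeout from the 1m0s default in 30-second steps
storage2.monitor.verify-dir-writable-timeout: 1m30s

# readability check: raise the timeout and the check interval together if needed
# storage2.monitor.verify-dir-readable-timeout: 1m30s
# storage2.monitor.verify-dir-readable-interval: 1m30s

After editing, save the file and restart the service (Restart-Service storagenode) so the new values take effect.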

Running 1.77.2 and am suddenly getting the following error every few hours, causing the node to crash:

2023-05-01T06:09:47.337-0400 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

Hi Bney,

I would first check whether you have a drive issue: make sure you can write files to the device and that everything is behaving as it should. If that all looks fine, I would increase the timeout thresholds as explained above to see if that helps.
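
A quick manual check from PowerShell (the path below is only an example; use your node's actual storage directory):

# write, read back, and remove a small test file in the storage directory
Set-Content -Path "D:\storagenode\storage\write-test.txt" -Value "test"
Get-Content -Path "D:\storagenode\storage\write-test.txt"
Remove-Item -Path "D:\storagenode\storage\write-test.txt"

If any of these commands hangs or fails, the drive or filesystem needs attention before touching the timeouts.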

So, an update on my errors. Since removing the on-start filewalker (my drives are Storj-exclusive), my nodes have been mostly stable. I've still had one node crash a couple of times, but that was an out-of-memory condition, which is not surprising since I'm running three nodes on a measly Raspberry Pi 3 with only 1 GB of RAM. I'm curious to see if the 1.78 lazy-filewalker ability will help me out and allow me to re-enable the on-start filewalker.
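
For anyone who wants to try the same, a hedged config.yaml sketch (assuming these option names; the piece-scan option is mentioned later in this thread, and the lazy-filewalker key may differ between versions):

# skip the space-usage scan when the node starts
storage2.piece-scan-on-startup: false

# run the filewalker as a low-priority subprocess (1.78+)
pieces.enable-lazy-filewalker: true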

I got the same issue on my 3 nodes “piecestore monitor: timed out after 1m0s while verifying writability of storage directory”.

I also noticed that I can't stop my nodes by stopping the Windows service "Storj V3 Storage Node"; I need to stop storagenode.exe in Task Manager.

Edit: Ah, I see the issue was mentioned in another thread, I'll wait for the next update 🙂
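
When the service hangs like that, a PowerShell equivalent of the Task Manager approach is roughly (assuming the default process and service names):

# kill the hung storagenode.exe, then bring the service back up
Stop-Process -Name storagenode -Force
Start-Service storagenode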

For a slow disk you may increase the writeable timeout a little and restart the node. Of course, because of the "Storagenode 1.77.2 won't stop" issue, you can restart only after a crash.

That did not help; I was hit with the timeout error once again after 300+ hours with no problems.
As soon as ingress resumes at more than 30 GB/day, it magically occurs.

This means that your node doesn't work fast enough. Maybe try defragmentation?

Defragmentation was running for 5 days in the background; I canceled it, but it had defragmented most of the data.
Now I will turn free space back on and wait, with the timeout set to "2m10s" and piece scan on startup set to false.

I activated ingress again and restarted the node, and within some hours it happened again.

Since I'm on v1.76.2 it shows something different:

- the node process was using unusually high CPU, 10x the normal amount
- the dashboard was taking unusually long to load; I restarted the node manually and then the dashboard was normal
- I got a 95% suspension score too
- still no audit errors
- Uptime Robot did not detect any downtime (maybe because of the automatic restart)
- the online score was affected shortly after
- the timeout was 1m in the logs (maybe I got the line wrong; I could not find a # line in the .yaml for that)

So defragmentation did not help with the error, but it did help the general performance of the disk.

I have Chia files on the same drive with no problems with them, but Storj constantly crashes. How can you even imagine that the hard drive will not respond for more than a minute?
In Resource Monitor I see a maximum response of 100 ms, yet in the logs it is more than 1 minute when it crashes? A bug in the Storj software.

Please try to stop Chia and check whether the storagenode will crash or not.

chia off
wsearch off
defrag is off

The node crashes every 2-10 hours. There are no problems with the disk (Victoria 3.7 test) and no errors in any system logs. Chia farming requires the disk to respond within 5 seconds, and everything works there. Your software constantly crashes. If the disk does not respond for 1 minute, then the disk is dead; if it is alive, then the software does not work correctly.

I still suspect there is something odd inside the code.
It shows up on Windows first, since Docker restarts the container automatically by default, covering it up.
Full nodes are not affected, so everybody thinks it's fine.

Please run a defragmentation and do not disable scheduled defragmentation for the data location.
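
A manual run from PowerShell might look like this (the drive letter is just an example; point it at the drive that holds the node's data):

# defragment the data drive and confirm the built-in defrag task is still enabled
Optimize-Volume -DriveLetter D -Defrag -Verbose
Get-ScheduledTask -TaskName "ScheduledDefrag"
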
See also

The problem with unreliable readability/writeability existed before; the checkers just did not have a timeout, so they were not effective against partial hangs, and some such nodes were disqualified in the past.
Now the timeout is in place and crashes the node when it is in such a half-hanging state.

However, you may effectively disable it and make it behave as before: just increase the timeout to some big number, like a week.
But I think it's better to know that there is a problem with the disk subsystem than to close your eyes to it.