Fatal Error on my Node / Timeout after 1 min

Hogan1337 · March 30, 2023, 5:23am

Sorry for the misleading writing. I have 3 nodes with 3 disks; every node has one disk. But it is the same disk.

Hogan1337 · March 30, 2023, 5:30am

I can't find any issues with the disk. But I ordered a new one to replace it and check whether it is the problem.

Hogan1337 · March 30, 2023, 5:31am

I want to try it as well. How do I downgrade it?

daki82 · March 30, 2023, 5:54am

Dear Alexey, does it make sense to increase the version number this much for one feature
(whose 5 min does not match the 1m0s in the logged error)?

Also, why should a corrupted disk cause the node software to crash while the node otherwise works normally (still no audit failures), with only the suspension and online scores affected by the restarts?
I think escalating this to the devs is the right way.

snorkel · March 30, 2023, 6:56am

Stop and remove the node, then update manually. It will most probably get version 1.74, not the 1.75 that watchtower got.

digitalfrank · March 30, 2023, 7:22am

Same for me: the service stops and I have to restart it on Windows. Please check.

pcresumen · March 30, 2023, 7:23am

Today all my remaining nodes have been updated to version 1.75.2 and, for the moment, they have not given me any errors. Only one has been giving these problems, but for now, with the solution that I explained, it continues to work. The hard drive is now at 10-20% usage and shows no sign of giving the error.

Before you try to downgrade, try disabling file checking in config.yaml. Most likely your hard disk is at 100% usage after each restart until the node gives the error.

Look for the variable “storage2.piece-scan-on-startup”, uncomment it, and set it to “false”. It should look like this:

storage2.piece-scan-on-startup: false

Let’s see if with that the hard drive doesn’t go to 100% and doesn’t give you the error.

On April 1, if it hasn’t given any errors, I’ll put the correct capacity back to the node and I’ll tell you if it went well.

Alexey · March 30, 2023, 7:42am

There are other changes as well. I only said that this is the feature which now has a timeout on directory verification.
You may increase the timeout, but please be careful: do not set it to more than 5 minutes, otherwise you risk starting to fail audits.
The relevant parameters are shown in the help output:

PS C:\Users\user> & 'C:\Program Files\Storj\Storage Node\storagenode.exe' setup --help | sls verify
      --storage2.monitor.verify-dir-readable-interval duration   how frequently to verify the location and readability of the storage directory (default 1m0s)
      --storage2.monitor.verify-dir-readable-timeout duration    how long to wait for a storage directory readability verification to complete (default 1m0s)
      --storage2.monitor.verify-dir-writable-interval duration   how frequently to verify writability of storage directory (default 5m0s)
      --storage2.monitor.verify-dir-writable-timeout duration    how long to wait for a storage directory writability verification to complete (default 1m0s)

daki82 · March 30, 2023, 12:00pm

Sorry, I misunderstood that. English is not my first language.

The errors may have started around March 22 to 23, 2023.

Since the node software now gets restarted automatically, it's a bit quieter.

I noticed that the TeamViewer connection from my other PC gets stuck if I leave it minimized.
There are no errors, and reconnecting works every time.
Maybe something in the UDP networking gets stuck sometimes, causing the node to hang for a bit more than one minute? I only noticed it once I started checking the node more often because of the unexpected shutdowns.

Chucky90 · March 30, 2023, 4:48pm

I have exactly the same problem.

Node runs on Windows 10
WD 8TB not yet full

Since the update to 1.75.2, the Node keeps crashing randomly with the same error message.

Even chkdsk D: /f has not detected any problems on the hard drive.

Deleting the trust-cache file did not help either.

Dave-Baldwin · March 30, 2023, 5:32pm

You will probably be instructed by someone here (you will see it if you scroll up) to do a chkdsk /B in order to check for bad sectors. Just know that will take many hours. I started it last night and canceled it, ETA on my 5 TB drive (almost full) was 65+ hours. So you will have to judge whether your online scores can tolerate such a very long outage if your scores have already been impacted by frequent node crashes.

I am also planning to extend the readability/writability timeouts to 4m30s to see if this resolves the frequent crashes. Also be aware that you may need to manually add a new parameter to your config.yaml in order to do this (see @Alexey's post above: Fatal Error on my Node - #72 by Alexey). That parameter was not in my config.yaml file (apparently the node updater grabs new .exe files but does not intelligently add newly available configuration parameters to the bottom of the config.yaml).
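
A minimal sketch of what those manually added lines might look like. It assumes the config.yaml keys are the flag names from the `setup --help` output earlier in the thread without the leading `--` (the same form as storage2.piece-scan-on-startup), and it uses the 4m30s value mentioned in this post, which stays under the 5-minute limit Alexey warns about:

```yaml
# Assumed config.yaml entries; key names are taken from the --help output
# quoted earlier in the thread, with the leading "--" dropped.
# 4m30s is the value proposed in this post; the default for both is 1m0s.
storage2.monitor.verify-dir-readable-timeout: 4m30s
storage2.monitor.verify-dir-writable-timeout: 4m30s
```

The storagenode service would need to be restarted for a config.yaml change to take effect.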

sland99 · March 30, 2023, 6:21pm

I have the same problem

3-30T19:03:57.543+0300 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1.1:137\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1:133\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

Node runs on Windows 10, GUI.
After starting, the node runs for a couple of minutes and the CPU load rises to 100%. Then the node stops and writes the error to the log.

BrandonE · March 30, 2023, 7:05pm

I feel like I've got this problem as well. My node went down twice out of the blue. The first time I restarted the PC. The second time I used an admin PowerShell to stop storagenode and start it back up, and it runs normally again. Would anyone mind helping me figure out where to find the logs? I installed it with the defaults, and it's a Windows 11 PC. Thanks in advance.

snorkel · March 30, 2023, 8:07pm

In Windows, the logs for SN are in the installation folder, next to storagenode.exe. Probably C:\Program Files\Storj… or something. I can’t remember exactly.

snorkel · March 30, 2023, 8:12pm

Please, all of you who had crashes/restarts, specify the exact setup: OS, Docker or not, hardware, RAM, type of HDD (CMR/SMR), type of port for the HDD, CPU, storagenode version, TCP Fast Open enabled or not, log.level.
It seems only Windows nodes have problems; maybe we find another common denominator.

daki82 · March 30, 2023, 8:50pm

As written before:
It is Windows and the new node update, but not all nodes are affected; the starting point was around March 22-24, 2023.
The node stats mostly impacted are suspension and online, NOT audit (still 100%).
The drive behaves normally and scans show no errors; the hardware has no similarities; there are no surprising Windows error logs.
After the node software is restarted (or restarted automatically by Windows after 1 min, if configured), even within 3 min, it works fine for hours (~23 h). 1-2 crashes a day; the log files are not too big (a few MB).

daki82 · March 30, 2023, 8:52pm

It's a text file named storagenode.txt in the folder mentioned.
If it's too big to open, stop the node, delete the file, and restart; the node then creates a new one that you can copy and read.

litvinovov · March 31, 2023, 3:32am

Hello!

I got the same error after updating to v1.75.2; the service stopped 2-3 times a day.
I rolled back to v1.74.1 and turned off the update service. The error disappeared, the node has been working for more than a day without failures, and the suspension score has returned to 100%.

Configuration:
Windows 10
HDD 8TB Seagate CMR, checked, SMART OK
RAM 4 GB
CPU Celeron J1800
Only used for Storj

Prior to this error, the node was working properly.
I have a few more nodes that have updated to v1.75.2 and they are doing well. Only one node started to fail.
For the time being, I will leave this node on v1.74.1 and observe the situation.

digitalfrank · March 31, 2023, 4:19am

As already written a few posts ago, I have the same error as the others, “Fatal error…”: the node stops and I have to repair the service. As suggested, here is my configuration:
Node on a Windows 10 VM on ESXi 6.5U2
It’s a node on an HP ProLiant 380 G8 server, with the data HDD attached to the VM as RAID 0. The disk is an HGST HUS724040ALS640 SAS, so definitely CMR
VM with 4 cores at 2.5 GHz and 4 GB RAM
10 Gbit LAN and a 2.5 Gbit down / 1 Gbit up fiber internet connection

Sorry, I should add that this error occurs on 4 of my 20 nodes, and the SMART data of all nodes is OK.

andy777 · March 31, 2023, 6:54am

I’ve got the same issue. It began after updating to 1.75.2.
Happens every few hours.
Windows 8 Pro
i7 6th gen
16GB
18TB Seagate ST18000NM000J-2TV103 internal
1Gbit
chkdsk found no errors
This node has been up since the beginning of the Storj project.
I have another 2 Windows nodes and 1 Docker node running 1.75.2; so far no issues.