All nodes crash at the same time

batelis · September 7, 2024, 5:38pm

Hey, so I set up a few nodes on a new PC and I see this happening every day on all 3 nodes. 2 disks are used and 1 is brand new, and all of them crash at the same time,
all of them with the same fatal error:

2024-09-04T19:57:56+03:00	ERROR	failure during run	{error: piecestore monitor: timed out after 1m0s while verifying readability of storage directory, errorVerbose: piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(Service).Run.func1.1:153\n\tstorj.io/common/sync2.(Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(Service).Run.func1:140\n\tgolang.org/x/sync/errgroup.(Group).Go.func1:78}
2024-09-04T19:57:56+03:00	FATAL	Unrecoverable error	{error: piecestore monitor: timed out after 1m0s while verifying readability of storage directory, errorVerbose: piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(Service).Run.func1.1:153\n\tstorj.io/common/sync2.(Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(Service).Run.func1:140\n\tgolang.org/x/sync/errgroup.(Group).Go.func1:78}
I know Alexey will say the disks are too slow, but there is nothing else running on them. They sit at almost 0% all day after they finish the filewalkers.
This is Windows, btw.

Knowledge · September 7, 2024, 5:51pm

Well, it says it timed out waiting for a response back from the drive. So, the drive didn’t return the data in one minute of time. Perhaps your drive is going to sleep at that time due to some power setting? Or your drive is indeed slow.
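To illustrate what a readability check with a deadline means, here is a minimal Python sketch. It is not the storagenode's actual code: the function name `directory_readable_within` is hypothetical, and listing the directory is only an assumed stand-in for whatever the node really reads. The point is that a stalled disk makes the read hang, and the check fails once the deadline passes.

```python
import os
import threading


def directory_readable_within(path, timeout_s):
    """Return True if listing `path` finishes within `timeout_s` seconds.

    Hypothetical helper for illustration only, not the storagenode check.
    """
    result = {}

    def probe():
        try:
            os.listdir(path)  # blocks for as long as the disk does not answer
            result["ok"] = True
        except OSError:
            result["ok"] = False  # missing directory, permission error, I/O error

    # Daemon thread, so a read that never returns cannot keep the process alive.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # If the probe is still stuck, `result` is empty and the check fails.
    return result.get("ok", False)
```

Calling `directory_readable_within(r"D:\storage", 60)` (the path is just an example) would mirror the one-minute limit from the error message.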

Stez · September 7, 2024, 5:53pm

Something in your pc (assuming the disks are internal) is a bottleneck. You could check Process Monitor (from MS) to check the I/O and/or CPU/RAM usage. That might give you some clues.

batelis · September 7, 2024, 5:53pm

But it happens on all nodes at the same time, not with something like a 10-minute difference: same minute, same second.
Sleep settings are set to never sleep, too.
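One way to confirm the "same second" observation is to compare the FATAL timestamps across the node logs. The sketch below assumes tab-separated log lines starting with an ISO 8601 timestamp followed by the level, as in the lines quoted above; the function names and the one-second tolerance are hypothetical choices.

```python
from datetime import datetime


def fatal_times(log_text):
    """Collect the timestamps of all FATAL lines in one node's log text."""
    times = []
    for line in log_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1] == "FATAL":
            times.append(datetime.fromisoformat(fields[0]))
    return times


def crashed_together(logs_by_node, tolerance_s=1.0):
    """Return the FATAL times of the first node that every other node
    also matches within `tolerance_s` seconds."""
    per_node = {name: fatal_times(text) for name, text in logs_by_node.items()}
    if not per_node:
        return []
    names = list(per_node)
    first, others = names[0], names[1:]
    shared = []
    for t in per_node[first]:
        if all(
            any(abs((t - u).total_seconds()) <= tolerance_s for u in per_node[name])
            for name in others
        ):
            shared.append(t)
    return shared
```

If every crash shows up in the result, the cause is more likely something the nodes share (OS, controller, power) than three disks being slow independently.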

Julio · September 7, 2024, 6:27pm

There are only two devices that can freeze/seize up Windows: the GPU driver and the HDD. In case of the latter, you may not necessarily notice that your entire disk subsystem is frozen. As long as there is RAM, the kernel will continue to service requests and the system will appear fine, but any disk access will freeze until the bus flushes/clears/reconnects.

#1. Check your OS drive for errors;
#2. Check your three remaining drives, any one drive can freeze a system.
#3. Check whether any of those are dynamic .vhdx, and whether you have previously expanded any volumes on them.

2 cents

batelis · September 7, 2024, 6:29pm

I did have a few bluescreens before, that might be it, and my taskbar doesn’t work right now as well…

Julio · September 7, 2024, 6:31pm

Blue screen: it would be nice to know the specific error. Nevertheless, run DISM, then an SFC scan.

batelis · September 7, 2024, 6:32pm

It was Kernel-Power, I think. Might be RAM, might be PSU, but I only got 1 bluescreen, I think.

Julio · September 7, 2024, 6:32pm

Open Task Manager, then restart the explorer.exe task.

Julio · September 7, 2024, 6:34pm

Do DISM online repair first, then SFC /scannow after that, on your OS drive.

GL

batelis · September 7, 2024, 6:35pm

Ran chkdsk and it didn’t find anything.

batelis · September 7, 2024, 6:38pm

With scannow it did find corrupt files, let’s see if that works. Thank you.

Julio · September 7, 2024, 6:45pm

If you don’t do the DISM repair first, SFC may not solve your problems even though it finds errors. If that happens, go back and do the DISM online repair, then SFC again. DISM repairs the repository that SFC uses to do its repairs.

But sounds good enough for now…
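The order described above can be sketched as a small Python wrapper. The two commands are the standard `DISM /Online /Cleanup-Image /RestoreHealth` and `sfc /scannow`; the wrapper itself (`REPAIR_STEPS`, `run_repairs`, the injectable `run` argument) is a hypothetical illustration, and it has to be started from an elevated prompt on Windows to do anything useful.

```python
import subprocess

# DISM goes first because it repairs the component store that SFC restores from.
REPAIR_STEPS = [
    ["DISM", "/Online", "/Cleanup-Image", "/RestoreHealth"],
    ["sfc", "/scannow"],
]


def run_repairs(run=subprocess.run):
    """Run the steps in order and return the names of the tools that ran.

    Stops after the first step that exits with a non-zero code, so SFC is
    not started if the DISM repair failed.
    """
    attempted = []
    for command in REPAIR_STEPS:
        result = run(command)
        attempted.append(command[0])
        if result.returncode != 0:
            break
    return attempted
```

Passing a fake `run` function makes it possible to dry-run the sequence without touching the system.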

batelis · September 7, 2024, 6:59pm

The DISM check didn’t find any errors.

daki82 · September 8, 2024, 8:24pm

You may find more information here,

My tip is higher timeouts and defragmentation of the MFT.
Temporarily(?) set the log level to fatal.

Oh, and maybe update the BIOS, network and mainboard drivers.

JWvdV · September 8, 2024, 9:28pm

My steps would be:

CHKDSK
Make sure the drive has enough power / disable APM
Disable AAM or set it to performance mode, if the disk supports it
Change to badger cache.
Increase timeout to 5m0s or something

To be sure: are all nodes running on the same disk?

batelis · September 9, 2024, 6:08am

You see, it’s a different problem here: all nodes crash at the same time with the same error. This had to be something else, and in my case sfc /scannow helped and the nodes haven’t crashed in 2 days.

batelis · September 9, 2024, 6:09am

sfc /scannow helped, while chkdsk didn’t find anything wrong, and I do have my PC on power saver. My CPU goes down to 1.5 GHz sometimes, which might be too slow.

JWvdV · September 9, 2024, 6:25am

Yeah, so I think it might be underpowered. So how are they powered? How many disks / how many watts?
And are all nodes running on different drives?

JWvdV · September 9, 2024, 6:26am

Don’t think so, have about 35 nodes on one CPU (N100) idling 20% of the time.