Fatal Error on my Node / Timeout after 1 min

So I didn't change anything, but the node has been up for 36h now. Will update here if it stops again. Still on 1.75.2.

It probably depends on the load. If it's too high, your disk likely cannot keep up.

The problem is that your disk is still too slow to respond, and you could run into problems with audits if the load gets a little higher.
Also, once your node falls more than two versions behind, it will stop receiving any ingress.
I would recommend adding 30s to the storage2.monitor.verify-dir-writable-timeout parameter to make it

storage2.monitor.verify-dir-writable-timeout: 1m30s

Save the config and restart the node, either from the Services applet or from an elevated PowerShell:

Restart-Service storagenode
In Russian:

Unfortunately, you haven't solved the real problem, which is that the disk responds slowly during the write check. That means in the best case you miss out on some data, and in the worst case the node can fail audits.
Rolling back the version won't help; two versions from now your node will stop accepting traffic.

I can suggest increasing the timeout by 30 seconds; maybe that will be enough to let your disk respond in time

storage2.monitor.verify-dir-writable-timeout: 1m30s

Save the config and restart the node.
In PowerShell run as administrator:

Restart-Service storagenode

Trying this for today with 4m0s. But the Storj team must understand that not many people will provide space on SSDs, and if HDDs are too slow, you will not gain enough space, I think.

This is not about speed, it's about your safety. If the disk is not responding because of a failure or hardware issues, then on all previous versions your node would be disqualified for not answering audit requests (the request would hang forever, while an audit request has a timeout of 5 minutes and will be retried two more times for the same piece before the audit is considered failed). Too many failed audits and your node is disqualified; at the moment that takes only a few hours.

So adding these timeouts to the writeability and readability checks is there to protect your node from disqualification if it has temporary hardware problems.
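For example, in config.yaml (1m30s is just an example value, and the readable-timeout key is assumed to follow the same naming pattern as the writable one):

# give the disk more time to answer the periodic write check
storage2.monitor.verify-dir-writable-timeout: 1m30s
# assumed counterpart for the read check; only needed if read checks also time out
storage2.monitor.verify-dir-readable-timeout: 1m30s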

So my node is operating normally again … I don't know why … online +22h (I do a daily restart on Windows; I did that before the crashes as well).
My dashboard also updates correctly.

So I changed a few things.

Changed log.level from info to error only (just check whether the storagenode log file is more than a few GB; mine was 4 GB before), stopped the service, archived the existing storagenode log file, and restarted the service; it automatically generates a new one.
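Roughly those steps in elevated PowerShell (paths and file names are just examples; adjust them to your install):

# stop the node so the log file is released
Stop-Service storagenode
# move the old log aside (path and archive name are examples)
Move-Item "C:\Program Files\Storj\Storage Node\storagenode.log" "C:\Program Files\Storj\Storage Node\storagenode-old.log"
# start the node again; it creates a fresh log file automatically
Start-Service storagenode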

Maybe also check the orders folder (archive / unsent) for anything older than 2 days; I deleted those too. Maybe it got stuck in a loop or something … I don't know …

Under Windows Services, I enabled the recovery options for the storagenode service, so that if it crashes or has a problem it automatically restarts the service; you can also have it restart Windows or run a script instead.
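The same recovery setup can be sketched from elevated PowerShell, assuming the service name storagenode (the retry delays are example values):

# restart the service 60 seconds after each failure; reset the failure counter after one day
sc.exe failure storagenode reset= 86400 actions= restart/60000/restart/60000/restart/60000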

Specs of the node:

WD 8 TB, CMR, external power, connected via USB 3.0
Mini PC, Windows 10, 4 GB RAM, Intel Atom x5 Z8350; node operating for around 4 months with no errors, except since the 1.75.2 update - that week was a disaster.

Thanks for the info.

Another question: if one of my nodes gets disqualified, does it also affect the others?

I believe this has reduced the load on your disk.

This could help restart the node without user intervention, but it is not a solution for the root cause - the disk not responding.

Unlikely. All nodes are independent of each other. However, if they share the same hardware, a hardware failure could cause all the other nodes to fail as well.

How can I set log.level from info to error only? @Alexey

These are comparable specs to my node.
We have also seen the error on more performant disks/systems.
Can a 100/40 Mbit DSL line really overwhelm a decent mini PC that shows no sign of struggling otherwise?
Load spikes that last just under 2 minutes?

My logs were on the OS SSD, so I think that has no connection to this "ghost in the machine".

Also, I did not change the timeouts! I still doubt this whole "your drive is too slow to respond" theory, since my bandwidth bottleneck is the DSL line.
How do you explain the simultaneous start of the problem? Is there really that much test traffic compared to normal?
And the 75 GB of ingress a day, even during the buggy period? No spikes visible.
My guess is still a bug in the networking in general.

After all, SNOs could not find any hardware errors, and then it disappears in a magical way?

2023-03-31T13:09:08.477+0200	ERROR	services	unexpected shutdown of a runner	{"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-03-31T13:09:11.746+0200	ERROR	orders	cleaning DB archive	{"error": "ordersdb: database is locked", "errorVerbose": "ordersdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*ordersDB).CleanArchive:325\n\tstorj.io/storj/storagenode/orders.(*Service).CleanArchive:157\n\tstorj.io/storj/storagenode/orders.(*Service).Run.func2:141\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-03-31T13:09:12.265+0200	ERROR	piecestore:cache	error getting current used space: 	{"error": "context canceled; context canceled", "errorVerbose": "group:\n--- context canceled\n--- context canceled"}
2023-03-31T13:09:13.349+0200	ERROR	piecestore	download failed	{"Piece ID": "S2QITWO74AFYLFJ76TV437HSHDJXFF5OLEXRRVKCZHQHMFFGB5QA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 0, "Remote Address": "216.66.40.82:15612", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:650\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-03-31T13:09:13.455+0200	ERROR	gracefulexit:chore	error retrieving satellites.	{"error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits.func1:152\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:164\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:58\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-03-31T13:09:13.467+0200	ERROR	piecestore	error sending hash and order limit	{"error": "context canceled"}
2023-03-31T13:09:13.530+0200	ERROR	piecedeleter	could not send delete piece to trash	{"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "QJEB4PKY2ICANQ6Y5MZ3FXH4RACSIOTADUCKIY2J5QDZUKSZEXGQ", "error": "pieces error: pieceexpirationdb: context canceled", "errorVerbose": "pieces error: pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).Trash:112\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:387\n\tstorj.io/storj/storagenode/pieces.(*Deleter).deleteOrTrash:185\n\tstorj.io/storj/storagenode/pieces.(*Deleter).work:135\n\tstorj.io/storj/storagenode/pieces.(*Deleter).Run.func1:72\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-03-31T13:09:14.242+0200	FATAL	Unrecoverable error	{"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

So I have now changed the timer from 1m0s to 1m30s; maybe that helps … :confused:

Open your config with Notepad++ (as administrator), search for log.level and change its value from info to error, then just save and restart the storagenode service.
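The resulting line in config.yaml:

# log only errors to keep the log file small
log.level: error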


My setup is also separated:

the Windows drive holds my Storj software,

the external HDD is only for the storagenode data.

Before the update my audits were at 100% and my CPU was mostly at 90% (a lot running in the background).

Now I have stopped the background stuff: CPU is around 30% and RAM around ~20%,
and I still get this not-responding timer …

I thought my problem was gone, but today the service ran for around 8h and then restarted itself with the errors above (picture one post ago).

My traffic is normal, around 60-80 GB daily.

Now I have changed the timer by +30s. Let's see what happens next ^^

The other day I explained the solution I found to my problem. I noticed that the hard disk was at 100% usage for too many hours because the node checks the files each time the service is restarted. The more space occupied on the hard drive, the longer the check takes and the more stressed the hard drive is, so the probability of a write or read error from that strain is higher.

The first thing I did when I saw the write problem was to run a "chkdsk /f", and the disk was fine.

Then I lowered the maximum capacity of the node so that it did not receive any ingress and checked that there was no reading problem. I let it do the file check on purpose to see whether it produced the error. No error after 72 hours, and disk usage was down to 5-20%.

Today I decided to restore the config.yaml as it was before the errors, but with "storage2.piece-scan-on-startup: false" added. The hard disk was at 100% usage for approximately 1 hour, then dropped to 5-20% again, and it has been working correctly for +8 hours.
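That is, a single line in config.yaml (the comment describes the tradeoff as I understand it):

# skip the file walker scan on startup to reduce disk load after restarts
# (the used-space figures may be less accurate until a full scan runs again)
storage2.piece-scan-on-startup: false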

For now, I consider the problem solved. If anything else comes up, I will let you know.

I did nothing except set the log level to error and enable the automatic restart on failure.
The error seems to be gone for me too. Restarting was never the problem, except right after the update.

A hard disk at 100% TIME in use is what I expect when it is bombarded with read and write requests from the network, but it is not at its bandwidth maximum.
I'd be worried otherwise.

So it can do more and still be at 100% time in use. At 12 to 36 ms response time it's fine, and the maximum r/w speed is whatever fits through the line, which for private connections is usually only a fraction of what the disk can do and is not what Storj demands.

My guess: since I read that trash gets deleted after one week, the update messed this trash handling up.
After a week the faulty trash got deleted and the error is magically gone.
Notice it lasted exactly one week.

So I'm waiting for the next update and looking at you, Storj
@Knowledge
Also, if it turns out I'm right, I demand some bug-hunting money. :wink:

@Alexey, is it safe to set this parameter to 4 minutes?
storage2.monitor.verify-dir-writable-timeout: 4m00s

Or what value would be optimal?

Or is it better to leave it at 1 minute and let the service restart, rather than get other problems because of the larger interval?

Please explain.

P.S. Nothing helps. Even with 4 minutes set I get the error
FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 4m0s while verifying writability of storage directory"

Will there be any solution from the developers? Or should we just enjoy this?

I have the same problem. Has anyone found out how to solve it?

Version 1.76.1 is out on GitHub. Install it manually and see if you still have problems.
How to install? Just download the exe files, stop the node services, replace the old exe files in your install dir, and start the services.
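A rough sketch of that in elevated PowerShell, assuming the default install path and a downloaded storagenode.exe (paths and the updater service may differ on your setup):

# stop the node (and the updater, if installed) so the binary can be replaced
Stop-Service storagenode
Stop-Service storagenode-updater
# back up the old binary, then drop in the new one (paths are examples)
Copy-Item "C:\Program Files\Storj\Storage Node\storagenode.exe" "C:\Program Files\Storj\Storage Node\storagenode.exe.bak"
Copy-Item "$env:USERPROFILE\Downloads\storagenode.exe" "C:\Program Files\Storj\Storage Node\storagenode.exe" -Force
# start the services again
Start-Service storagenode-updater
Start-Service storagenode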

Where do I change that line?

Load meaning how busy your drive is. If it is an SMR drive, for example, it will be slow on writes. If you use RAID, it will be as slow on writes as the slowest drive in the RAID.
In short, your node is unable to write/read a small file even after 1 minute. That is the root cause.

The load on your HDD is simply reduced, so now it is able to read/write in less than 1 minute. Maybe the CPU is involved too.