Maybe, but how do I know when the file walker is scheduled to run?
The question is, why is the disk slow?
It could be fragmentation, a big MFT, or many other causes, including a bad disk.
How to work around it and increase the timeouts: it’s in the other post.
- Stop the node
- Check and fix errors on the disk
- Run a defragmentation for this disk
- Re-enable automatic defragmentation if you disabled it (it’s enabled by default)
- Start the node
If the problem would occur again
Hi there.
I’m running a node on a Windows 10 PC. Since last week the storage node service keeps stopping. I can restart it, but an hour later it stops again. These are the last lines in storagenode.log:
2024-02-08T14:55:45+01:00 INFO piecestore upload canceled {"Piece ID": "4ZZJQZSJO6M2OJK4OGSK6ZX5EYLKSD6VTIVFCDYZQARUYUBN3VAQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 65536, "Remote Address": "5.161.176.200:54238"}
2024-02-08T14:55:45+01:00 INFO piecestore upload canceled (race lost or node shutdown) {"Piece ID": "JLAT7KP4EW6SKWGCE3MXPDFH2VHKYARBJKJWCN56QQ46G4HQDTKQ"}
2024-02-08T14:55:45+01:00 INFO piecestore upload canceled (race lost or node shutdown) {"Piece ID": "E2WPLGMSHGQZMITMPGC2VI4X3BF4UEAFUDR2BESREAVN3OTSWJZA"}
2024-02-08T14:55:45+01:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1.1:152\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1:141\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
Any thoughts?
Thank you.
Hello!
I am running version v1.101.3 on Windows 11.
The node is relatively new; it has been running for a few months.
The 4TB Western Digital HDD is external and connected to the PC via USB. I have checked the HDD for issues with the diagnostic tool Victoria, and it is fine.
2024-04-21T00:38:01+02:00 ERROR failure during run {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-04-21T00:38:01+02:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-04-22T07:34:17+02:00 ERROR failure during run {"error": "database is locked"}
2024-04-22T07:34:17+02:00 FATAL Unrecoverable error {"error": "database is locked"}
Any ideas?
That looks like the drive is going read-only, which causes Storj to throw those first errors (or something is causing writes to take over a minute). I’d expect to see entries in your Windows Event Logs around the same time.
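To know which part of the Windows Event Logs to look at, it may help to pull the timestamps of the timeout lines out of the node log first. The following is only a rough sketch: it assumes each log line starts with a timestamp and a level separated by whitespace, as in the pasted lines, and the function name and the `margin_minutes` parameter are made up for illustration. The sample lines are shortened copies of the ones posted above.

```python
from datetime import datetime, timedelta


def failure_windows(log_lines, margin_minutes=5):
    """Return (start, end) time windows around ERROR/FATAL timeout lines."""
    margin = timedelta(minutes=margin_minutes)
    windows = []
    for line in log_lines:
        # Assumed layout: <timestamp> <LEVEL> <rest of the line>
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        stamp, level, rest = parts
        if level not in ("ERROR", "FATAL") or "timed out" not in rest:
            continue
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # not a timestamp this sketch understands
        windows.append((when - margin, when + margin))
    return windows


# Shortened copies of the log lines from the post above.
sample = [
    '2024-04-21T00:38:01+02:00 ERROR failure during run {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory"}',
    '2024-04-22T07:34:17+02:00 FATAL Unrecoverable error {"error": "database is locked"}',
]

for start, end in failure_windows(sample):
    print(start.isoformat(), "to", end.isoformat())
```

Timestamps ending in `Z` (as in Docker logs) are only accepted by `datetime.fromisoformat` from Python 3.11 on; on older versions such lines are simply skipped here.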
It seems to be working now; it has run for a whole day without further issues. Before, it was happening rather often, multiple times a day. I did not do anything special besides restarting Windows.
Let’s see how it goes.
and the same solution:
Hey! I’m having a problem with one of my nodes at the moment. It updated to the newest version, 1.101.3, and after that it won’t start, saying this:
2024-05-01T11:51:16+03:00 INFO Current binary version {Service: storagenode-updater, Version: v1.101.3}
2024-05-01T11:51:16+03:00 INFO New version is being rolled out but hasn't made it to this node yet {Service: storagenode-updater}
Any ideas?
Does it provide any errors when starting? Can you search your logs for the word fatal and see what comes up?
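If the log is large, a small script can do the search and group the hits. This is just a sketch of the idea, assuming the usual layout where the JSON part starts at the first `{` on the line; the function name is made up, and the sample lines are shortened copies of ones posted earlier in this thread.

```python
import json
from collections import Counter


def fatal_errors(log_lines):
    """Count lines containing "fatal" (any case), grouped by their "error" field."""
    counts = Counter()
    for line in log_lines:
        if "fatal" not in line.lower():
            continue
        message = line.strip()  # fallback when the JSON part cannot be parsed
        start = line.find("{")
        if start != -1:
            try:
                message = json.loads(line[start:]).get("error", message)
            except json.JSONDecodeError:
                pass  # e.g. a paste where the quotes were turned into curly quotes
        counts[message] += 1
    return counts


# Shortened copies of log lines from this thread.
sample = [
    '2024-04-21T00:38:01+02:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory"}',
    '2024-04-22T07:34:17+02:00 ERROR failure during run {"error": "database is locked"}',
    '2024-04-22T07:34:17+02:00 FATAL Unrecoverable error {"error": "database is locked"}',
]

for message, count in fatal_errors(sample).most_common():
    print(count, message)
```

For a real log you would pass an open file instead of the sample list, since iterating over a file object yields its lines one by one.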
Hello @Hjallisharkimo,
Welcome to the forum!
You showed logs from the storagenode-updater, not from storagenode.
But I guess you are on the Windows GUI and have duplicated keys in your config.yaml,
and your node is stopping, am I right?
If so, you may check this:
This one
is not actually a problem. You need to wait until the release is available for your NodeID.
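For the duplicated-keys suspicion, a quick check could look like the sketch below. It assumes the config is a flat list of `key: value` lines with `#` comments, and it is not a real YAML parser. The function name is made up, and the keys and values in the sample are only illustrative, not taken from anyone's node.

```python
from collections import Counter


def duplicated_keys(config_text):
    """Return top-level keys that appear more than once in a flat key: value config."""
    keys = []
    for raw in config_text.splitlines():
        line = raw.rstrip()
        # Skip blank lines, comments, and indented (nested or continuation) lines.
        if not line or line.lstrip().startswith("#") or line[0].isspace():
            continue
        key, sep, _ = line.partition(":")
        if sep:
            keys.append(key.strip())
    return sorted(key for key, count in Counter(keys).items() if count > 1)


# Illustrative config only; the values are made up.
sample = """\
# example values, not from a real node
storage.allocated-disk-space: 2.00 TB
log.level: info
storage.allocated-disk-space: 4.00 TB
"""

print(duplicated_keys(sample))
```

Commented-out options are ignored on purpose, so a key that appears once as a comment and once as an active line is not reported.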
I have a similar problem on my node:
2024-05-17T12:06:03Z ERROR services unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-05-17T12:06:03Z INFO lazyfilewalker.used-space-filewalker subprocess exited with status {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "status": -1, "error": "signal: killed"}
2024-05-17T12:06:03Z ERROR pieces failed to lazywalk space used by satellite {"Process": "storagenode", "error": "lazyfilewalker: signal: killed", "errorVerbose": "lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-05-17T12:06:03Z INFO lazyfilewalker.used-space-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-05-17T12:06:03Z ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
2024-05-17T12:06:03Z ERROR pieces failed to lazywalk space used by satellite {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:704\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-05-17T12:06:18Z WARN services service takes long to shutdown {"Process": "storagenode", "name": "retain"}
2024-05-17T12:06:18Z WARN services service takes long to shutdown {"Process": "storagenode", "name": "piecestore:cache"}
2024-05-17T12:06:18Z WARN servers service takes long to shutdown {"Process": "storagenode", "name": "server"}
2024-05-17T12:06:18Z INFO services slow shutdown {"Process": "storagenode", "stack": "goroutine 1078 [running]:
After that there is a very long stack trace; I can provide the full log if that helps. I am running a Docker node on my Synology RS1221+.
Your node hit a timeout while checking writability. Try creating a file with some content on the drive to check whether writes succeed and how long they take.
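If you prefer a script over doing it by hand, something like the sketch below writes a test file, forces it to disk, reads it back, and reports how long that took. It is not what the storagenode monitor does internally, only a manual check in the same spirit. The function name and the 1 MiB default size are arbitrary choices, and the demo runs against a temporary directory, so you would point it at your own storage directory instead.

```python
import os
import tempfile
import time


def timed_write_check(directory, size_bytes=1024 * 1024):
    """Write, fsync, read back, and delete a test file; return elapsed seconds."""
    payload = os.urandom(size_bytes)
    started = time.monotonic()
    fd, path = tempfile.mkstemp(prefix="write-test-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the data really reaches the disk
        with open(path, "rb") as handle:
            if handle.read() != payload:
                raise OSError("read-back data does not match what was written")
    finally:
        os.remove(path)  # clean up the test file even if the check failed
    return time.monotonic() - started


# Demo against a throwaway directory; use the node's storage directory instead.
with tempfile.TemporaryDirectory() as demo_dir:
    elapsed = timed_write_check(demo_dir)
    print(f"write check took {elapsed:.3f}s")
```

If this takes anywhere near the 1m0s from the log, or raises an error, the problem is with the disk or its connection rather than with the node software.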
I’ve been having a weird problem for the last 2 days.
The Storj PC stays on because it also runs my Pi-hole VM; there are no issues with that, and the network itself seems to be up.
Storj went offline 2 days ago during the night; I woke up, checked, and started it again.
I rebooted the machine yesterday because it went offline; it was offline last night and went offline again earlier.
My network itself is up, as I have been online the whole day.
I don’t think it’s a resource issue. Any suggestions? I can provide the logs so someone can advise.
It shows suspension at 95% as well.
The scores will fix themselves over time. Check your logs from when it went down to get a better picture of what happened. You could try searching for fatal errors in the log.
Have you tried another editor?