So I migrated the dbs, everything is running fine, EXCEPT THE USED SPACE WALKER.
Here are the results of the db scan:
Any way to get the filewalker to start running again? My nodes are basically capped out but report completely wrong numbers
It's started, but it doesn't have the extensive logging of the lazy one:
So you may track it by the disk load or by using a debug endpoint: Guide to debug my storage node, uplink, s3 gateway, satellite.
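For example, here is a minimal sketch of polling the debug endpoint for running filewalker spans. It is an assumption on my part, not taken from the guide: it assumes your debug.addr is set to 127.0.0.1:5999 and that the monkit /mon/ps listing is exposed there, so adjust it to your setup:

import urllib.request

DEBUG_URL = "http://127.0.0.1:5999/mon/ps"  # adjust to your debug.addr

def filewalker_spans():
    # /mon/ps lists the spans (functions) currently executing inside the node
    with urllib.request.urlopen(DEBUG_URL, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    # keep only the lines that look like the used-space walk
    return [line for line in text.splitlines()
            if "filewalker" in line.lower() or "SpaceUsed" in line]

if __name__ == "__main__":
    spans = filewalker_spans()
    if spans:
        print("used-space walk appears to be in progress:")
        print("\n".join(spans))
    else:
        print("no filewalker spans found - it has either finished or is not running")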
Yes, make sure that you didn't disable it on startup (it's enabled by default) and restart the node. It may not start immediately, but after a while.
Please consider the missing-logs issue linked above for the non-lazy mode.
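For reference, this is roughly what the relevant line in your config.yaml should look like - the flag name here is my assumption of the default, so please verify it against your own config:

# keep the used-space scan on startup enabled (true is the default)
storage2.piece-scan-on-startup: true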
I'll enable the debug port and do a full stack dump shortly. However, I re-enabled the lazy filewalker and after failing (on 2024-06-18T08:59:08-04:00) it hasn't tried to restart yet...
Edit:
Yup, still getting the following on startup (with lazy enabled):
2024-06-20T07:20:06-04:00 INFO lazyfilewalker.used-space-filewalker subprocess exited with status {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "status": 1, "error": "exit status 1"}
2024-06-20T07:20:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: exit status 1", "errorVerbose": "lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-06-20T07:20:06-04:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
2024-06-20T07:20:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-06-20T07:20:06-04:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
2024-06-20T07:20:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-06-20T07:20:06-04:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-06-20T07:20:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-20T07:20:06-04:00 ERROR piecestore:cache error getting current used space: {"error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
And after that, the filewalker no longer attempts to run.
Yes, if it's failed with exit code 1, then it has failed completely (it will disappear from the process list).
Since it cannot finish in lazy mode, you need to disable the lazy mode to allow the filewalkers to finish.
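If it helps, a minimal sketch of how that is usually done - I'm assuming the standard option name, so please double-check it against your own config.yaml:

# config.yaml: run the used-space filewalker with normal priority instead of the lazy subprocess
pieces.enable-lazy-filewalker: false

Then restart the node; the same option can also be passed on the command line as --pieces.enable-lazy-filewalker=false.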
Just wanted to post a 32-hour update. Since I have no way of checking the logs for filewalker status, I have no idea if they are running or have crashed - but the used space in the dashboard is still completely off from the Windows (OS) reported space.
Do you have errors related to databases in your logs?
Outside of the errors I pasted above, no. And like I said, I ran the db checks and they all said OK.
Yes, I saw your database integrity check results; I only wanted confirmation that you do not have errors related to the databases in your logs.
Did you disable the lazy mode? Because the lazy one is failing for your setup, and if it has failed it will not be restarted. The non-lazy filewalker works with a normal priority, so it should not fail with the "context canceled" errors. Unfortunately the non-lazy one doesn't print anything to the logs, so you may see its progress only via the debug port or by disk load.
You may also try to use this trick to track its progress:
Yea, I disabled lazy and the error logs are silent. I didn't mean to imply that the context canceled errors were new ones (only old ones).
Does that solution (trick) work for Windows? I'm also not sure what exactly it's saying to do.
I have enabled the debug port, but I'm not sure what I need to look for in the debug URL?
After 80 hours of uptime on both nodes (8TB and 16TB), the used space hasn't changed. I strongly suspect something in the used-space cache is borked and it's preventing the used-space calculation from continuing. Is there a way for me to have the used space start from scratch???
The first measure should be a file system check. Have you done this?
Can I get your rationale for why that is necessary? My nodes have been running like this for over a week with no audit score drops, and if I hadn't artificially reduced their size they would continue to grow.
It's also known that 1.104 had an issue with the used-space filewalker (and my issue happened roughly around the time my nodes got upgraded to 1.105, which was supposed to fix said issue), so a lot of things point to it being more software than hardware.
Additionally, for 2 completely different drives, started at different times, to fail in an identical manner... seems kinda unlikely, no?
P.S.
The DBs passed all checks, and there are no other ERRORs in my logs either.
Yes, but you need to use the Resource Monitor for that - you may select a process and see which paths are currently in use on the Disk tab.
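If you prefer a scripted alternative to the Resource Monitor, here is a hedged Python sketch. It assumes you have Python with the psutil package installed, that the process name starts with "storagenode", and that you run it elevated so open_files() is permitted; it prints which blobs paths the node currently has open:

import psutil

for proc in psutil.process_iter(["name"]):
    name = (proc.info["name"] or "").lower()
    if not name.startswith("storagenode"):
        continue
    try:
        open_paths = [f.path for f in proc.open_files()]
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue
    # while the used-space walker runs, you should see paths under ...\blobs\<satellite>\<prefix>
    for path in open_paths:
        if "blobs" in path.lower():
            print(proc.pid, path)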
On the same hardware and the same OS I suppose?
For me it took just about 80 hours on 2 of my nodes to update the used space without any ingress, so if you have ingress while the walker is doing its job it could take more time.
Can you please provide some more detail about this?
It's the same Windows 10 machine with 3 nodes, 2 of which ran into this issue, correct.
As for Resource Monitor, I see F and Z are the ones of interest.
Is there a specific command or line to look for? I read through it and didn't understand what I need to look for (I did get it working on my nodes, however).
Since clearly this issue wasn't going to be escalated to the right attention, I tried mucking around with the databases. The logs strongly seem to imply a previous run was interrupted or failed in some way, so I suspect I just need to tell it to forget any previous used-space/bloom filter cache or whatever else is going on.
I may have found a way to kick it in the nuts to make it recheck everything - so far I don't see the error (with lazy on and log level INFO), so fingers crossed lazy actually completes this time. If it works I'll report back how to do it; others may find it useful to force a recheck if their cache/used-space progress gets borked like mine.