So I migrated the dbs, everything is running fine, EXCEPT THE USED SPACE WALKER.
Here are the results of the db scan:
Any way to get the filewalker to start running again? My nodes are basically capped out but report completely wrong numbers
It's started, but it doesn't have the extensive logging of the lazy one:
So you may track it by the disk load or by using a debug endpoint: Guide to debug my storage node, uplink, s3 gateway, satellite.
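For example, here is a minimal sketch of polling the debug endpoint for running filewalker spans. It is an assumption on my part, not taken from the guide: it assumes your debug.addr is set to 127.0.0.1:5999 and that the monkit /mon/ps listing is exposed there, so adjust it to your setup:

import urllib.request

DEBUG_URL = "http://127.0.0.1:5999/mon/ps"  # adjust to your debug.addr

def filewalker_spans():
    # /mon/ps lists the spans (functions) currently executing inside the node
    with urllib.request.urlopen(DEBUG_URL, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    # keep only the lines that look like the used-space walk
    return [line for line in text.splitlines()
            if "filewalker" in line.lower() or "SpaceUsed" in line]

if __name__ == "__main__":
    spans = filewalker_spans()
    if spans:
        print("used-space walk appears to be in progress:")
        print("\n".join(spans))
    else:
        print("no filewalker spans found - it has either finished or is not running")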
Yes, make sure that you didn't disable it on startup (it's enabled by default) and restart the node. It may not start immediately, but after a while.
Please consider the missing-logs issue linked above for the non-lazy mode.
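For reference, this is roughly what the relevant line in your config.yaml should look like - the flag name here is my assumption of the default, so please verify it against your own config:

# keep the used-space scan on startup enabled (true is the default)
storage2.piece-scan-on-startup: true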
I'll enable the debug port and do a full stack dump shortly. However, I re-enabled the lazy filewalker and after failing (on 2024-06-18T08:59:08-04:00) it hasn't tried to restart yet...
Edit:
Yup, still getting the following on startup (with lazy enabled):
2024-06-20T07:20:06-04:00 INFO lazyfilewalker.used-space-filewalker subprocess exited with status {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "status": 1, "error": "exit status 1"}
2024-06-20T07:20:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: exit status 1", "errorVerbose": "lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-06-20T07:20:06-04:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
2024-06-20T07:20:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-06-20T07:20:06-04:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
2024-06-20T07:20:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-06-20T07:20:06-04:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-06-20T07:20:06-04:00 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-20T07:20:06-04:00 ERROR piecestore:cache error getting current used space: {"error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
And after that, the filewalker no longer attempts to run.
Yes, if it's failed with exit code 1, then it has failed completely (it will disappear from the process list).
Since it cannot finish in lazy mode, you need to disable the lazy mode to allow the filewalkers to finish.
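If it helps, a minimal sketch of how that is usually done - I'm assuming the standard option name, so please double-check it against your own config.yaml:

# config.yaml: run the used-space filewalker with normal priority instead of the lazy subprocess
pieces.enable-lazy-filewalker: false

Then restart the node; the same option can also be passed on the command line as --pieces.enable-lazy-filewalker=false.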
Just wanted to post a 32-hour update. Since I have no way of checking the logs for filewalker status, I have no idea if they are running or have crashed - but the used space in the dashboard is still completely off from the Windows (OS) reported space.
Do you have errors related to databases in your logs?
Outside of the errors I pasted above, no. And like I said, I ran the db checks and they all said OK.
Yes, I saw your database integrity check results; I only wanted confirmation that you do not have errors related to the databases in your logs.
Did you disable the lazy mode? Because the lazy one is failing for your setup, and if it has failed it will not be restarted. The non-lazy filewalker works with a normal priority, so it should not fail with the "context canceled" errors. Unfortunately the non-lazy one doesn't print anything to the logs, so you may see its progress only via the debug port or by disk load.
You may also try to use this trick to track its progress:
Yea, I disabled lazy and the error logs are silent. I didn't mean to imply that the context canceled errors were new ones (only old ones).
Does that solution (trick) work for Windows? I'm also not sure what exactly it's saying to do.
I have enabled the debug port, but I'm not sure what I need to look for in the debug URL?
After 80 hours of uptime on both nodes (8TB and 16TB), the used space hasn't changed. I strongly suspect something in the used-space cache is borked and it's preventing the used-space calculation from continuing. Is there a way for me to have the used space start from scratch???
The first measure should be a file system check. Have you done this?
Can I get your rationale for why that is necessary? My nodes have been running like this for over a week with no audit score drops, and if I hadn't artificially reduced their size they would continue to grow.
It's also known that 1.104 had an issue with the used-space filewalker (and my issue happened roughly around the time my nodes got upgraded to 1.105, which was supposed to fix said issue), so a lot of things point to it being more software than hardware.
Additionally, for 2 completely different drives, started at different times, to fail in an identical manner... seems kinda unlikely, no?
P.S.
The DBs passed all checks, and there are no other ERRORs in my logs either.
Yes, but you need to use the Resource Monitor for that - you may select a process and see which paths are currently in use on the Disk tab.
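If you prefer a scripted alternative to the Resource Monitor, here is a hedged Python sketch. It assumes you have Python with the psutil package installed, that the process name starts with "storagenode", and that you run it elevated so open_files() is permitted; it prints which blobs paths the node currently has open:

import psutil

for proc in psutil.process_iter(["name"]):
    name = (proc.info["name"] or "").lower()
    if not name.startswith("storagenode"):
        continue
    try:
        open_paths = [f.path for f in proc.open_files()]
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue
    # while the used-space walker runs, you should see paths under ...\blobs\<satellite>\<prefix>
    for path in open_paths:
        if "blobs" in path.lower():
            print(proc.pid, path)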
On the same hardware and the same OS I suppose?
For me it took just about 80 hours on 2 of my nodes to update the used space without any ingress, so if you have ingress while the walker is doing its job it could take more time.
Can you please provide some more detail about this?
It's the same Windows 10 machine with 3 nodes, 2 of which ran into this issue, correct.
As for Resource Monitor, I see F and Z are the ones of interest.
Is there a specific command or line to look for? I read through it and didn't understand what I need to look for (I did get it working on my nodes, however).
Since clearly this issue wasn't going to be escalated to the right attention, I tried mucking around with the databases. The logs strongly seem to imply a previous run was interrupted or failed in some way, so I suspect I just need to tell it to forget any previous used-space/bloom filter cache or whatever else is going on.
I may have found a way to kick it in the nuts to make it recheck everything - so far I don't see the error (with lazy on and log level INFO), so fingers crossed lazy actually completes this time. If it works I'll report back how to do it; others may find it useful to force a recheck if their cache/used-space progress gets borked like mine.