Enabling the Lazyfilewalker

There were several discussions and open concerns about filewalker processes (such as garbage collection and the used-space calculation services) competing for disk IOPS with the same priority as more important services such as customer downloads/uploads, audits and repairs. The community introduced a workaround to disable the initial used-space calculation on startup by setting the --storage2.piece-scan-on-startup flag to false.

The lazyfilewalker runs the used-space calculation and garbage collection filewalkers as low-priority subprocesses. It came up as a solution to the problems mentioned above, where these filewalkers run with the same I/O priority as customer downloads.

The complete functionality is available to nodes running v1.80.0 and later. By default, the lazyfilewalker is disabled, and it can be enabled by setting --pieces.enable-lazy-filewalker=true.
We need volunteers to help test this functionality (especially for large nodes running on Windows).
Once enabled, you can check the logs to see if the subprocess runs/completes successfully:

docker logs storagenode | grep "lazyfilewalker.*.subprocess"
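
For docker nodes, run parameters go after the image name in the docker run command. This is just a trimmed sketch; the placeholder line stands in for whatever mounts, ports and environment variables your existing run command already uses:

docker run -d --name storagenode \
    ... your existing mounts, ports and environment variables ... \
    storjlabs/storagenode:latest \
    --pieces.enable-lazy-filewalker=true

Alternatively, put pieces.enable-lazy-filewalker: true into your config.yaml and restart the node.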

Does the setting

--storage2.piece-scan-on-startup=false

control both the new and the old filewalker? I need to uncomment this to get the lazy one running.

Is this the expected output from the logfile?

2023-06-15T14:27:55.325Z        INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-06-15T14:27:55.327Z        INFO    lazyfilewalker.used-space-filewalker    subprocess started      {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-06-15T14:27:55.406Z        INFO    lazyfilewalker.used-space-filewalker.subprocess Database started        {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode"}
2023-06-15T14:27:55.406Z        INFO    lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started   {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode"}

The used-space lazyfilewalker will not run if --storage2.piece-scan-on-startup is set to false. If you disabled it, you would have to re-enable it or remove the flag, since the default is true. For those who previously disabled the used-space scan on startup, your config should be:

--storage2.piece-scan-on-startup=true
--pieces.enable-lazy-filewalker=true

Yes, and it shows that the lazyfilewalker is running.


The lazyfilewalker is used for garbage collection as well. So no, you don’t need to enable the used-space calculation on startup to get the benefit of the lazyfilewalker.


@clement can we enable the lazy filewalker by default? It runs great on my nodes. I haven’t noticed any problems.

I’m happy to volunteer, but I don’t fully understand what I need to do. Please explain, and I’ll be more than happy to assist.
My Windows node is currently 10 TB. Is that considered a large node?

You need to add this option to the config.yaml:

pieces.enable-lazy-filewalker: true

then save the config and restart the node, either from the Services applet or from an elevated PowerShell:

Restart-Service storagenode
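
To confirm that the subprocess actually starts after the restart, you can search the node’s log for the lazyfilewalker lines. This is a sketch assuming the default log location of the Windows GUI install; adjust the path if your log is written elsewhere:

Get-Content "C:\Program Files\Storj\Storage Node\storagenode.log" -Tail 500 | Select-String "lazyfilewalker"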

The question remains: is it required to run on every start or restart?
After an initial space scan has been performed, it might be enough if it only runs every once in a while.


Version 1.80.10 / Windows binary

I get the following errors on every satellite.

2023-06-25T18:21:16.181+0200	INFO	lazyfilewalker.used-space-filewalker	starting subprocess	{"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2023-06-25T18:21:16.187+0200	INFO	lazyfilewalker.used-space-filewalker	subprocess started	{"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2023-06-25T18:21:19.607+0200	ERROR	lazyfilewalker.used-space-filewalker	subprocess exited with error	{"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "parsing time \"2023-06-25T18:21:16.268+0200\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"+0200\" as \"Z07:00\""}
2023-06-25T18:21:19.607+0200	ERROR	pieces	failed to lazywalk space used by satellite	{"error": "lazyfilewalker: parsing time \"2023-06-25T18:21:16.268+0200\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"+0200\" as \"Z07:00\"", "errorVerbose": "lazyfilewalker: parsing time \"2023-06-25T18:21:16.268+0200\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"+0200\" as \"Z07:00\"\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:80\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:709\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}

Is it something with the time in general, or with the timezone?

parsing time "2023-06-25T18:21:16.268+0200" as "2006-01-02T15:04:05Z07:00"

cannot parse "+0200" as "Z07:00"

Edit: Apparently the lazy fw is running anyway, because there are additional PIDs with “background” I/O priority in the Resource Monitor.

This looks like a time format issue. I’ve created a GitHub issue for it: [storagenode] Lazyfilewalker fails on windows with errors from parsing logs from stderr of subprocess · Issue #6006 · storj/storj · GitHub
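
For anyone curious about the root cause visible in the error: the subprocess log lines carry a zone offset without a colon (+0200), while the parent parses them with Go’s RFC 3339 layout, which requires one. A standalone sketch (not the storagenode code) that reproduces the mismatch:

package main

import (
	"fmt"
	"time"
)

func main() {
	ts := "2023-06-25T18:21:16.268+0200"

	// time.RFC3339 is "2006-01-02T15:04:05Z07:00"; the "Z07:00" element
	// only accepts "Z" or an offset with a colon, so this fails with:
	// cannot parse "+0200" as "Z07:00"
	_, err := time.Parse(time.RFC3339, ts)
	fmt.Println(err)

	// A layout with a colon-less offset element parses the same string fine.
	t, err := time.Parse("2006-01-02T15:04:05Z0700", ts)
	fmt.Println(t, err)
}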

I have the error too. Should I be worried? Should I disable it again?

You don’t need to worry about it. When the lazyfilewalker fails, it falls back to the regular filewalker.


Similar error on Windows. I will wait for the next release to test it again.

Hello, is it supposed to show at the process level? Because it appears with the same priority as the other main Storj processes (20 here):

20   0  722M 54428 10040 D  1.3  0.7  0:08.37 /…/storagenode run …
20   0  720M 16176  8464 D  1.3  0.2  1:06.73 /…/storagenode used-space-filewalker …

It’s an internal function, not the entire process, so it’s unlikely you can see it at the process level.


Yes, it should show at process level because it’s a (sub)process.

On Linux, we set it to run with the best-effort priority class (IOPRIO_CLASS_BE), which is by default the same for all processes that have no specified priority. But for the lazyfilewalker subprocess, we set the priority level (or class data) within this class to the lowest level (which is 7).

From the man page:

IOPRIO_CLASS_BE (2)
              This is the best-effort scheduling class, which is the
              default for any process that hasn't set a specific I/O
              priority.  The class data (priority) determines how much
              I/O bandwidth the process will get.  Best-effort priority
              levels are analogous to CPU nice values (see
              getpriority(2)).  The priority level determines a priority
              relative to other processes in the best-effort scheduling
              class.  Priority levels range from 0 (highest) to 7
              (lowest).

So you see, the lazyfilewalker subprocess has the same priority class as the other processes, but has the lowest priority level in that class and hence gets less I/O bandwidth.
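
If you want to verify this on a Linux node, one option is to query the I/O scheduling class of the running subprocess. This is just a sketch; it assumes util-linux’s ionice and pgrep are available on the host, and on docker you would run it inside the container via docker exec:

# print the I/O class and level of the used-space-filewalker subprocess
ionice -p $(pgrep -f "storagenode used-space-filewalker")

It should report something like best-effort: prio 7 for the subprocess.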


I have the option storage2.piece-scan-on-startup: false set on pretty much all my nodes. Should I remove it and enable the lazyfilewalker? Or is it running now with the disabled piece scan on startup as well?

I would recommend allowing the piece scan on startup and enabling the lazyfilewalker. If the piece scan on startup is disabled, the lazyfilewalker will only run for garbage collection, not for the used-space calculation.


Thank you for the detailed answer, I was indeed confusing the CPU priority with the I/O priority. The I/O priority column is visible in htop; it is not enabled by default, but can be added through the setup menu.


@clement How about the cleanup job that removes pieces from the trash folder after 7 days? I believe that is a filewalker process as well, just on the trash folder instead of the blobs folder. Is that cleanup process also running with low I/O priority? Does the I/O priority also affect the delete operation, or would that be kicked off with normal priority?