There were several discussions and open concerns related to filewalker processes (such as garbage collection and the used-space calculation services) competing for disk IOPS with the same priority as more important services such as customer downloads/uploads, audits and repairs. The community introduced workarounds to disable the initial used-space calculation on startup by adding the --storage2.piece-scan-on-startup flag.
The lazyfilewalker runs the used-space calculation and garbage collection filewalkers as low-priority subprocesses. It came up as a solution to the problems mentioned earlier, where these filewalkers run with the same I/O priority as customer downloads.
The complete functionality is available to nodes running on v1.80.0 and later. By default, the lazyfilewalker is disabled, and it can be enabled by setting --pieces.enable-lazy-filewalker=true.
We need volunteers to help test this functionality (especially for large nodes running on Windows).
Once enabled, you can check the logs to see if the subprocess runs/completes successfully:
The used-space lazyfilewalker will not run if --storage2.piece-scan-on-startup is set to false. If you disabled it, you will have to re-enable it or remove the flag, since the default is true. For those who previously disabled the used-space scan on startup, your config should be:
The lazyfilewalker is used for garbage collection as well. So no, you don’t need to enable the used-space calculation on startup to get the benefit of the lazyfilewalker.
I’m happy to volunteer, but I don’t fully understand what I need to do. Please explain, and I’ll be more than happy to assist.
My Windows node is currently 10 TB. Is that considered a large node?
The question remains: is it required to run on every start or restart?
After an initial space scan has been performed, it may be enough if it runs every once in a while.
2023-06-25T18:21:16.181+0200 INFO lazyfilewalker.used-space-filewalker starting subprocess {"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2023-06-25T18:21:16.187+0200 INFO lazyfilewalker.used-space-filewalker subprocess started {"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2023-06-25T18:21:19.607+0200 ERROR lazyfilewalker.used-space-filewalker subprocess exited with error {"satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "parsing time \"2023-06-25T18:21:16.268+0200\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"+0200\" as \"Z07:00\""}
2023-06-25T18:21:19.607+0200 ERROR pieces failed to lazywalk space used by satellite {"error": "lazyfilewalker: parsing time \"2023-06-25T18:21:16.268+0200\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"+0200\" as \"Z07:00\"", "errorVerbose": "lazyfilewalker: parsing time \"2023-06-25T18:21:16.268+0200\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"+0200\" as \"Z07:00\"\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:80\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:709\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:57\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
Is it something with the time in general or timezone?
parsing time \"2023-06-25T18:21:16.268+0200\" as \"2006-01-02T15:04:05Z07:00\
cannot parse \"+0200\" as \"Z07:00\"
Edit: Apparently the lazy fw is running anyway, because there are additional PIDs with “background” I/O priority in the resource monitor.
Yes, it should show at process level because it’s a (sub)process.
On Linux, we set it to run with best-effort priority (IOPRIO_CLASS_BE), which is also the default class for any process with no specified priority. But for the lazyfilewalker subprocess, we set the priority level (or class data) within this class to the lowest level (which is 7).
IOPRIO_CLASS_BE (2)
This is the best-effort scheduling class, which is the
default for any process that hasn't set a specific I/O
priority. The class data (priority) determines how much
I/O bandwidth the process will get. Best-effort priority
levels are analogous to CPU nice values (see
getpriority(2)). The priority level determines a priority
relative to other processes in the best-effort scheduling
class. Priority levels range from 0 (highest) to 7
(lowest).
So you see the lazyfilewalker subprocess has the same priority class as the other processes but has the lowest priority level in that class and hence gets less I/O bandwidth.
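To make the class/level split concrete, here is a small Go sketch of how Linux packs the two into the single integer passed to ioprio_set(2): the class goes above a 13-bit class-data field. The function names are hypothetical and the sketch only does the arithmetic; it is not taken from the storagenode source and does not make the system call.

```go
package main

import "fmt"

// Constants mirroring the Linux ioprio interface.
const (
	ioprioClassShift = 13 // the class is stored above the 13 class-data bits
	ioprioClassBE    = 2  // IOPRIO_CLASS_BE, the best-effort class
	lowestBELevel    = 7  // lowest best-effort priority level
)

// ioprioValue packs a scheduling class and a priority level into the
// single integer that ioprio_set(2) expects.
func ioprioValue(class, level int) int {
	return class<<ioprioClassShift | level
}

// ioprioClass extracts the scheduling class from a packed value.
func ioprioClass(v int) int { return v >> ioprioClassShift }

// ioprioLevel extracts the priority level (class data) from a packed value.
func ioprioLevel(v int) int { return v & (1<<ioprioClassShift - 1) }

func main() {
	v := ioprioValue(ioprioClassBE, lowestBELevel)
	// prints "ioprio=16391 class=2 level=7"
	fmt.Printf("ioprio=%d class=%d level=%d\n", v, ioprioClass(v), ioprioLevel(v))
}
```

A process that never sets an I/O priority stays in the same best-effort class, so the only difference for the subprocess is the level field.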
I pretty much have the option storage2.piece-scan-on-startup: false on all the nodes. Should I remove it and enable the lazyfilewalker? Or does it now run with the piece scan on startup disabled as well?
I would recommend allowing piece scan on startup and enabling the lazyfilewalker. The lazyfilewalker will only run for garbage collection but not used-space calculation if the piece scan on startup is disabled.
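The interaction between the two settings can be summed up in a few lines of Go. This is only a sketch of the behaviour described in this thread; `walkerPlan` and `planWalkers` are hypothetical names, not storagenode APIs, and the case with the lazyfilewalker disabled is my reading of the earlier posts (walkers then run at normal priority).

```go
package main

import "fmt"

// walkerPlan describes, for one combination of settings, whether the
// used-space scan runs on startup and which walkers run as low-priority
// lazyfilewalker subprocesses. Hypothetical type, for illustration only.
type walkerPlan struct {
	UsedSpaceOnStartup    bool // used-space scan runs at startup at all
	UsedSpaceLazy         bool // ...and runs as a low-priority subprocess
	GarbageCollectionLazy bool // garbage collection runs as a low-priority subprocess
}

// planWalkers maps the two settings discussed in this thread
// (pieces.enable-lazy-filewalker and storage2.piece-scan-on-startup)
// to the resulting behaviour.
func planWalkers(lazyEnabled, scanOnStartup bool) walkerPlan {
	return walkerPlan{
		UsedSpaceOnStartup:    scanOnStartup,
		UsedSpaceLazy:         lazyEnabled && scanOnStartup,
		GarbageCollectionLazy: lazyEnabled,
	}
}

func main() {
	for _, lazy := range []bool{false, true} {
		for _, scan := range []bool{false, true} {
			fmt.Printf("lazy=%v scan=%v -> %+v\n", lazy, scan, planWalkers(lazy, scan))
		}
	}
}
```

With the scan disabled and the lazyfilewalker enabled, only garbage collection gets the low-priority treatment, which is the case described in the recommendation above.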
Thank you for the detailed answer. I was indeed confusing CPU priority with I/O priority. The I/O priority can be seen in htop; the column is not enabled by default, but it can be turned on through the setup menu.
@clement How about the cleanup job that removes pieces from the trash folder after 7 days? I believe that is a file walker process as well, just on the trash folder and not on the blobs folder. Is that cleanup process also running with low I/O priority? Does the I/O priority also affect the delete operation, or would that be kicked off with normal priority?