Discussion: Advantages / disadvantages lazy filewalker

Solu · August 1, 2024, 10:26am

Hi folks,

Does someone really tested the advantages of lazy filewalker compared to node performance?
I would like to know at what point the lazy mode doesn‘t make any sense compared to the “normal” mode.

As far is i understand the lazy mode was implemented to support low to mid performance nodes
to keep a good balance between filewalker process and data access responsibility.
So for me, low performance nodes means something like:

ARM CPU
1 HDD per node on USB
2GB RAM or less per node
no db on SSD
no cache
(Well, @littleskunk your tuned rpi might not be meant here ;> )…

Hypothesis: mid to high performance nodes could be operated more efficient if the filewalker is set to “normal”?

Mid to high performance nodes:

Xeon, Ryzen CPU
1 HDD per node on SATA/SAS
8GB RAM or more per node
db on SSD/NVMe
with or without optional filestat cache

Anyone?

Roxor · August 1, 2024, 11:10am

I thought the lazy filewalker just allowed other HDD IO (such as node ingress/egress) to take priority? To me the node should always have priority over housekeeping tasks no matter how fast the hardware is. That would make using “non-lazy” filewalker a crutch to get runtimes down… at the expense of gimping your node while it runs.

What do you mean by running a node “more efficiently”? If normal filewalker occasionally competed with node IO (and perhaps loses you a race) wouldn’t that be less efficient?

Filewalker runtimes boil down to raw IO capabilities of drive, metadata caching by memory/SSD, and competition for access to the drive. Don’t make your node fight other processes for access. Unless you’re trying the new BadgerDB features (which don’t support lazy-mode yet)… using normal filewalkers is doing it wrong

TL;DR; Use ZFS metadata devices to make metadata access fast, and leave filewalkers as lazy

EasyRhino · August 1, 2024, 4:59pm

I don’t think it’s about CPU usage… so much as disk usage.

Running non-lazy filewalker will often lead your disk 100% busy between it and normal storj behaviors (including uploads, downloads, and garbage collection). This could lead to your node losing on uploads or downloads due to slower performance.

Lazy filewalker is much more chill. my disk isn’t at 100% even while it’s running.

normal filewalker can sometimes finish doing a used space run much faster. Hours instead of days. Days instead of weeks.

In older releases, the normal filewalker didn’t print anything to the log when it started or stopped but that is fixed by at least ver 108.

The lazy filewalker is more likely to fail with a “context canceled” error. I’m not the definitive expert but this may be a sensitivity to slow disk performance. Or just a bug.

I generally stick with the lazy walker UNLESS

it’s failing
I have an urgent need to finish a filewalker (like reported disk usage is way off)

However, ironically, if things are operating smoothly on a node and disk reporting is accurage and the filewalker can complete, then there also wouldn’t be an issue running normal walkers…

ACarneiro · August 1, 2024, 5:56pm

Yes, but it’s gotten to the point where there are SO many issues with filewalkers that I would personally prefer gimping my node for a few hours (or days) in order to get all the housekeeping done and THEN have them work more reliably.
It’s a bit of a sad state of affairs right now, but I’m keeping my head down and waiting for the issues to gradually be fixed.
And I am sure they will.

EasyRhino · August 1, 2024, 10:27pm

and to be fair, a “gimped” node running filewalkers still seems to be perfectly acceptable of uploading and downloading storj requests.

Alexey · August 2, 2024, 4:09am

perhaps not good to have them in lazy, if the node is running in the VM: the host is not aware of the low IO priority processes inside the VM, so in almost all cases the lazy filewalker wouldn’t have a chance to get at least something out of the common pot with IOPS.

This is usually happen on a slow disk or VM or in even worse setup - Windows VM.

See also:

Alexey · August 2, 2024, 4:14am

Did you compare the success rate with a lazy filewalker running and with lazy off?
I think they will differ, especially on low-end devices. Perhaps significantly.

Solu · August 2, 2024, 10:14am

I think “efficiency” is much more then just only lose a race here and there. It is more like how the node performs in terms of won races compared to running costs.

What is see is, that with lazy implementation the filewalker runs days instead of hours especially on my 20TB nodes. I saw a decent increase of power usage during the lazy mode because the disk handles I/O much longer then before…

So i observed the power consumption a bit and noticed that the disk needs nearly the same power compared to the “normal” filewalker. I think this is because the mechanic inside the disk is constantly used too.

So the question is, is it worth it to run the filewalker for days with 8.5W per disk instead of hours with 9.7W per disk? And how does the win/lose ratio look like. I could imagine that potentially won races do not cover the higher costs especially if you run multiple nodes.

Don‘t get me wrong here… i think the lazy mode is very useful for low performance nodes in order to participate in the network but there is this salty downside on power usage that might could be avoided if the node is better equipped with ram/db on ssd and lazy off…

Roxor · August 2, 2024, 2:41pm

Long-term power use is definately a concern for SNOs (especially large ones). If most can’t afford to have all metadata for filewalkers in RAM all the time: is perhaps our goldilocks config to have it on SSDs instead?

Like if all housekeeping is sent to 0.5w SSDs and only data ingress/egress makes it to the 8-10w HDDs… that would be faster and more power-efficient that any filewalker config that hits only the HDD. I guess the question turns into “Are SSDs affordable enough to be worth it?”

This is probably where a Storjling steps in to remind us to “only use what you already own”

pangolin · August 2, 2024, 6:09pm

It doesn’t matter in my opinion. I mean a storj hdd should never go to idle state, so the difference will be hard to recognize.

Julio · August 2, 2024, 8:45pm

I offer a simple quick solution, monitoring and adjustment apps like “Process Lasso”, can be configured to distinguish lazy walker process running and throttle it’s: i/o, affinity, memory, cpu speed, etc., within a VM. Additionally, it can automatically balance an entire regular or system running in a VM, by manipulating the process hogs. As well, it can prioritize any specific app/process, or set of them, keep memory trimmed, etc. etc… There are shitloads of options.

2 1/2 cents

Alexey · August 3, 2024, 2:07am

But for some reason every Windows VM is still slow. You can search across the forum and quickly figure out, that most of issues with disk discrepancy, slow filewalkers, slow GC, everything slow and CPU/RAM hungry are starting with “I’m running Windows VM on VMWare”.