Slowing down filewalker

As I understand it, the filewalker currently runs with lower IO priority, but it still goes at full speed, producing an IO load of 100%. At least that’s how it looks on my server:
[graph screenshots]

My suggestion is that there should be a way to throttle the filewalker - inserting a sleep() or equivalent every x files checked. Ideally this should be adjustable by the node operator (to limit the CPU IOWAIT or load average). I’m pretty sure that loading the hard drives to 100% will degrade node performance by increasing latency, even though the filewalker is supposed to be lower priority.
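
Conceptually, something like this - just a shell sketch of the idea, not how the storagenode actually walks pieces; the path and numbers are made up:

#!/bin/bash
# Hypothetical illustration only: walk a blobs directory, read each file's
# metadata, and sleep briefly every 1000 files to cap the read rate.
BLOBS_DIR=/mnt/storj/storage/blobs
count=0
find "$BLOBS_DIR" -type f | while read -r f; do
        stat --format=%s "$f" > /dev/null   # metadata read, like the used-space walk
        count=$((count + 1))
        if [ $((count % 1000)) -eq 0 ]; then
                sleep 0.2                   # the proposed throttle point
        fi
done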

That’s how process priorities work: a lower-priority process uses all available resources (in this case IO) until something with a higher priority comes along, at which point it “pauses” until the higher-priority work completes; and if that higher-priority work only uses 50% of the available resources, the rest still goes to the lower-priority process. Nothing to fix here.


I know that’s how priorities work. In theory. In practice, a lower priority process can still create a longer queue on the hard drive and increase latency. Just like zfs scrub has low priority (it even pauses itself for a short time) and still slows things down.

In my specific case, the node runs inside a VM, so while the processes inside the VM have different priorities, the VM itself has the same priority as the others.

It would be useful if the filewalker could be made to use not all available resources, but only part of them, leaving some at idle.

If the VM has the same priority as all other VMs, then your filewalker isn’t actually running at a lower priority, is it? The host sees a read request coming from your storj VM (equal priority to, say, your DNS resolver VM) and then tries its best to accommodate it. If your DNS resolver VM needs to read something as well, latency will go up, since two equal-priority processes are using the same resources.

If storj was running bare metal (ie as I run all of the nodes I look after (a lot, don’t ask)), then the host system knows how to properly handle all of the priorities. Yes, latencies do go up (props on using zabbix), but if you look at what the iowait actually is (waiting for input/output, ie waiting on the disk to read), you’ll see that it’s not an issue. If the node gets hammered, the filewalker “scales down” and the node’s response time stays the same. Latency will increase regardless of whether the filewalker runs at a lower priority, or even if it is “speed limited” (ie reads at up to 10MB/s). The disk’s head still needs to seek around to find something, so latency shoots up. Of course this is a limitation of normal HDDs and not SSDs.


The low-priority filewalker seems to work fine now: it still works as fast as it can while deferring to other IO. I don’t see why you’d want to throttle it, as the same amount of work needs to get done eventually?

If you want to assign VM priorities you can tell the hypervisor. ESXi has Disk Shares and Hyper-V has Storage QoS - I don’t know what the features are called in other tools, but they’re common tunables.
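
If the host happens to be Linux with KVM/libvirt (a guess, since I don’t know what hypervisor is in use), the equivalent knob is blkdeviotune; “storj-vm” and “vda” are placeholder names for the domain and its disk:

# Cap the storj VM's virtual disk at ~30 MB/s of reads, applied at the host level
virsh blkdeviotune storj-vm vda --read-bytes-sec 31457280 --live

# Show the current limits for that disk
virsh blkdeviotune storj-vm vda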


If the filewalker was throttled, it would still increase latency, but not by that much.

I run the node in a VM because I want to use the server for other things, you know, like Storj always says to only use stuff that would be online anyway (well, judging by how much space the node uses, multiple drives are pretty much dedicated to it). I have maxed out the RAM in that server, 192GB - 128GB is used for the ARC on the host and 16GB is given to the VM.

My problem is that the node sometimes crashes and restarts because it times out the writability check.

2024-03-01T07:56:45Z    DEBUG   Unrecoverable error     {"process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:176\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:165\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
Error: piecestore monitor: timed out after 1m0s while verifying writability of storage directory

That’s why it would be useful (for me at least) if it was possible to insert some sleep() cycles in the filewalker process to avoid the 150MB/s read data rate.

It looks like it is reading stuff that is cached by the host now, as the data rate is still high, but the load on the actual drives is low.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.68    0.00   37.81    8.86    0.18   39.47

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda           36048.68   16.07 150340.67    814.00  1497.62    43.00   3.99  72.80    0.15    2.50   5.01     4.17    50.66   0.02  88.50

That would drop the priority of the whole Storj VM, including uploads and downloads.

It would take longer, but hopefully not overload the IO system and crash the node.

Ah, I see what you’re trying to do now. Yeah, if the node can’t write to its disk for a minute: something fundamental is broken. Slowing filewalker may not “fix” it, but it may let that write test complete in 59 seconds :)


As I have already explained: your filewalker runs at normal priority even if you turned the lazy filewalker on, since your VM is running at the same priority as other VMs. If you want to continue using virtualization for something it wasn’t meant to be used for, you need to work around the limitations of your environment: tune the VM’s priorities at the host level, not inside the VM.
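
As a minimal example of host-level tuning on a Linux/KVM host (assuming the backing disks use an I/O scheduler that honors priorities, ie bfq, or cfq on older kernels, and that “storj” matches the VM’s qemu process name):

# Drop the whole storj VM's I/O class to "idle" on the host, so every other
# VM's I/O is preferred over it by the host's scheduler.
ionice -c 3 -p $(pgrep -f 'qemu.*storj')

# Check what the VM's I/O class is now
ionice -p $(pgrep -f 'qemu.*storj')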

I’ll try and explain it even further: your filewalker isn’t overloading the system, your storj VM is. Typical HDD response times are (looks up a node on my zabbix) less than 800msec on a disk that is not part of an array and is dedicated to storj. This figure includes filewalker + gc-filewalker + normal usage, all running concurrently, and more importantly the databases are still on the same disk. If you have timeouts (1 minute, ie 60x1000 = 60000msec), you have other problems. You can add all the RAM you want, run a SLOG, run special vdevs, even dedicate NVMe disks to databases, but your storage array will still be bottlenecked. Why? Because your host doesn’t know that the filewalker is trying to read an 8MB file. It sees a process trying to read a 12TB file (oversimplifying it, and I will get flak for this, but I can work with that).


The Storj node was intended to be used on a system that is also used for other things. Or so Storj tells us every time: use only the hardware that would be online anyway. Well, I would not keep a server online and not use it for anything.
As such, I think the Storj node should be usable inside a VM on a host that has other stuff running on it. For that purpose it should be able to slow its non-essential processes down so they do not affect the host.
At least I think that’s reasonable.

I have a SLOG, and RAM does help. AFAIK the filewalker mostly reads metadata. Metadata takes up less space, and if it was fully cached, I would not have this problem. Right now the filewalker process is still reading at 130-150MB/s, but most of that is already cached, so it no longer affects performance and the drives are not that busy.

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
15:51:26     3     0      0     0    0     0    0     0    0  126G  127G   7.7G
15:52:26   33K     7      0     7    0     0    0     1    0  126G  127G   7.7G
15:53:26   34K     9      0     9    0     0    0     1    0  127G  127G   8.0G
15:54:26   34K     8      0     8    0     0    0     2    0  127G  127G   7.8G
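
(That is what OpenZFS’s arcstat tool prints, one sample per minute; assuming it ships with your ZFS packages, the same thing can be watched with:)

# Print ARC hit/miss statistics every 60 seconds
arcstat 60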

It looks like the filewalker is a separate process. I remember it is somehow possible to throttle the IO of a process using cgroups, but I do not remember the details; maybe I can limit it myself.

I’ll give a clearer example:

I run storj baremetal. That means the following processes are running:

  1. Process A
  2. Process B
  3. Process C
  4. Storj

In my environment, storj can run without any problems no matter how many processes I load up on the system. The system knows that storj’s filewalker (which will spawn as a separate process, number 5 in the list above) needs to run at a lower priority than the other 4 processes already running. This is “use what you have, you can run it on unused resources”.

You run storj in a VM. You have the following processes running:

  1. Process A
  2. Process B
  3. Process C
  4. Process D

I don’t see storj anywhere there (remember, I am the host system). Your storj VM is process D. Still it doesn’t show up in my records and I know nothing about it. The filewalker comes on. I now see process D that is requesting a lot of data. I need to provide that data, so processes 1-3 get a smaller slice of the resources. Your filewalker isn’t process number 5. It is still process number 4 and I know nothing about running it at a lower priority. That is how virtualization works.

Person ABC runs storj in a container environment. He has the following processes running:

  1. Process A
  2. Process B
  3. Process C
  4. Storj

Filewalker comes on. A new process with a lower priority is created, bringing the total number of processes to 5. As the host system, I know that process number 5 needs to be deprioritized in case any of the other processes need access to resources.

Compare the three examples: you can see that no matter how many resources you throw at the problem, it can NOT be solved. Those old enough to remember slashdot in its glory days will get this: the obligatory car analogy is trying to go up a very steep hill in 5th gear. Yes, you can get there, but I can get there faster in 3rd gear. Even if your car has 1000bhp, I can get there with 100bhp. My car costs $6K, your car costs $100K.

Don’t get me wrong, virtualization is perfect for running different OS environments. Containerization is perfect for running same OS environments since the kernel (the main part of the OS) is shared between them. Bare metal is perfect for running processes that do NOT need any separation (ie due to security requirements).


Containerization is more complicated to set up, and to set up correctly. I know it can be made to “almost” behave like VMs, but, again, it is more complicated. VMs are simpler.

Consider something like md-raid. It has a speed limit. Despite running on bare metal (no point in using it inside a VM) and the rebuild process having lower priority, it is still additionally limited to a configurable rate so it does not impact latency too much.
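
For reference, those md-raid limits are plain sysctls (values are in KB/s per device; the defaults shown are from memory and may differ between distributions):

# Current rebuild/resync rate limits
cat /proc/sys/dev/raid/speed_limit_min   # guaranteed minimum, typically 1000
cat /proc/sys/dev/raid/speed_limit_max   # ceiling, typically 200000

# Lower the ceiling so a rebuild can never monopolize the disks
sysctl -w dev.raid.speed_limit_max=50000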

I think that having a speed limit for the filewalker would be a good thing, but, as it is a separate process, I should probably be able to limit it myself from inside the VM - let the uploads/downloads use as much IO as they need, but throttle the filewalker to never use above 50% IO or something.

It finally finished
[graph screenshots]

The IO load on the physical drives did not look that bad, but still:
[graph screenshot]
But I did not like the load on the host:
[graph screenshot]

Besides, just running something at low priority but letting it use all available resources may not be the best option. For example, higher CPU load means more heat and faster-spinning fans, meaning more noise. Even if it’s low priority, the CPU is still running at max power. Throttling the offending process, forcing it to leave some CPU idle, lowers the noise.
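
Hard-capping CPU is also possible with cgroups, which is what actually keeps cores idle rather than just deprioritized. A sketch using cgroup v2 syntax (on cgroup v1 hosts the controller files differ); “throttled” is a made-up group name and $PID stands for whichever process should be capped:

# Allow the group at most 50ms of CPU time per 100ms period, ie half a core.
# The cpu controller must be enabled in the parent's cgroup.subtree_control.
mkdir -p /sys/fs/cgroup/throttled
echo "50000 100000" > /sys/fs/cgroup/throttled/cpu.max
echo "$PID" > /sys/fs/cgroup/throttled/cgroup.procs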

Containers are 6 clicks away (well, on proxmox anyway). Set up the container, set up the password, add an interface, add storage, done. It is the exact same steps as setting up a VM.


It is your OS’s role to correctly implement low-priority I/O, not the application’s. If an application signals that its I/O is low priority, that should be enough.

As such, you should probably be requesting this feature in VirtIO, not in Storj.


OK, I figured it out, since apparently I’m the only one running a node inside a VM and having this problem.

#!/bin/bash
# Create a blkio cgroup (cgroup v1) that limits reads from device 8:0 (sda)
# to 30 MB/s, then move any running filewalker processes into it.
if [ ! -e /sys/fs/cgroup/blkio/filewalker ]; then
        mkdir /sys/fs/cgroup/blkio/filewalker
        echo "8:0 31457280" > /sys/fs/cgroup/blkio/filewalker/blkio.throttle.read_bps_device
fi
# Find the PIDs of the filewalker subprocesses and add them to the cgroup
fw=`ps aux | grep filewalker | grep -v grep | awk '{print $2}'`
for process in $fw; do
        echo $process > /sys/fs/cgroup/blkio/filewalker/cgroup.procs
done

This creates a cgroup, limits its read IO to 30MB/s, and if it finds a process with “filewalker” in its name, it puts it in that cgroup. I added this to cron, so it runs every 5 minutes.
It would be better if I could limit it dynamically based on the load (so that when it reads cached data it can run faster), but it looks like cgroups can only do fixed bps or fixed IOPS limits.
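
For completeness, on hosts that use cgroup v2 (where the blkio controller above does not exist), the equivalent knob would be io.max - an untested sketch with the same 8:0 device and 30MB/s figure:

# cgroup v2 equivalent of the blkio.throttle.read_bps_device setting above;
# the io controller has to be enabled in the parent's cgroup.subtree_control.
mkdir -p /sys/fs/cgroup/filewalker
echo "8:0 rbps=31457280" > /sys/fs/cgroup/filewalker/io.max
echo "$FW_PID" > /sys/fs/cgroup/filewalker/cgroup.procs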

Looks like it’s working


[graph screenshots]

And this should not affect the uploads and downloads of the node, so I should lose fewer races while the filewalker is running.

What are the specs for the storj VM? (core/thread count, ram, space)

4 cores, 16GB RAM, currently used 27TB

The host has 2x Xeon 5687 CPUs and 192GB RAM. Storage pool is made of 3x 6-drive raidz2 vdevs (4TB, 6TB, 8TB).

I see the same piecestore monitor error on my Windows machine with 4 nodes on 4 HDDs. I never had this error until node version 1.92. The error happens on all four nodes/drives randomly, but only on one drive at a time, not simultaneously.
I ran chkdsk, even with surface tests, but there is not a single error on any of the HDDs.

Normal load for that is 4 x 1.5 = 6. Anything less than 6 is normal system usage. Take into consideration that as soon as one filewalker (ie GC) starts, you can get yourself into a situation where it will not complete before the next one starts. You now have two filewalkers limited @ 30MB/s total (ie 15+15). They will not complete by the end of the week, and you now have 3 filewalkers limited @ 30 total (10+10+10). I doubt they will all finish in the next week, so you now have 5 filewalkers (1 original + 2 in the first week + 2 in the second week) limited @ 30 (6+6+6+6+6). Let it run for a month, and you’ll get the saying that “if all you have is a hammer, everything looks like a nail”.

Your storage is slow. That is the reason CPU load shoots up. As I previously said, it is all “wait” load, and you can verify this by running top inside the VM. You may have pretty capable hardware, but you are using it wrong. You have 2 x 4C x 2T = 16 threads (16 nodes if we go by the ToS). You can barely keep one node up and running. If you don’t see that there is something wrong with that, I can’t help any further.

EDIT: fixed my load calculation. we all have brainfart moments.


Kind of expected a slowdown. A lot, in many cases, unfortunately.

100%

It’s true. However, is it possible to not use a VM? Use docker/LXC instead?
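
One nice side effect of the docker route is that the engine can apply a per-device read cap to the container directly - a sketch only, with an example device and limit, and with all the usual mounts, ports and environment variables omitted:

# Limit everything the storagenode container reads from /dev/sda to 30 MB/s;
# this flag is added to the normal "docker run" command from the setup docs.
docker run -d \
  --device-read-bps /dev/sda:30mb \
  storjlabs/storagenode:latest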


Much easier than a VM actually. And it uses far fewer resources.