Tuning the filewalker

This is a subjective observation, but my HP Microserver Gen7 became much more responsive after I added more RAM. The box was handling ~12 TB of nodes at the time with 2 GiB of RAM; I upgraded that to 16 GiB. I wouldn’t be surprised if the same were true for a Synology device.

Could you try defragmenting your directories using e2fsck -D? This was suggested before in another thread; it would be interesting to see whether it helps in your case. This operation likely requires downtime though, as the file system cannot be mounted during the process.
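For anyone who wants to see what that involves on a plain Linux box, a rough sketch with a placeholder device and mount point (on a Synology there are extra services to stop and unmount first, as discussed further down):

```sh
# e2fsck must never run on a mounted filesystem, so stop the node and unmount first.
docker stop -t 300 storagenode       # or however your node is run
sudo umount /mnt/storagenode         # placeholder mount point
sudo e2fsck -f -D /dev/sdX1          # -D re-optimizes (compacts and re-indexes) directories
sudo mount /dev/sdX1 /mnt/storagenode
docker start storagenode
```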

On my Microserver the file walker takes around 8 minutes per TB in a similar setup (ext4, no RAID, noatime, no caches, etc., with probably the only difference being that I set the inode size to 128 bytes).
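Since the question of inodes comes up just below: the inode size is fixed when the filesystem is created and cannot be changed afterwards, so the 128-byte setting is presumably done at format time with something like the sketch below (placeholder device; the mkfs step reformats the disk and destroys its data):

```sh
sudo tune2fs -l /dev/sdX1 | grep 'Inode size'   # ext4 defaults to 256 bytes on most distros
sudo mkfs.ext4 -I 128 /dev/sdX1                 # DESTROYS DATA: new fs with 128-byte inodes,
                                                # halving the metadata to be read and cached
```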

What is an “inode”?
I’m not familiar with Linux, I just copy-paste commands. What do you mean by “the file system cannot be mounted”? I can stop and rm the node… other than that I don’t know.

I’d add more RAM. 2GB is pretty thin; you would benefit from the extra cache that more RAM provides. Whether or not that will help with the file walker, I am not sure. Your performance is below average, so something in your configuration is causing it. Are you sure you aren’t using Synology’s default RAID, which I believe is ZFS? Even for straight reads, though, I wouldn’t think it would be as slow as what you are seeing.

1 Like

No RAID or ZFS, just a Basic volume with ext4. I have a system with 6GB RAM; I’ll do another test.
It’s just very intriguing that no one sees these filewalker times on Raspberry Pis or weaker systems. Or does Synology just make such crappy builds? Or… do they sabotage other enterprise HDD brands to promote their own? I saw an ad saying that their HDDs have 23% more performance than the competition. How? When no spectacular improvements have been made to SATA 3 7200 rpm drives? The tech is at the top of its evolution, and the drives are pretty similar in performance.

I believe Bright Silence runs at least one node on a Synology, so he may have some recommendations.

I have many single-board computers, including Raspberry Pis with USB drives, but they are not carrying over 4TB each. The file walker churns on them, but not long enough for me to notice. I’m not restarting these nodes very often, other than when there is an update, so that also reduces any potential noise. It does seem like our users with larger disks experience bigger waits with the file walker process, which obviously makes sense. I just wonder if the larger drives compound the problem, or if it is just a case of more files, more drive activity.

Many SNOs with big drives are more experienced than me, and they use cache setups (SSD, RAM, etc.). Maybe that’s why nobody observed this problem. Storagenode is supposed to be runnable by anyone with a spare drive, so advanced setups should not be a necessity. I think I followed all the recommendations in the docs without complicating things unnecessarily; I use a system that is built for 24/7 data storage and an enterprise HDD. So I think I did my best here. It’s really disappointing realizing that I spent all that money on a bad setup, with no clear explanation of why the performance is so bad…

I think many SNOs just don’t monitor their nodes much…
The filewalker does take a lot of time on my nodes too, but that’s mainly because most use 2.5" SMR drives. I think the slowest one takes around 12 hours for roughly 4TB stored.

And for a reason I don’t get, it was even worse when my RPi 4 (with 4GB RAM) was running my nodes with Raspbian 32bit. Now that I switched it to 64bit, the filewalker is twice as fast.
No idea why :person_shrugging:

I agree with @Knowledge: more RAM is always better, as the OS can use it to cache a lot of useful information; it usually makes a system faster after some uptime.

2 Likes

Those who monitor their nodes are probably mostly the “more experienced” ones who may know how to optimize it or are using a better setup.

Yeah, more RAM would speed things up.

Then it’s better if you don’t try, sorry. There’s a risk of losing data with these commands, and I don’t want you to feel I’m responsible for any loss.

@snorkel
I believe you are not alone. I have a QNAP TS-230 with 2 WD Red 8TB drives and I encounter the same problems. I think the NAS doesn’t keep up with the requests when the filewalker or garbage collector starts. I had to restart the nodes to clear the errors.

From my experience (I am running multiple nodes, and have dealt with this several times) it is a kinda complex situation. Let me explain:

  1. Storj has some processes (filewalker, garbage collector) that need to go through the files on the disk to read the metadata (filename, size, etc.). That becomes a really expensive operation once a node grows past a certain point, because the number of files in each folder increases along with the storage used.

  2. Linux has mechanisms for that: a special cache called the inode cache, which stores information about inodes (which are like “the pointers” to data on the disk), and the dentry cache, which stores information about directories. Those caches avoid a lot of disk I/O, but they live in RAM (see the sketch after this list for how to inspect them).
    That’s the reason why more RAM means less impact from those processes, especially on restarts of the node (like updates), because the computer itself is not restarted and those caches are already populated.

  3. There is another situation happening, which is the Linux kernel’s own “garbage collection” of those caches. When they are almost full, for every entry Linux tries to cache it has to free some space first, and that requires extra logic to select which old entry it has to evict.
    That’s the reason why a cold start is more performant than later executions on RAM-constrained systems.
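A read-only way to peek at how much RAM those two caches currently occupy on an ext4 system (the names below are standard kernel slab caches, nothing Storj-specific):

```sh
# ext4_inode_cache = cached inodes, dentry = cached directory entries
sudo grep -E '^(ext4_inode_cache|dentry) ' /proc/slabinfo
# or interactively, sorted by cache size:
sudo slabtop -o -s c | head -n 15
```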

Taking into account that the garbage collector is needed, and that we should run the filewalker at least once every few weeks (there is nothing written on this matter, but I, at least, don’t want the occupied-space information to be wrong), the best solution is to have RAM in proportion to the size of your nodes. I’d say that 2 GB for a 10 TB node is not enough memory, but your mileage may vary.


Oh! And regarding missing some requests while this situation is happening, just take into account that at that moment:

  1. your disk is saturated with I/O, fetching metadata and traversing directories
  2. your CPU is stuck waiting for the disk (iowait)
  3. your remaining CPU time is spent figuring out which entries of the RAM caches can be evicted to store new elements.

So it is expected that resource-constrained systems will suffer a bit. The way to check is to use a tool like htop and look at your system’s load average: if it is much bigger than the number of available cores, nasty things are taking your CPU away from serving Storj requests.
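For example (standard tools, nothing Storj-specific; iostat comes from the sysstat package):

```sh
nproc          # number of CPU cores to compare the load average against
uptime         # 1/5/15-minute load averages
iostat -x 5    # per-disk %util and await; ~100% util during the filewalker = saturated disk
```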

6 Likes

Feel free to @ mention me if it’s relevant. I kind of stumbled across this mention now. Yes, I run 12 nodes on my Synology. 4 on the internal array (SHR-2 with R/W SSD cache of 1TB), the rest on individual disks in a 10-bay external enclosure. I think the fact that I have one of the Xeon Diskstations and use a sizeable SSD cache makes my experience not very representative. I also have 16GB of RAM. The file walker on my largest node (20TB) runs for a few hours, for what it’s worth. But I also have other things going on on that array and 3 other nodes running on it. Most of this hardware pre-existed my Storj nodes, so yes, it serves other purposes as well at the same time. The filewalker on the external HDDs seems faster, but they are also smaller and do nothing outside of Storj. So I’m not even sure how much the SSD cache helps and I’m not gonna remove it to find out. :wink: I fear it won’t be pretty.

I wouldn’t recommend doing this on Synology unless you really know what you’re doing. It’s a lot more complicated since you would have to manually shut down services and unmount stuff, which Synology doesn’t make easy. And I can tell you from experience that Synology support isn’t much help in getting you through this. I was forced to do something similar at some point to fix a volume that refused to expand because of file system issues. It was a pain.

Sidenote: Whenever I ask Synology for support, they always ask for my admin password… I seriously can’t believe that is still standard practice in this day and age. Since I have always refused to provide that, they don’t really help you along with commands on how to fix it yourself either. So you’re kind of on your own.

2 Likes

I did another test. Man, it’s a huge difference. The extra RAM helps big time.
On a Synology DS220+, 6GB RAM, Exos X16 16TB SN04, ext4, etc., the filewalker took 2h 30m for 4.32TB of data, so 0.58h/TB. Huge, huge difference! Going from 2GB to 6GB, the FW time dropped from 4h/TB to under 1h/TB. So for anyone interested in reducing FW time, increase your RAM.

6 Likes

It would be useful to have some official guide with the recommended RAM for different node sizes. The “500GB” minimum is very misleading.

I read that code some time ago and I don’t recall anything specific that would require a lot of RAM during the file walker process. Yet we observe that adding RAM helps a lot, so we can’t deny there is a problem; this is probably the clearest observation pointing to RAM being the bottleneck here.

It would be nice if someone with debugging skills who observes the problem on their node could investigate it. I wouldn’t be surprised if a slight tuning of a single component, let’s say golang’s garbage collector, actually fixed the problem… like maybe setting the GOMEMLIMIT environment variable, or something like that.
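For anyone who wants to experiment, a hedged example of what that could look like for a docker-based node. GOMEMLIMIT is a standard Go 1.19+ runtime variable that caps the Go heap and makes the GC run more eagerly; the 1GiB value here is arbitrary, and whether this actually helps the filewalker is exactly what would need to be measured:

```sh
# add the environment variable to your usual docker run command:
docker run -d --name storagenode \
  -e GOMEMLIMIT=1GiB \
  storjlabs/storagenode:latest     # plus your usual ports, mounts and identity flags
```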

Filewalker accesses the metadata of every file. The more metadata the system can keep in the RAM cache, the faster the filewalker runs. If you have enough RAM to hold all the metadata, the filewalker runs the fastest. I do not know how to calculate exactly how much metadata is in a particular partition.
The data partition on my node has 29695764 used inodes and one inode is 256B in size, so to cache all of them the VM needs 7.6GB, but there are also directories and probably other stuff that needs to be cached.
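For reference, that estimate can be reproduced like this (device and mount point are placeholders):

```sh
df -i /mnt/storagenode                          # "IUsed" column = used inodes (29695764 here)
sudo tune2fs -l /dev/sdX1 | grep 'Inode size'   # 256 bytes on this filesystem
echo $((29695764 * 256))                        # = 7602115584 bytes, i.e. roughly 7.6 GB
```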

3 Likes

Yet this process, with a cold cache, takes only around 8 minutes/TB on my machine. “Cold cache” meaning you can’t count on the inodes and dentries already being in the cache.
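If anyone wants to benchmark their own filewalker with a genuinely cold cache without rebooting, the standard way to drop the caches is shown below. It throws away all cached metadata, so the next run hits the disk for everything:

```sh
sync                                        # flush dirty pages first
echo 3 | sudo tee /proc/sys/vm/drop_caches  # 3 = drop page cache plus dentries and inodes
```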

Maybe those buffers in “buffered” memory are the bottleneck. I just found out what cache and buffers are, so I’m just making guesses. What can be seen in my screenshots is that free RAM is very small, used RAM and cache are also small; only the buffers are huge.

The higher memory consumption is mostly the buffers of uploads waiting for the disk to be able to write them.
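A simple way to watch where the memory goes while the node is busy, with standard tools:

```sh
free -h       # "buff/cache" = page cache + buffers + reclaimable slab (inode/dentry caches)
vmstat 5      # "b" = processes blocked on I/O, "wa" = CPU time stuck in iowait
```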

The filewalker eats up all the I/O of the disk, so other processes see higher access times for writing/reading.

Some metadata (directories, etc.) probably gets accessed multiple times. If there is not enough RAM to keep it in the cache, everything slows down. Then there are the databases, which probably should be cached as well, at least some of them. Some things probably get cached before they are accessed the first time (read-ahead cache).

If the filewalker itself needed a lot of memory, it would show up as the node process using RAM.

5 Likes