It seems that the new feature “save-state-resume GC filewalker” isn’t functioning as expected

“Save-state-resume feature for the used-space filewalker” (resume from the last point where it stopped): is it already working correctly?

As far as I can see, it has not been merged yet:

https://review.dev.storj.io/c/storj/storj/+/12806

It seems new features are getting added:

https://review.dev.storj.io/c/storj/storj/+/13421/2

3 Likes

but build-verify finished successfully… does it just need to wait for the merge?

That would make sense if only step 1 (build-verify) has completed so far.

I wish they had released it before adding new features.
I have been waiting desperately for this feature for weeks now, for my nodes where I had to turn the filewalker off because it would not finish.
I currently have no information about the used space on them.

2 Likes

The process:

  1. build-verify → this is like a quick (?) smoke test
  2. two :+2: reviews
  3. build-premerge (the remaining, long-running integration tests)
  4. merge
3 Likes

I totally agree with you, we need this feature, but I couldn’t resist adding my wider notes :wink:

  1. The problem with filewalkers is retrieving the size and last-modification time. On the ext4 file system, directory entries contain only the file names; an additional read of the inode is required to get the size and last-modification time of each file.

  2. Assuming you have 10M pieces, that’s at least 10M additional IO operations. At 200 IOPS, it’s 10,000,000 / 200 = 50,000 seconds ≈ 14 hours to complete.

  3. This particular feature (save-state-resume GC filewalker) can help survive restarts, but it won’t fully solve the problem.

  4. We write piece files only once, so it should be possible to cache the size/modification-time information.

  5. If you have enough memory, the OS can also cache. I have experience with a server with 126 GB of memory. Because the majority of the RAM is not otherwise used, Linux happily uses 66 GB to cache inodes → the filewalker can finish in 5-20 minutes, even with ten million files. (That’s the easiest workaround: if you have memory, just put it in the server.)

  6. @littleskunk reported similar results using RAM (and SSD) for the ZFS cache.

  7. But let’s say you don’t have enough RAM. I am doing some experiments, and I found that a simple DB-based cache can still help (49 minutes of walking for 14M pieces with a hot cache; with a cold cache it was 15 h, and without the cache it was 13 h).

My personal opinion is that the walkers should be improved (in addition to save-state-resume) and made faster by caching the OS-level size/creation-time information; a rough sketch of what I mean follows below.
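To illustrate the idea from points 4-7, here is a minimal, hypothetical Go sketch of a size/mtime cache sitting in front of `os.Stat`. The names (`PieceMeta`, `metaCache`) and the in-memory map are my own illustration, not Storj’s actual API; a real node would persist the map (e.g. in a small key-value DB) so it survives restarts:

```go
// Hypothetical sketch of a size/mtime cache in front of os.Stat.
package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

// PieceMeta holds the two fields the filewalker needs per piece.
type PieceMeta struct {
	Size    int64
	ModTime time.Time
}

// metaCache is an in-memory stand-in; a real node would persist this
// so the cache survives restarts.
type metaCache struct {
	mu sync.RWMutex
	m  map[string]PieceMeta
}

func newMetaCache() *metaCache {
	return &metaCache{m: make(map[string]PieceMeta)}
}

// Stat returns cached metadata when available and falls back to an
// inode read (os.Stat) only on a cache miss.
func (c *metaCache) Stat(path string) (PieceMeta, error) {
	c.mu.RLock()
	meta, ok := c.m[path]
	c.mu.RUnlock()
	if ok {
		return meta, nil // hot path: no disk IO
	}
	fi, err := os.Stat(path) // cold path: one inode read
	if err != nil {
		return PieceMeta{}, err
	}
	meta = PieceMeta{Size: fi.Size(), ModTime: fi.ModTime()}
	c.mu.Lock()
	c.m[path] = meta
	c.mu.Unlock()
	return meta, nil
}

func main() {
	cache := newMetaCache()
	meta, err := cache.Stat("/etc/hostname")
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Printf("size=%d modtime=%s\n", meta.Size, meta.ModTime)
}
```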

6 Likes

Are those IO constraints related to the max-space-per-node recommendation (24 TB)? Like, if you have 3-4 million small files per TB… at some point do the filewalker-type housekeeping tasks just have too many files to check all the time?

It’s a different type of recommendation, I guess. With 24 TB you may hit other limits as well.

This problem depends on the disk IO + file system. You can be a happy owner of a 24 TB node with good disks (for example, with RAID0 you may get higher aggregate IOPS), or one using ZFS + an SSD cache.

I have also seen dozens of storagenodes running on a single 1.7 TB NVMe disk (QA satellite only, for testing) without any IO pressure.

I would re-phrase the recommendation like this:

  1. don’t use more than 24 TB for one storagenode instance (as written on that page). More may work, but it’s not really tested…
  2. as a rule of thumb, use one SN process per disk
  3. the maximum space should depend on the max IO speed (if you can only use slow HDDs, it can be better to use more, but smaller, disks); see the back-of-envelope sketch after this list
  4. if you have big, slow disks, it’s better to have lots of memory (at least for ext4)
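As a back-of-envelope illustration of point 3, here is a small Go sketch estimating the duration of a full used-space walk, assuming one extra inode read per piece and roughly 3.5M pieces per TB. Both figures are assumptions taken from this thread, not measurements:

```go
// Back-of-envelope estimate of a full used-space walk, assuming one
// extra inode read (IO operation) per piece. The piece density and
// IOPS figures are illustrative assumptions, not measured values.
package main

import "fmt"

// walkHours converts the number of stat IO operations into hours
// at a given sustained IOPS rate.
func walkHours(pieces, iops float64) float64 {
	return pieces / iops / 3600 // seconds of IO, converted to hours
}

func main() {
	// ~3.5M pieces per TB is a rough guess for small customer files.
	for _, tb := range []float64{4, 12, 24} {
		pieces := tb * 3.5e6
		fmt.Printf("%2.0f TB: %5.1f h at 200 IOPS, %5.1f h at 1000 IOPS\n",
			tb, walkHours(pieces, 200), walkHours(pieces, 1000))
	}
}
```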
4 Likes

If it helps survive restarts better, fine. Making it configurable is also good. I had my own thoughts on that here and here; I don’t know if @clement is aware of them.

But it would also be important to get the base stop-resume feature out and add the improvements later.
It is important to get to a state where we receive reliable information from the node, which is currently not the case in many areas, for different reasons, unfortunately:

  • Used space is not correct
  • Trash folder is not updated
  • Bandwidth display is not correct

Then we see glitches on the satellites, inconsistent average used values, etc.
I know some of this has already been fixed, but the fixes have not yet arrived on Docker nodes.

Currently the situation is that the used-space filewalker does not complete on some nodes, which means it continuously restarts and retries from the start. This is why I need this feature: to get it done and the used space updated at least once.
I agree that it does not really solve the underlying problem, which is that these operations need too many IOPS. That is also why I made my earlier suggestion not to always repeat the filewalker on restart when it is not necessary, but to make it configurable when it runs.

When you refer to the inode cache, do you mean vfs_cache_pressure? I see that it can influence what the OS tends to cache, but I don’t know how useful that is. Is it better to cache inodes or pieces for the customers? As I understood it, the used-space filewalker is basically a one-time thing, at least as long as the node keeps running, so basically “wasting” cache on that instead of on piece data to serve to customers sounds wrong. But maybe caching inodes also helps with the pieces for customers; I don’t know, maybe…

Of course, if there are ways to cache better or more, that would be good.

At least it sounds like this could be doable: not relying on the inode, since we basically have all the data from the moment a piece gets written, and storing this information for filewalker use. Maybe this information could even be moved to a different disk or even a ramdisk, freeing the data disk from inode reads for that purpose altogether; a rough sketch of that idea follows below.
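A minimal, hypothetical Go sketch of that write-time recording idea: append (path, size, mtime) to a sidecar log at the moment a piece is written, so later walks never need to touch the inode. The log location, file names, and record format here are my own assumptions for illustration, not anything Storj does today:

```go
// Hypothetical sketch: record (path, size, mtime) in a sidecar log at
// the moment a piece is written, so later walks skip the inode read.
package main

import (
	"fmt"
	"os"
	"time"
)

// recordPieceMeta appends one tab-separated line per stored piece to a
// metadata log that could live on another disk or a ramdisk.
func recordPieceMeta(logPath, piecePath string, size int64, mod time.Time) error {
	f, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "%s\t%d\t%d\n", piecePath, size, mod.Unix())
	return err
}

func main() {
	// Simulate a piece write, then record its metadata immediately.
	data := []byte("piece payload")
	if err := os.WriteFile("piece.sj1", data, 0o644); err != nil {
		panic(err)
	}
	if err := recordPieceMeta("pieces-meta.log", "piece.sj1",
		int64(len(data)), time.Now()); err != nil {
		panic(err)
	}
}
```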

Together with a database, I could not agree more.

This is already possible with ZFS, for example: exactly what you suggest. It wouldn’t be a ramdisk directly, but RAM (+ SSD?) used to cache this metadata.
It’s partially possible for ext4 too.

The cache isn’t just used once; it’s used every time the file metadata is needed. That could be every time GC runs, for example.

There is no point in caching pieces for customers. Resources are better spent on caching information about the millions of pieces the node is storing, speeding up every internal process (used space, GC, trash cleanup).

1 Like

Right, I forgot about the other filewalkers.

What vfs_cache_pressure value do you suggest?

I don’t change it on my systems.

1 Like

Neither do I, but even without changing it, I was surprised that Linux did exactly what I wished:

Having lots of unused memory:

And a significant part of the cache (the yellow lines) is spent caching inodes (ext4_inode_cache):

(The first is from the htop output, the second from slabtop.)

2 Likes

When in htop: press F2 (Setup) > down arrow to Meters > right arrow until you get to Memory > press Space. It should look like this:
[screenshot: htop memory meter setup]

This is the result:
[screenshot: htop memory meter showing cache usage]

Edit: because Discourse is doing its thing where the replied-to mention evaporates into thin air: @elek

3 Likes


I see things are going to be discussed: Storage node performance for filewalker is still extremely slow on large nodes · Issue #6998 · storj/storj · GitHub

Please do not delay this feature any further even if you might agree on other solutions to improve the filewalker situation.

2 Likes

Here is one problem I have now, as I had to turn off the regular filewalker:

It should be gone once the save-state-resume feature is working.

What if we add that to the filename?
Something like piecename.sj1.year-month-day, e.g., taking a random piece from one of my nodes: xwx3njdry2vhdthddq23773yfwcm4tv73qdpv65o7topw3pepq.sj1.2024-02-07.
Then we have everything once we have the file name. We could even add the size:

xwx3njdry2vhdthddq23773yfwcm4tv73qdpv65o7topw3pepq.sj1.2024-02-07.2319872

With some simple text-manipulation operations you have all the information about the file (see the parsing sketch below). I just tested it: ext4 does accept such a filename.
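A minimal Go sketch of that text manipulation, assuming the `<piece>.sj1.<date>.<size>` scheme proposed above (this naming is this thread’s proposal, not an existing Storj format):

```go
// Parse the proposed filename scheme <piece>.sj1.<date>.<size>.
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parsePieceName splits a name into piece ID, stored date and size.
func parsePieceName(name string) (id string, stored time.Time, size int64, err error) {
	parts := strings.Split(name, ".")
	if len(parts) != 4 || parts[1] != "sj1" {
		return "", time.Time{}, 0, fmt.Errorf("unexpected name: %q", name)
	}
	stored, err = time.Parse("2006-01-02", parts[2])
	if err != nil {
		return "", time.Time{}, 0, err
	}
	size, err = strconv.ParseInt(parts[3], 10, 64)
	return parts[0], stored, size, err
}

func main() {
	name := "xwx3njdry2vhdthddq23773yfwcm4tv73qdpv65o7topw3pepq.sj1.2024-02-07.2319872"
	id, stored, size, err := parsePieceName(name)
	if err != nil {
		panic(err)
	}
	fmt.Printf("id=%s date=%s size=%d\n", id, stored.Format("2006-01-02"), size)
}
```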

3 Likes

Then you make downloads much slower. You can’t “just” open a file; you have to scan the whole subdirectory each time to figure out the possible file names for a given piece ID, roughly like the sketch below.
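A Go sketch of the lookup cost being described: with metadata embedded in the name, opening a piece means reading the whole directory and prefix-matching, instead of a single `os.Open` of a known path. The directory path here is illustrative:

```go
// Sketch of the lookup cost: one readdir of the whole subdirectory
// per download, instead of opening a known path directly.
package main

import (
	"fmt"
	"os"
	"strings"
)

// openPieceByPrefix finds the file whose name starts with pieceID.
func openPieceByPrefix(dir, pieceID string) (*os.File, error) {
	entries, err := os.ReadDir(dir) // extra IO on every download
	if err != nil {
		return nil, err
	}
	for _, e := range entries {
		if strings.HasPrefix(e.Name(), pieceID+".sj1") {
			return os.Open(dir + "/" + e.Name())
		}
	}
	return nil, fmt.Errorf("piece %s not found", pieceID)
}

func main() {
	f, err := openPieceByPrefix("/tmp", "xwx3njdry2vhdthddq23773yfwcm4tv73qdpv65o7topw3pepq")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()
	fmt.Println("opened", f.Name())
}
```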

1 Like