I totally agree with you, we need this feature, but I couldn't resist adding some wider notes:
- The problem with filewalkers is retrieving the size and last modification time. On ext4, directory entries contain only the file names, so an additional read of each file's inode is required to get its size and last modification time (see the sketch below).
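To illustrate, here is a minimal Go sketch (the path is hypothetical): reading the directory gives you the names almost for free, while asking for size/mtime forces one extra lstat, i.e. one inode read, per file.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Cheap: one getdents pass returns the names stored in the directory itself.
	entries, err := os.ReadDir("/path/to/blobs") // hypothetical piece directory
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		name := e.Name() // already available, no inode access needed

		// Expensive at scale: Info() issues an lstat(2) per entry,
		// i.e. the additional inode read mentioned above.
		info, err := e.Info()
		if err != nil {
			continue
		}
		fmt.Println(name, info.Size(), info.ModTime())
	}
}
```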
- Assuming you have 10M pieces, that is at least 10M additional IO operations. At 200 IOPS, that is 10M / 200 ≈ 50,000 seconds, i.e. roughly 14 hours to complete.
- This particular feature (a save-state-resume GC filewalker) can help it survive restarts, but it won't fully solve the problem.
- We write piece files only once, so it should be possible to cache their size and modification time (rough sketch below).
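A rough sketch of that idea (illustrative only, not existing storagenode code): stat the piece right after it is written, while its inode is still hot, and keep the result so a later walk never has to touch the inode again.

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

type pieceMeta struct {
	Size    int64
	ModTime time.Time
}

type metaCache struct {
	mu sync.Mutex
	m  map[string]pieceMeta
}

// recordAfterWrite stats the freshly written file while its inode is still
// hot in the cache, and remembers size and mtime for later walks.
func (c *metaCache) recordAfterWrite(path string) error {
	fi, err := os.Stat(path)
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.m[path] = pieceMeta{Size: fi.Size(), ModTime: fi.ModTime()}
	c.mu.Unlock()
	return nil
}

func main() {
	cache := &metaCache{m: make(map[string]pieceMeta)}

	path := "example.piece" // hypothetical file name
	if err := os.WriteFile(path, []byte("piece data"), 0o644); err != nil {
		panic(err)
	}
	if err := cache.recordAfterWrite(path); err != nil {
		panic(err)
	}
	fmt.Printf("%s -> %+v\n", path, cache.m[path])
}
```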
- If you have enough memory, the OS can also do the caching. I have experience with a server with 126 GB of memory: because the majority of the RAM is otherwise unused, Linux happily uses 66 GB of it to cache inodes, and the filewalker can finish in 5-20 minutes even with ten million files. (That's the easiest workaround: if you have memory, just put it in the server.)
- @littleskunk reported similar results when using RAM (and SSD) for the ZFS cache.
- But let's say you don't have enough RAM. I am doing some experiments, and I found that even a simple db-based cache can help: 49 minutes of walking for 14M pieces with a hot cache, 15 hours with a cold cache, and 13 hours without any cache.

My personal opinion is that the walkers should be improved (in addition to save-state-resume) and made faster by caching the OS-level size/modification-time metadata, for example along the lines of the sketch below.
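For instance, the walker could consult a persistent cache and only fall back to lstat on a miss. The sketch below uses SQLite via the mattn/go-sqlite3 driver with an ad-hoc schema purely as an illustration; it is not the actual experiment mentioned above, and the paths and table layout are assumptions.

```go
package main

import (
	"database/sql"
	"io/fs"
	"log"
	"path/filepath"

	_ "github.com/mattn/go-sqlite3" // driver choice is an assumption
)

func main() {
	db, err := sql.Open("sqlite3", "piece-meta-cache.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS piece_meta (
		path TEXT PRIMARY KEY, size INTEGER, mod_time INTEGER)`); err != nil {
		log.Fatal(err)
	}

	root := "/path/to/blobs" // hypothetical piece directory
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() {
			return walkErr
		}
		var size, modUnix int64
		row := db.QueryRow(`SELECT size, mod_time FROM piece_meta WHERE path = ?`, path)
		if row.Scan(&size, &modUnix) == nil {
			return nil // cache hit: no inode read at all
		}
		// Cache miss: pay for one lstat, then remember the result for next time.
		info, statErr := d.Info()
		if statErr != nil {
			return statErr
		}
		_, insErr := db.Exec(
			`INSERT OR REPLACE INTO piece_meta (path, size, mod_time) VALUES (?, ?, ?)`,
			path, info.Size(), info.ModTime().Unix())
		return insErr
	})
	if err != nil {
		log.Fatal(err)
	}
}
```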