Tuning the filewalker

The so-called filewalker process, the functionality that goes over every piece in the storage node’s blob folder, reading its metadata and looking for expired pieces, basically hammers spinning drives every time a node is restarted (whether manually or through an automatic upgrade).
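
Conceptually, the traversal looks something like the Go sketch below: recurse through the whole blobs directory and read the metadata of every single piece file, which on a cold cache means at least one random read per piece. This is only my illustration of the access pattern, not the actual storagenode code; `blobsDir` and `isExpired` are placeholders:

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"path/filepath"
	"time"
)

// isExpired is a placeholder for whatever expiration check the real
// filewalker performs; the point is only that it needs per-piece metadata.
func isExpired(info fs.FileInfo) bool {
	return info.ModTime().Before(time.Now().AddDate(-10, 0, 0)) // dummy rule
}

func main() {
	blobsDir := "/mnt/storagenode/storage/blobs" // assumed path, adjust to your setup

	var total, expired int
	err := filepath.WalkDir(blobsDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info() // one metadata (inode) read per piece file
		if err != nil {
			return err
		}
		total++
		if isExpired(info) {
			expired++ // a real implementation would queue this piece for deletion
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("visited %d pieces, %d flagged as expired\n", total, expired)
}
```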

The way Storj pieces are stored (folders with thousands of files constantly being created and deleted) causes issues on ext4 filesystems after some months of operation. It leads not only to file fragmentation but also to directory fragmentation, which makes directory traversal slower and keeps the disk at 100% usage for longer.

On an ext4-formatted CMR drive (a shucked WD) that has only ever held a single Storj node (nothing else), I’m observing 31.75% of directories and 13.25% of files fragmented. When that node (9.75 TB) is restarted, it takes 20-24 hours at 100% disk usage for the filewalker operation to finish.

Some references to the pain this process is causing:

Could the process be done in a lighter way?

  1. If we need to keep track of each piece, I can imagine using another sqlite database instead of reading every inode on the disk; that database could be moved to an SSD if needed for performance (see the sketch after this list).
  2. To verify that all pieces are still present, we already have the audit system.
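
To make point 1 concrete, here is a minimal sketch of what such a piece index could look like, using sqlite through the `github.com/mattn/go-sqlite3` driver. The schema, file path and column names are purely my assumptions, not anything the node uses today:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver choice
)

func main() {
	// The index file could live on an SSD even if the pieces stay on the HDD.
	db, err := sql.Open("sqlite3", "/mnt/ssd/piece-index.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical schema: one row per piece, holding the fields the
	// filewalker currently has to collect by stat-ing every file.
	_, err = db.Exec(`
		CREATE TABLE IF NOT EXISTS pieces (
			satellite_id BLOB NOT NULL,
			piece_id     BLOB NOT NULL,
			size         INTEGER NOT NULL,
			expires_at   TIMESTAMP,
			PRIMARY KEY (satellite_id, piece_id)
		)`)
	if err != nil {
		log.Fatal(err)
	}

	// Used-space accounting then becomes an index scan instead of
	// millions of random reads on the data disk.
	var used int64
	if err := db.QueryRow(`SELECT COALESCE(SUM(size), 0) FROM pieces`).Scan(&used); err != nil {
		log.Fatal(err)
	}
	log.Printf("used space according to the index: %d bytes", used)
}
```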

Or at least, allow us to set some tuning flags to adjust the “intensity” of the process, so it can run for more hours but in a lighter way, without hitting 100% IO and causing higher IO waits.
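
To illustrate what such an “intensity” flag could mean, the walk could be rate limited so the metadata reads are spread out instead of saturating the disk. A rough sketch, assuming a hypothetical ops-per-second setting (not an existing storagenode flag), built on `golang.org/x/time/rate`:

```go
package main

import (
	"context"
	"io/fs"
	"log"
	"path/filepath"

	"golang.org/x/time/rate"
)

func main() {
	blobsDir := "/mnt/storagenode/storage/blobs" // assumed path

	// Hypothetical tuning knob: cap the walk at N piece stats per second
	// so the disk keeps some headroom for uploads and downloads.
	opsPerSecond := 200.0
	limiter := rate.NewLimiter(rate.Limit(opsPerSecond), 1)

	ctx := context.Background()
	err := filepath.WalkDir(blobsDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		// Blocks until the limiter allows another operation, trading a
		// longer total runtime for lower sustained IO pressure.
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if _, err := d.Info(); err != nil { // the per-piece metadata read
			return err
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("throttled walk finished")
}
```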

This would penalise node operators who do not have an SSD available. The storage node code has already been moved away from sqlite for the orders storage simply because it was too slow. Yet I do agree that some form of metadata tracking outside of the file system might be a good idea here.

In general, while I do agree that the filewalker process is annoying and there should be some way to avoid it, the way storage nodes currently operate has the advantage of being simple in terms of code and already battle-tested. Introducing additional complexity in the form of new code will require a lot of work, and as far as I understand, development of customer-facing features has much higher priority. Note that in recent releases there are a lot of commits to the satellite code, but comparatively few to the storage node code. So I wouldn’t hope for a solution to arrive quickly unless maybe someone from the community steps up and prepares some code on their own.


This used to be how it worked, but it caused a lot of issues. Since files are transferred in parallel, you end up with multiple threads trying to update the same sqlite database, which relies on basic file system locking to allow changes to be made. Transfers constantly failed due to the locked database, interrupting node operation and hurting the customer experience.
The file metadata is now stored with the file itself, because it’s really only ever needed to transfer that file. This is by far the fastest approach when it comes to the transfers themselves.

Additionally, if you store the metadata elsewhere, you also run the risk of it getting out of sync. If you want to check for that, you still need to walk all the files to make sure the data is still correct.

I don’t know if there is a great solution.


Revisiting this topic, I’ve noticed the numbers you’re giving. Maybe I missed them before, sorry. I do not observe the file walker taking that much time: here it was ~12 minutes per terabyte (with the cache cleared), which went down to ~8 minutes per terabyte with -I 128, and even less than that without clearing the cache. So, more than an order of magnitude less than yours.

Yet Stob’s observation suggests the file walker process also hits >24h times.

I now wonder whether there are any common qualities between your and @Stob’s nodes that make the file walker take so long.

It would also definitely help if we could have more precise measurements of the time consumed by this process, as opposed to guessing it from IO usage graphs.
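
In the meantime, a crude way to get an actual number instead of reading IO graphs is to time a full metadata pass over the blobs directory yourself (dropping caches first for the worst case). This only approximates the filewalker’s IO pattern, it is not the node’s own code, and the path is an assumption:

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"path/filepath"
	"time"
)

func main() {
	blobsDir := "/mnt/storagenode/storage/blobs" // assumed path, adjust to your node

	start := time.Now()
	var files, bytes int64

	err := filepath.WalkDir(blobsDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info() // the same per-file metadata read the filewalker does
		if err != nil {
			return err
		}
		files++
		bytes += info.Size()
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}

	elapsed := time.Since(start)
	tb := float64(bytes) / 1e12
	fmt.Printf("%d files, %.2f TB, took %s", files, tb, elapsed.Round(time.Second))
	if tb > 0 {
		fmt.Printf(" (%.1f minutes per TB)", elapsed.Minutes()/tb)
	}
	fmt.Println()
}
```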
