Tuning the filewalker

The so-called filewalker process, the functionality that goes over all pieces in the storage node’s blob folder reading their metadata and looking for expired pieces, is basically smashing the spinning drives every time a node is restarted (on both manual restarts and automatic upgrades).

The way Storj stores files (folders with thousands of files that are constantly created and deleted) causes some issues on ext4 filesystems after some months of operation. It causes not only file fragmentation but also directory fragmentation, which leads to slower directory traversal operations and longer periods with the disk at 100% usage.

On an ext4-formatted CMR drive (a shucked WD) which has only ever held one Storj node (nothing else), I’m observing 31.75% of directories fragmented and 13.25% of files fragmented. When that node (9.75 TB) is restarted, it takes 20-24 hours at 100% disk usage to finish the filewalker operation.

Some references to the pain this process is causing:

Can the process be done in another, lighter way?

  1. If we need to keep track of each file piece, I can imagine using another sqlite database instead of reading all the inodes on the disk; it could be moved to an SSD if needed for performance.
  2. To track that all pieces are present, we have the audit system.
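To make idea 1 concrete, here is a minimal sqlite sketch (hypothetical schema and values, not the storagenode’s actual layout): per-piece metadata lives in a single database file that could sit on an SSD, so the used-space total and the expired-piece list become indexed queries instead of a walk over every inode.

```python
import sqlite3

# Hypothetical schema -- just a sketch of idea 1 above.
con = sqlite3.connect(":memory:")  # in practice: a file on fast storage
con.execute("""
    CREATE TABLE pieces (
        piece_id   TEXT PRIMARY KEY,
        satellite  TEXT NOT NULL,
        size       INTEGER NOT NULL,
        expires_at INTEGER           -- unix time; NULL = no TTL
    )
""")
con.execute("CREATE INDEX idx_expiry ON pieces(expires_at)")

# two made-up pieces
con.execute("INSERT INTO pieces VALUES (?, ?, ?, ?)",
            ("abc123", "us1", 2319872, 1700000000))
con.execute("INSERT INTO pieces VALUES (?, ?, ?, ?)",
            ("def456", "us1", 181920, None))

# used space without touching the blob directories at all
(used,) = con.execute("SELECT COALESCE(SUM(size), 0) FROM pieces").fetchone()

# expired pieces become one indexed query instead of a full directory walk
now = 1800000000
expired = con.execute(
    "SELECT piece_id FROM pieces"
    " WHERE expires_at IS NOT NULL AND expires_at <= ?", (now,)).fetchall()
print(used, expired)  # 2501792 [('abc123',)]
```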

Or at least, allow us to set some tuning flags to adjust the “intensity” of the process, so we can run it for more hours but in a lighter way, without hitting 100% IO and causing high IO waits.

This would penalise node operators who do not have an SSD available. The storage node code has already been moved away from sqlite for the orders storage simply because it was too slow. Yet I do agree that some form of metadata tracking outside of the file system might be a good idea here.

In general, while I do agree that the file walker process is annoying and there should be some way to avoid it, the way storage nodes currently operate has the virtue of being simple in terms of code and already battle-tested. Introducing additional complexity in the form of new code will require a lot of work, and as far as I understand, development of customer-facing features has a lot more priority. Note that in recent releases there are a lot of commits in the satellite code, but a comparatively small number in the storage node code. So I wouldn’t hope for a solution to arrive quickly, unless maybe someone from the community steps up and prepares some code on their own.

2 Likes

Oh, one more thing. I have a hypothesis that getting rid of the two-level directory structure («first two letters of PieceID»/«the rest of PieceID».sj1) may speed up the file walker considerably. I have some preliminary measurements, but I need to test this more carefully. It seems that ext4 may be better at handling a single 500k-file directory than hundreds of directories of hundreds of files each, probably because of the H-tree optimisation, which is only triggered when a single directory is large enough.
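The shape of such a comparison could look like the sketch below (my own illustration, not the preliminary measurements mentioned above): build the same set of stand-in pieces in both layouts and time a full walk over each. Note that a meaningful result needs cold caches on a real ext4 disk; on a warm cache this only demonstrates the mechanics.

```python
import os
import tempfile
import time

def populate(root, rel_paths):
    # create empty stand-in piece files under root
    for rel in rel_paths:
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()

def timed_walk(root):
    # count files the way a used-space walk would, and time it
    start = time.perf_counter()
    count = sum(len(files) for _, _, files in os.walk(root))
    return count, time.perf_counter() - start

# hypothetical IDs; real PieceIDs are much longer base32 strings
ids = [f"{i:06x}" for i in range(5000)]

with tempfile.TemporaryDirectory() as tmp:
    two_level = os.path.join(tmp, "two-level")
    flat = os.path.join(tmp, "flat")
    populate(two_level, [f"{i[:2]}/{i[2:]}.sj1" for i in ids])
    populate(flat, [f"{i}.sj1" for i in ids])
    for name, root in (("two-level", two_level), ("flat", flat)):
        count, secs = timed_walk(root)
        print(f"{name}: {count} files in {secs:.4f}s")
```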

3 Likes

It penalises SNOs with long-running nodes just the same, since they suffer from this issue :sweat_smile: But I get your point about that.

Also, running the filewalker right now is consuming almost all IOPS of the drive, so if you are running the databases there, the chances of corruption are also higher.

Exactly! Maybe even a tracking file per two-letter directory, or even per satellite. I can imagine, for example, a single “used-space” file per two-letter directory and a modified file walker that goes to a random one every few minutes to re-check, if needed.

That would maintain the check without smashing the drives.
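As a rough sketch of that idea (hypothetical file names and layout, nothing like this exists in the storagenode): each two-letter prefix directory carries a small “used-space” cache file, the reported total is just the sum of those caches, and a light walker refreshes one random prefix per tick.

```python
import os
import random

BLOB_EXT = ".sj1"

def rescan_prefix(blobs_root, prefix):
    """Re-walk ONE two-letter directory and rewrite its cached total."""
    pdir = os.path.join(blobs_root, prefix)
    total = sum(e.stat().st_size for e in os.scandir(pdir)
                if e.is_file() and e.name.endswith(BLOB_EXT))
    with open(os.path.join(pdir, "used-space"), "w") as f:
        f.write(str(total))
    return total

def cached_used_space(blobs_root):
    """Sum the per-prefix cache files; no piece file is ever stat()ed."""
    total = 0
    for prefix in os.listdir(blobs_root):
        cache = os.path.join(blobs_root, prefix, "used-space")
        if os.path.isfile(cache):
            with open(cache) as f:
                total += int(f.read())
    return total

def walker_tick(blobs_root):
    """One tick of the 'light' walker: refresh a single random prefix;
    the caller sleeps a few minutes between ticks."""
    rescan_prefix(blobs_root, random.choice(os.listdir(blobs_root)))
```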

And that is 100% understandable, as the business operates thanks to DSNOs but the $$ comes from the clients. More clients is better for both Storj and SNOs.

i wonder if the storagenode could clean up the fragmentation during usage…
like say if data is being read for egress, then it could rewrite it sequentially if the data wasn’t sequential to begin with…

ofc this would introduce even more workload and drive wear.

1 Like

This used to be how it worked, but it caused a lot of issues. Since files are transferred in parallel, you will have multiple threads trying to update the same sqlite database, which uses basic file system locking to allow for changes to be made. Transfers constantly failed due to the locked database, interrupting node operation and customer experience.
The file metadata is now stored with the file, because it’s really only ever needed to transfer that file. This is by far the fastest way to do it when it comes to the transfers themselves.

Additionally, if you store the metadata elsewhere, you also run the risk of it getting out of sync. If you want to check for that, you still need to walk all files to make sure the data is still correct.

I don’t know if there is a great solution

1 Like

As pointed out above, we could for example have a lighter file walker that checks the files in a less heavy way: check a two-letter folder, wait a bit before the next one… something that doesn’t hit our IOPS so hard.
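The pacing idea is simple to sketch (my own illustration, assuming the two-letter prefix layout): walk one prefix, sleep, walk the next, so the disk gets idle windows instead of sustained 100% utilisation.

```python
import os
import time

def gentle_walk(blobs_root, pause_seconds=120.0):
    """Scan one two-letter prefix directory at a time, backing off
    between prefixes; slower overall, but never saturates the disk."""
    totals = {}
    for prefix in sorted(os.listdir(blobs_root)):
        pdir = os.path.join(blobs_root, prefix)
        if not os.path.isdir(pdir):
            continue
        totals[prefix] = sum(e.stat().st_size
                             for e in os.scandir(pdir) if e.is_file())
        time.sleep(pause_seconds)  # the "wait a bit" between folders
    return totals
```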

As for the multithreading, I’m quite sure there are solutions for that, like spawning a separate process just in charge of queueing pending updates and writing them synchronously when needed.
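A single-writer queue like that might look as follows (a sketch with made-up names; the in-memory total stands in for whatever persistent store would be used): transfer threads only enqueue deltas, and one dedicated thread applies them in order, so there is never more than one writer contending for the lock.

```python
import queue
import threading

class UsageWriter:
    """Serialises usage updates: many producers, exactly one writer."""

    def __init__(self):
        self.updates = queue.Queue()
        self.total = 0  # stand-in for the real persistent store
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def record(self, delta_bytes):
        # called from transfer threads; returns immediately
        self.updates.put(delta_bytes)

    def _drain(self):
        # the only thread that ever writes -- no lock contention
        while True:
            delta = self.updates.get()
            if delta is None:
                break
            self.total += delta

    def close(self):
        self.updates.put(None)
        self.worker.join()
```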

I just disabled the file walker. Problem solved. If you try the same, the next expensive process will be GC, and I can’t disable that one. Instead I am now rejecting small file uploads. I’d rather not store these tiny files in the first place, and hopefully I can clean up my hard drive over time.

There is one other thing I would like to try. I have an 8 TB SSD available and want to use it as a read cache for ZFS. I haven’t found a config that would tell ZFS to keep all the metadata in the read cache, but maybe ZFS is smart enough to do that on its own, just based on the behavior of the filewalker.

So is it advisable in general, from the Storjling side, for all SNOs to do so? :thinking: Sorry for the stupid question, I’m a bit lost. And of course I don’t want to risk the functionality of my node.

More like advice from one SNO to another SNO. As a Storjling I would not recommend it, but as an SNO I had no other choice for the moment.

The only issue I might see on my node is that the used space numbers might get incorrect. Long term, I just want to run the file walker, let’s say, once per quarter or so. That should allow me to keep my used space value reasonably accurate without having to wait for it every time.

The used space value is cached anyway and updated on every upload and delete. I haven’t noticed any inaccurate values yet, so no negative side effects from disabling the file walker so far.

Rejecting small pieces is a different story. If we all did that, we would screw over some customers. Maybe not the best idea. How about you store the small pieces that I am rejecting?

At the moment it looks like the current customers are uploading mainly big files. I am willing to tolerate that situation for now. There are some ideas in the pipeline for how the problem could be resolved. Let’s say we get a customer that uploads only small files for some reason and the company ignores the pain these tiny files cause us. In that case I would be willing to team up so we all reject the tiny pieces to force a change. At the moment it is too early for such a move.

2 Likes

Haha. Ugh. No. :sweat_smile: :sweat_smile::sweat_smile:

How can this be done?

In the code here: storj/cache.go at bd36a41a9ebe4d855ad654daf3856728e2796d98 · storj/storj · GitHub

1 Like

I really like the idea of the filewalker only being launched every month or so (maybe even once per week, but only one satellite each time!).

Am I correct in thinking that if the storage node shuts down gracefully (receiving a signal, like when you stop a docker container), it will safely store the latest used-space information? If so, I can live with that, maybe launching it manually once per month or so.

I’m also thinking about that: adding a ZFS pool per node (to keep the one node-one disk) and adding a metadata special vdev to speed up this kind of operations.

Is there any chance of adding a config flag for that?

I’d even try to add it myself and submit a pull request, but due to the fact that you should know what you are doing before disabling it I’m not sure if Storj team would like such flag on official releases.

Thanks a lot for your insights on the code. It is really appreciated.

I was thinking of doing something like that, except maybe less brutal: instead of rejecting small files outright, do not accept them as quickly, by putting a conditional sleep() somewhere in the code. Just let other nodes outrun mine for these files.
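The conditional sleep() could be as small as this (hypothetical threshold and delay, not existing storagenode behaviour): compute an extra delay for pieces under some size, so faster nodes win the long-tail race for those uploads instead of us rejecting them outright.

```python
import time

SMALL_PIECE_BYTES = 16 * 1024   # assumed "tiny piece" threshold
SMALL_PIECE_DELAY = 0.2         # seconds of extra latency to add

def upload_delay(piece_size: int) -> float:
    """Extra delay to apply before accepting an upload of piece_size bytes."""
    return SMALL_PIECE_DELAY if piece_size < SMALL_PIECE_BYTES else 0.0

def handle_upload(piece_size: int) -> None:
    # let other nodes outrun this one on tiny pieces
    time.sleep(upload_delay(piece_size))
    # ... accept the piece as usual ...
```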

Though, here the file walker process doesn’t bother me much, this would rather be a potential solution in case I’d start running out of inodes.

the ARC will learn what is most advantageous to keep, and this will then be evicted to the L2ARC as the memory fills.
to make the L2ARC store only metadata, the ARC would need to store only metadata…
which isn’t advantageous, since databases and other repetitive workloads would then not be cached.

apparently the max metadata % is 75% of the ARC, i thought it was much lower…

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-module-parameters

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-arc-meta-limit-percent

so basically you just need to add the L2ARC to get the effect, however there might be other parameters that need changing if it isn’t allowed to evict metadata in favour of other, more useful data.

also do keep in mind, it will keep writing and writing and then writing some more… so you will need a certain endurance level on the SSD.
for a single node, you should expect something like 200 KB/s to 500 KB/s of sustained writes for many months, maybe forever… unless you change other parameters.

so at 500 KB/s that is about 1.3 TB of write wear per month per node.
sure, it might only be half that, but you also don’t want to go below 50% wear on the SSD too quickly; some of the lower-end drives are only rated for something like 150 TBW…
so it should be fine in most cases, but just be very aware that it can wear down a drive, depending on the workload.
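For reference, the wear arithmetic above checks out; a one-liner makes it easy to try other rates:

```python
def tb_written_per_month(kb_per_s: float, days: float = 30) -> float:
    """Convert a sustained write rate in KB/s to terabytes written per month."""
    return kb_per_s * 1_000 * 86_400 * days / 1e12

print(round(tb_written_per_month(500), 2))  # -> 1.3  (TB/month at 500 KB/s)
print(round(tb_written_per_month(200), 2))  # -> 0.52 (TB/month at 200 KB/s)
```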

1 Like

I have submitted a pull request to make it configurable here.

Not sure if it will be something that Storj wants to be configurable, but just in case :sweat_smile:

6 Likes

Looks good to me. Just one small comment on the default value.

One question for the group. Is the name of the config flag looking good or has anyone a better proposal? (I don’t copy the name because my evil plan is to get you all in touch with the small little code change for more contributions in the future)

3 Likes

I was torn between storage2.initial-check and storage2.initial-piece-scan (which may be clearer).

Of course I’m open to suggestions, as the better the naming, the better the feature :rocket:

Also, looks like tests are breaking, will check now.

Given that you ask… (a sure way to start bikeshedding :P) When naming, I’d probably focus on the purpose of the scan, not its timing. So if this process is performed to refresh disk usage statistics, I’d probably think of something like UsedSpaceScan.