Following my previous experiments, I’ve tested whether it makes sense to have a two-level directory structure in the storage directory. Currently the node stores each blob under a path that looks like `pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/fz/67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq.sj1`. This means that a satellite whose encoded ID is `pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa` requested to store a piece whose encoded ID is `fz67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq`. Note how the first two characters of the piece ID are used as a subdirectory name. This is a long-standing practice aimed at avoiding directories with too many files: by splitting a collection across many subdirectories, no single directory ends up holding a huge number of files.
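To make the layout concrete, here is a minimal Go sketch of how such a path could be composed from the two encoded IDs. The function name and the way the IDs are passed in are my own invention for illustration, not the actual storage node code.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// twoLevelBlobPath joins the encoded satellite ID, the first two
// characters of the encoded piece ID (as a subdirectory), and the rest
// of the piece ID plus the .sj1 extension, relative to the blobs
// directory.
func twoLevelBlobPath(satelliteID, pieceID string) string {
	return filepath.Join(satelliteID, pieceID[:2], pieceID[2:]+".sj1")
}

func main() {
	fmt.Println(twoLevelBlobPath(
		"pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa",
		"fz67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq",
	))
	// Prints:
	// pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/fz/67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq.sj1
}
```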
This practice was invented for two reasons. First, some file systems put an explicit limit on the number of files in a single directory. For example, FAT32 limits the number of files in a single directory to 65535, which is definitely not enough for a storage node that may manage millions of files. Second, some file systems, even if they allow a large number of files in a single directory, become slow when this happens, e.g. because finding a file in a directory requires a linear scan of the directory contents.
However, all file systems that can be considered modern implement efficient data structures for managing large directories, usually in the form of a B-tree. This means that neither of the problems described above should occur on them. Such file systems include ext4, btrfs, zfs and ntfs, which basically cover all popular choices for running a storage node. If so, what would happen if a storage node could reasonably expect the file system to have these properties and did not emulate a tree on its own?
I’ve collected some statistics using the setup I described previously. The baseline for this experiment is the no-sync version of the code. The only change in the “flat” version of the code is that there is no two-letter subdirectory to store blobs, i.e. the example path above becomes just `pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/fz67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq.sj1`.
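Continuing the sketch above (same caveat: illustrative names, not the actual storage node code), the flat variant simply drops the two-letter subdirectory:

```go
// flatBlobPath keeps the per-satellite directory but uses the full
// encoded piece ID as the file name, with no prefix subdirectory.
func flatBlobPath(satelliteID, pieceID string) string {
	return filepath.Join(satelliteID, pieceID+".sj1")
}
```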
Total time to run the log:
| | ext4 | btrfs |
| --- | --- | --- |
| baseline | mean=34981 seconds, sd=874 seconds | mean=34490 seconds, sd=195 seconds |
| flat | mean=25438 seconds, sd=985 seconds | mean=25379 seconds, sd=478 seconds |
We see a speedup of roughly 35–40% from the flat directory structure on both ext4 and btrfs (34981 s → 25438 s and 34490 s → 25379 s respectively), at the cost of a slightly larger variation in results. However, the second set of numbers might be more interesting.
Time to run a directory scan reproducing the file walker process:
| | ext4 | btrfs |
| --- | --- | --- |
| baseline | mean=204 seconds, sd=10 seconds | mean=1470 seconds, sd=35 seconds |
| flat | mean=228 seconds, sd=18 seconds | mean=769 seconds, sd=10 seconds |
This is curious! Ext4 seems to slightly prefer what the storage node currently does (though the difference is not that big), but btrfs would really prefer the flat structure: the scan there is almost twice as fast. As such, the choice to have subdirectories may have an impact on the quality (raw performance? latency?) of the node’s operation.
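As a rough illustration of what such a scan does: recursively walk a satellite directory and stat every blob, which is more or less the work the file walker has to do. The sketch below is simplified and under that assumption only; it is not the actual file walker code, and the hardcoded directory name is just a placeholder.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	var files int
	var total int64
	// Walk the whole directory tree and stat every regular file,
	// roughly what a used-space file walker has to do.
	err := filepath.WalkDir("pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa",
		func(path string, d fs.DirEntry, err error) error {
			if err != nil {
				return err
			}
			if d.IsDir() {
				return nil
			}
			info, err := d.Info() // typically costs an extra stat per file
			if err != nil {
				return err
			}
			files++
			total += info.Size()
			return nil
		})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d files, %d bytes\n", files, total)
}
```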
The directory scan was performed while no other I/O activity was going on, including the handling of normal requests. I do not have an easy way to build a more faithful testing scenario, but I suspect that with regular activity the ext4 numbers would be similar for both the baseline and the flat scenario, as regular operation would simply have a smaller impact on the file walker process.
Again, the results may be different for zfs and ntfs, and given that ext4 and btrfs already differ a lot, it’s impossible to generalize. Taking all of the above into account, I wish there were a switch in the storage node code to select either the two-level directory structure or a flat one, so that storage node operators could decide on their own how to organize their data. Alternatively, I’d probably prefer hardcoding the simpler, flat structure instead of the current two-level one, as by the numbers presented it seems safer for btrfs while not impacting ext4 that much. Either of these options would, however, require writing code to safely migrate data from the current structure, which is probably not trivial. As I’ve collected these statistics anyway, I’ve decided to share them in the hope that they will be useful at some point.
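To illustrate why the migration is the risky part, here is a naive sketch of what it could look like for a single satellite directory. This is not real storage node code; it assumes the node is offline and ignores everything that makes such a migration actually hard: crash safety, partially migrated state, rollback, and pieces arriving while it runs.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// migrateSatelliteDir moves every blob from the two-level layout
// (<satellite>/<xx>/<rest>.sj1) to the flat layout
// (<satellite>/<xx><rest>.sj1) and removes the emptied prefix
// directories.
func migrateSatelliteDir(satelliteDir string) error {
	subdirs, err := os.ReadDir(satelliteDir)
	if err != nil {
		return err
	}
	for _, sub := range subdirs {
		if !sub.IsDir() || len(sub.Name()) != 2 {
			continue // not a two-letter prefix directory
		}
		prefix := sub.Name()
		subPath := filepath.Join(satelliteDir, prefix)
		blobs, err := os.ReadDir(subPath)
		if err != nil {
			return err
		}
		for _, blob := range blobs {
			oldPath := filepath.Join(subPath, blob.Name())
			newPath := filepath.Join(satelliteDir, prefix+blob.Name())
			if err := os.Rename(oldPath, newPath); err != nil {
				return err
			}
		}
		// The prefix directory should be empty now.
		if err := os.Remove(subPath); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: migrate <satellite-dir>")
		os.Exit(1)
	}
	if err := migrateSatelliteDir(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```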