On the impact of two-level directory structure for storing blobs

Following my previous experiments, I’ve tested whether it makes sense to have a two-level directory structure in the storage directory. Currently the node will store each blob under a path that looks like: pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/fz/67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq.sj1. This means that a satellite whose encoded ID is pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa requested to store a piece whose encoded ID is fz67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq. Note how the first two characters of the piece ID are used as a subdirectory name. This is a long-used practice aiming to avoid directories with too many files: by splitting a collection across many subdirectories, no single directory will hold a large number of files.

This practice was invented for two reasons. First, some file systems put an explicit limit on the number of files in a single directory. For example, FAT32 limits a single directory to 65,534 entries, which is definitely not enough for a storage node, which may manage millions of files. Second, some file systems, even if they allow a large number of files in a single directory, become slow when this happens, e.g. because finding a file in a directory requires a linear scan of the directory contents.

However, all file systems commonly considered modern implement efficient data structures for managing large directories, usually in the form of a B-tree. This means that neither of the problems described above should occur there. ext4, btrfs, zfs, and ntfs all qualify — these basically cover all popular choices for running a storage node. If so, what would happen if a storage node could reasonably expect the file system to have these properties and did not emulate a tree on its own?

I’ve collected some statistics using the setup I’ve described previously. The baseline for this experiment is the no-sync version of the code. The only change in the “flat” version is that there is no two-letter subdirectory level for blobs, so the example path above becomes just pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/fz67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq.sj1.
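For clarity, the two layouts differ only in how the piece ID is mapped to a file name. Here is a minimal Python sketch of that mapping — illustrative only, not the actual storage node code (which is written in Go); the function name and `flat` parameter are my own invention:

```python
import os

def blob_path(storage_dir, satellite_id, piece_id, flat=False):
    """Build the on-disk path for a blob from already-encoded IDs.

    Two-level layout: the first two characters of the piece ID become
    a subdirectory, the rest becomes the file name.
    Flat layout: the whole piece ID is the file name.
    """
    if flat:
        return os.path.join(storage_dir, satellite_id, piece_id + ".sj1")
    prefix, rest = piece_id[:2], piece_id[2:]
    return os.path.join(storage_dir, satellite_id, prefix, rest + ".sj1")
```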

Total time to run the log:

|          | ext4                               | btrfs                              |
|----------|------------------------------------|------------------------------------|
| baseline | mean=34981 seconds, sd=874 seconds | mean=34490 seconds, sd=195 seconds |
| flat     | mean=25438 seconds, sd=985 seconds | mean=25379 seconds, sd=478 seconds |

We see roughly a 27% reduction in total runtime (about a 1.4× speedup) from the flat directory structure on both ext4 and btrfs, at the cost of slightly larger variation in results. However, the second set of numbers might be more interesting.

Time to run a directory scan reproducing the file walker process:

|          | ext4                            | btrfs                             |
|----------|---------------------------------|-----------------------------------|
| baseline | mean=204 seconds, sd=10 seconds | mean=1470 seconds, sd=35 seconds  |
| flat     | mean=228 seconds, sd=18 seconds | mean=769 seconds, sd=10 seconds   |

This is curious! ext4 seems comfortable with what a storage node is doing today (though the difference is not that big), but btrfs would really prefer a flat structure: its scan time roughly halves. As such, the choice to have subdirectories may have an impact on the quality (raw performance? latency?) of the node’s operation.

The directory scan was performed while no other I/O activity was going on, including the handling of normal requests. I do not have easy means to set up a more faithful testing scenario, but I suspect that under regular activity the ext4 numbers would be similar for both the baseline and flat scenarios, with the flat layout benefiting from regular operation having a smaller impact on the file walker process.

Again, the results may be different for zfs and ntfs, and given that ext4 and btrfs already differ a lot, it’s impossible to generalize. Taking all of the above into account, I wish there were a switch in the storage node code to select either the two-level directory structure or the flat structure, so that the storage node operator could decide on their own how to organize data. Alternatively, I’d probably prefer hardcoding the simpler, flat structure over the current two-level one, as by the numbers presented it seems safer for btrfs while not impacting ext4 that much. Yet either option would require writing code to safely migrate data from the current structure, which is probably not trivial. As I’ve collected these statistics anyway, I’ve decided to share them in the hope that they will be useful at some point.


Wow, big work! How did you test it? Any script or some software?

Please refer to the post I linked, I’ve described my setup there.

The current layout helps rsync the node faster (when copying it to another disk), because you can run rsync processes in parallel, splitting the two-letter directories among them.

Without the two-letter directories, it would be possible to generate a list of files and split it into however many parts to run in parallel, but generating that list (even just the directory listing) would take forever.


Yeah, I understand this point of view. Though, it’s more a problem with rsync than with the directory structure. Supposedly rclone can do the same, but in parallel without having to explicitly run parallel tasks… will have to try it at some point.

I am sad to announce the death of the 1TB WD Scorpio Blue drive which participated in my tests. Constructed on March 14, 2012, shortly after the crisis of Thailand floods which made the HDD market very difficult, over its long life it honorably served as a laptop drive holding the most important OS files, an external drive taking care of entertainment of its owner, and, in its final months, made the best effort to provide accurate results of I/O benchmarks. Unfortunately, motor failure has prematurely interrupted this important work. Its sacrifice gave us invaluable data that will hopefully improve the performance of the Storj network. The drive is survived by many of its neighbors working hard on providing the best storage service to the whole world.


Sending condolences to you and the neighboring drives.
We thank it for its service :saluting_face: