On the impact of two-level directory structure for storing blobs

Following my previous experiments, I’ve tested whether it makes sense to have a two-level directory structure in the storage directory. Currently the node stores each blob under a path that looks like: pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/fz/67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq.sj1. This means that a satellite whose encoded ID is pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa requested to store a piece whose encoded ID is fz67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq. Note how the first two characters of the piece ID are used as a subdirectory name. This is a long-used practice aiming to avoid directories with too many files: by splitting a collection across many subdirectories, no single directory holds a large number of files.

This practice was established for two reasons. The first is that some file systems put an explicit limit on the number of files in a single directory. For example, FAT32 limits a single directory to 65535 entries, which is definitely not enough for a storage node, which may manage millions of files. The second is that on some file systems, even if a large number of files in a single directory is allowed, operations become slow when it happens, e.g. because finding a file in a directory requires a linear scan of the directory contents.

However, all file systems considered modern implement efficient data structures for managing large directories, usually in the form of a B-tree. This means that neither of the problems described above should occur on them. Among these file systems are ext4, btrfs, zfs, and ntfs, which basically covers all popular choices for running a storage node. If so, what would happen if a storage node could reasonably expect the file system to have these properties and did not emulate a tree on its own?

I’ve collected some statistics using the setup I’ve described previously. The baseline for this experiment is the no-sync version of the code. The only change in the “flat” version of the code is that there is no two-letter subdirectory for blobs, i.e. the example path above becomes just pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/fz67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq.sj1.
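To make the two layouts concrete, here is a minimal Go sketch of how the two path schemes can be derived (the function names are mine; this is not the actual storage node code):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// twoLevelPath mimics the current layout: the first two characters of
// the encoded piece ID become a subdirectory name.
func twoLevelPath(satelliteID, pieceID string) string {
	return filepath.Join(satelliteID, pieceID[:2], pieceID[2:]+".sj1")
}

// flatPath mimics the flat variant: all blobs of a satellite live
// directly in the satellite's directory.
func flatPath(satelliteID, pieceID string) string {
	return filepath.Join(satelliteID, pieceID+".sj1")
}

func main() {
	sat := "pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa"
	piece := "fz67lt3nidku7dhqmok4wkr77j5evty4l7lusy3jwd5iet3xjpgq"
	fmt.Println(twoLevelPath(sat, piece)) // ...aaaa/fz/67lt...jpgq.sj1
	fmt.Println(flatPath(sat, piece))     // ...aaaa/fz67lt...jpgq.sj1
}
```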

Total time to run the log:

|          | ext4                               | btrfs                              |
|----------|------------------------------------|------------------------------------|
| baseline | mean=34981 seconds, sd=874 seconds | mean=34490 seconds, sd=195 seconds |
| flat     | mean=25438 seconds, sd=985 seconds | mean=25379 seconds, sd=478 seconds |

We see a speedup of roughly 37% (comparing mean run times) from the flat directory structure on both ext4 and btrfs, at the cost of slightly higher variance in the results. However, the second set of numbers might be more interesting.

Time to run a directory scan reproducing the file walker process:

|          | ext4                             | btrfs                             |
|----------|----------------------------------|-----------------------------------|
| baseline | mean=204 seconds, sd=10 seconds  | mean=1470 seconds, sd=35 seconds  |
| flat     | mean=228 seconds, sd=18 seconds  | mean=769 seconds, sd=10 seconds   |

This is curious! For the scan, ext4 seems to slightly prefer what the storage node currently does (though the difference is not that big), but btrfs would really prefer a flat structure! As such, the choice to use subdirectories may have an impact on the quality (raw performance? latency?) of the node’s operation.

The directory scan was performed with no other I/O activity in progress, including the handling of normal requests. I do not have an easy way to set up a more faithful testing scenario, but I suspect that under regular activity the numbers for ext4 would be similar for both the baseline and the flat scenario, the latter taking advantage of the fact that regular operation would have a smaller impact on the file walker process.
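For reference, the scan itself is nothing fancy. A minimal Go sketch of an equivalent walk (not the exact code I used, but the same idea) could look like this:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// Walk the whole blobs directory passed as the first argument,
// stat-ing every file, roughly mimicking the file walker process.
func main() {
	start := time.Now()
	var files int
	err := filepath.WalkDir(os.Args[1], func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			if _, err := d.Info(); err != nil { // stat, like the file walker does
				return err
			}
			files++
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d files in %v\n", files, time.Since(start))
}
```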

Again, the results may be different for zfs and ntfs, and given that ext4 and btrfs already differ a lot, it’s impossible to generalize. Taking all of the above into account, I wish there were a switch in the storage node code to select either the two-level directory structure or a flat one, so that storage node operators can decide on their own how to organize data. Alternatively, I’d probably prefer hardcoding the simpler, flat structure over the current two-level one: by the numbers presented it seems safer for btrfs while not impacting ext4 that much. Yet either of these options would require writing code to safely migrate data from the current structure, which is probably not trivial. As I’ve collected these statistics anyway, I’ve decided to share them in the hope that at some point they will be useful.


Wow, big work! How did you test it, any script or some software?

Please refer to the post I linked; I’ve described my setup there.

The current layout helps rsync the node faster (to copy it to another disk), because you can run several rsync processes in parallel, splitting the two-letter directories among them.

Without the two-letter directories, it would still be possible to generate a list of files and split it into however many parts to run in parallel, but generating that list (just listing the directory) would take forever.
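To illustrate, here is a rough Go sketch of such a parallel copy over the two-letter directories (the paths and the rsync invocation are illustrative only):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"sync"
)

// Run one rsync per two-letter subdirectory, a few at a time.
func main() {
	src, dst := os.Args[1], os.Args[2] // a satellite directory and its copy
	subdirs, err := filepath.Glob(filepath.Join(src, "??"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	sem := make(chan struct{}, 4) // at most 4 rsync processes in parallel
	var wg sync.WaitGroup
	for _, dir := range subdirs {
		wg.Add(1)
		go func(dir string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			cmd := exec.Command("rsync", "-a", dir, dst)
			cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
			if err := cmd.Run(); err != nil {
				fmt.Fprintln(os.Stderr, dir, err)
			}
		}(dir)
	}
	wg.Wait()
}
```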


Yeah, I understand that point of view, though it’s more a problem with rsync than with the directory structure. Supposedly rclone can do the same, but in parallel, without having to explicitly run parallel tasks… I will have to try it at some point.

I am sad to announce the death of the 1TB WD Scorpio Blue drive that participated in my tests. Constructed on March 14, 2012, shortly after the Thailand flood crisis that made the HDD market very difficult, over its long life it honorably served as a laptop drive holding the most important OS files, as an external drive taking care of its owner’s entertainment, and, in its final months, it made its best effort to provide accurate results for I/O benchmarks. Unfortunately, a motor failure has prematurely interrupted this important work. Its sacrifice gave us invaluable data that will hopefully improve the performance of the Storj network. The drive is survived by its many neighbors, which keep working hard to provide the best storage service to the whole world.


Sending condolences to you and the neighboring drives.
We thank it for its service :saluting_face:


Related to many recent questions concerning the file walker, I wondered whether it would make a difference if we tried to flatten the directory structure a little bit; also after reading Benchmark: Deep directory structure vs. flat directory structure to store millions of files on ext4 | by hartator | Medium.

In the blob folder, the first-level descendants are the satellite folders, the second-level descendants are [a-z0-9][a-z0-9] directories (so at most 36² = 1296 folders), and the third-level descendants are the actual blobs, up to some thousands per folder.

Totally flattening will probably not be possible, since you can have a maximum of about 10 million files per folder on ext4 (about 4.2 billion on NTFS), limiting the amount of data per satellite to 1.2–1.5 TB.

So:

  1. Is there a reason why this directory structure was chosen?
  2. Would it matter whether the second-level directories use one or two characters?
  3. Would there be a difference in the speed of the file walker? For ext4 there probably is, but would there also be one on NTFS? Unfortunately I don’t have a disk lying around to benchmark with atm.

If you enable the large_dir ext4 feature, the limit is in billions, so no longer relevant.
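(For the record, large_dir can also be enabled on an existing file system with tune2fs, e.g. tune2fs -O large_dir on the relevant device, given a reasonably recent kernel and e2fsprogs.)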

For your first question, we’d have to ask the Storj developers, but it’s certainly a common practice in software development to use a multilevel directory structure when storing a lot of files.

For your second question: for ext4, at least in theory, it should land somewhere between a flat structure and the current one.

For your third question, see my benchmarks. I don’t do Windows, so no NTFS there though.

Apparently you did ask the same question, although the search engine didn’t find your topic on ‘flat deep directory’.

But indeed:

I would say, an option specifying how many characters of the blob name are converted into a directory name; say, it should be 0 on ext4 and btrfs, but at least 3 on file systems like FAT32 (if any are in use). Given the kinds of trees used in XFS and NTFS (actually the same B+tree), I would expect them to be faster with a flat directory structure as well.
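As a sketch of what such an option could look like (the names are made up; this is not actual storage node code):

```go
package blobstore

import "path/filepath"

// blobPath sketches the proposed option: the first prefixLen characters
// of the encoded piece ID become a subdirectory name (0 = flat layout,
// 2 = the current layout, 3+ for file systems needing more fanout).
func blobPath(blobsDir, satelliteID, pieceID string, prefixLen int) string {
	if prefixLen == 0 {
		return filepath.Join(blobsDir, satelliteID, pieceID+".sj1")
	}
	return filepath.Join(blobsDir, satelliteID,
		pieceID[:prefixLen], pieceID[prefixLen:]+".sj1")
}
```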

Great addition.

The generic answer is: we adapted the storage structure to be able to work on almost any known local file system. The hierarchical structure usually does not hit any limitations (or at least most file systems do not have an issue with this number or size of pieces…).
I personally think that this PoC could be useful: