@elek
Hello.
I didn’t know that Ext4 directories contain ONLY file names. Is this really the case, and can any other metadata (beyond names) for files stored on nodes with Ext3/Ext4 partitions really ONLY be obtained by reading inodes?
If so, it explains the current implementation of the storj filewalkers and why all of them are extremely slow on nodes with a really large number of small files (tens of millions).
In that case, your suggestion of creating and maintaining a separate custom database that caches the needed part of the file system metadata read from inodes is reasonable and may be the best option for future development of the storagenode on Linux.
But I wanted to point out that for other file systems, and in particular NTFS (which I think ranks 2nd after Ext4 in popularity and in the number of storage nodes running in the Storj network), this is NOT the case! It may not be the case for some other file systems either, but I don’t know them well enough to speak about them, so for now I’ll stick to NTFS only.
On NTFS, directory files ALREADY contain all the basic metadata needed for the filewalkers to work. At least that is true for the used-space filewalker and the trash-deleter filewalker; I’m not sure about the garbage collection filewalker yet, because I don’t know exactly what set of metadata it really needs.
The information stored in NTFS directory files is already structured and sorted (as far as I understand, directory indexes are sorted B-trees), and NTFS has some simple mechanics to keep these files from fragmenting too much (for example, when a directory grows, additional space is allocated/reserved in relatively large chunks, apparently in steps of about 200 KB by default, which is enough for metadata on almost 1000 files).
Because of this, the information can be read and processed VERY quickly without any additional metadata caching. In fact, directories on NTFS are ready-made metadata caches (covering all the files stored in that directory), created and maintained by the file system itself. They were designed exactly so that inodes (the NTFS equivalents are “file records” or “MFT entries”, stored in $MFT, the master file table) do not have to be read for most simple tasks, and the $MFT is only touched for more complex ones (such as reading access rights/security descriptors, or reading actual file content, which requires the list of extents/cluster runs where the data is stored), as well as when data is modified.
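Just to illustrate what this makes possible, here is a minimal sketch of my own (not storagenode code, and assuming my understanding is correct that on Windows Go fills `DirEntry.Info()` from the directory query itself): a used-space walk can sum file sizes from the directory enumeration alone, without a separate stat of every file.

```go
// A minimal sketch of a used-space scan that relies only on directory
// enumeration. On Windows/NTFS, DirEntry.Info() is (to my understanding)
// filled from the data returned by the directory read itself, so no
// per-file $MFT lookup should be issued here.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func usedSpace(root string) (total int64, err error) {
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info() // no separate os.Stat/os.Lstat call per file
		if err != nil {
			return err
		}
		total += info.Size()
		return nil
	})
	return total, err
}

func main() {
	// Hypothetical path to a node's blobs directory.
	total, err := usedSpace(`D:\storagenode\blobs`)
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}
	fmt.Printf("used space: %d bytes\n", total)
}
```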
I did some performance tests (plus a profile recording of all IO requests) on a Windows node with an NTFS disk, and the results show that native Windows tools (such as dir, robocopy or just the plain GUI File Explorer) can beat the current storj used-space filewalker by 100x or even more at calculating the space occupied by the /blobs/ structure of a large storj node, if we are talking about a really large number of files. In the tests I used my currently largest NTFS node, which now contains about 90 million files. That is not so much in terms of total volume (13.8 TB), so 90 million files is far from the worst-case scenario; soon there will be nodes with a few hundred million files on the network.
I posted all the results on the GitHub issues page: Storage node performance for filewalker is still extremely slow on large nodes · Issue #6998 · storj/storj · GitHub
Please check and evaluate it, because the first reaction I got was that your internal tests supposedly don’t confirm anything like that.
Judging by the profiling results, this speed advantage comes down to just two simple things:
1 - the native Windows utilities read ONLY the directories; they do not access the $MFT (inodes) at all
2 - they read the directories in relatively large blocks (64 KB per IO request), whereas the storj filewalkers read them in small blocks (4 KB per IO request)
Combined with the fact that directory files are usually only slightly fragmented even on a heavily fragmented file system, and that HDDs are really bad only at random reads while coping MUCH better with ~sequential reads of large blocks, this alone gives a severalfold speed difference on reading the directories themselves.
But the main problem is that the storj filewalkers read the $MFT entry of EACH and every file, one file per request. That is EXTREMELY slow on HDDs, and for large nodes a single full pass can take WEEKS of real time. This is unacceptable and will cause more and more problems soon: at some point not only will the used space no longer be evaluated adequately (that is already the case), but even GC and the trash deleters will stop keeping up with the work.
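For contrast, this is roughly the access pattern I mean (a simplified illustration of my own, not the actual filewalker code): list the names, then issue a separate metadata request for every single file, which on NTFS lands on the file’s $MFT record each time.

```go
// A simplified illustration of the slow pattern: enumerate names, then issue
// a separate metadata request for every file. On NTFS each os.Lstat here
// typically has to resolve the file's $MFT record, i.e. roughly one small
// random read per file on a cold cache.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func usedSpaceSlow(dir string) (total int64, err error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		info, err := os.Lstat(filepath.Join(dir, e.Name())) // one request per file
		if err != nil {
			return 0, err
		}
		total += info.Size()
	}
	return total, nil
}

func main() {
	// Hypothetical path to a single blobs subdirectory.
	total, err := usedSpaceSlow(`D:\storagenode\blobs\aa`)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("used space:", total)
}
```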
The good news is that, at least for NTFS, you don’t need to “reinvent the wheel” with a database of cached metadata. The solution is much simpler and more efficient: read the metadata from the directories themselves, in large blocks (perhaps pushing it even further than the standard Windows utilities and reading directories in blocks of, say, 1 MB instead of the 64 KB they use by default), and skip access to the $MFT completely.
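To make that concrete, here is a Windows-only sketch of my own (using golang.org/x/sys/windows; the FileIdBothDirectoryInfo record layout and the 1 MB buffer size are my assumptions for illustration, not storj code or measured optimums) that pulls names, attributes and sizes for a whole directory in ~1 MB chunks, one IO request per chunk, without querying the $MFT per file:

```go
//go:build windows

// A sketch of reading NTFS directory metadata in large blocks via
// GetFileInformationByHandleEx(FileIdBothDirectoryInfo). Non-recursive,
// for brevity. This is an illustration, not the storagenode implementation.
package main

import (
	"fmt"
	"os"
	"unsafe"

	"golang.org/x/sys/windows"
)

// Mirrors FILE_ID_BOTH_DIR_INFO from the Windows SDK (layout assumed for amd64).
type fileIDBothDirInfo struct {
	NextEntryOffset uint32
	FileIndex       uint32
	CreationTime    int64
	LastAccessTime  int64
	LastWriteTime   int64
	ChangeTime      int64
	EndOfFile       int64 // logical file size, duplicated in the directory index
	AllocationSize  int64
	FileAttributes  uint32
	FileNameLength  uint32 // in bytes
	EaSize          uint32
	ShortNameLength byte
	ShortName       [12]uint16
	FileID          int64
	FileName        [1]uint16 // variable length, runs past the struct
}

func scanDir(path string) (files int, total int64, err error) {
	p, err := windows.UTF16PtrFromString(path)
	if err != nil {
		return 0, 0, err
	}
	// FILE_FLAG_BACKUP_SEMANTICS is required to open a directory handle.
	h, err := windows.CreateFile(p, windows.GENERIC_READ,
		windows.FILE_SHARE_READ|windows.FILE_SHARE_WRITE|windows.FILE_SHARE_DELETE,
		nil, windows.OPEN_EXISTING, windows.FILE_FLAG_BACKUP_SEMANTICS, 0)
	if err != nil {
		return 0, 0, err
	}
	defer windows.CloseHandle(h)

	buf := make([]byte, 1<<20) // ~1 MB per request: thousands of entries per IO
	class := uint32(windows.FileIdBothDirectoryRestartInfo)
	for {
		err = windows.GetFileInformationByHandleEx(h, class, &buf[0], uint32(len(buf)))
		if err == windows.ERROR_NO_MORE_FILES {
			return files, total, nil
		}
		if err != nil {
			return files, total, err
		}
		class = uint32(windows.FileIdBothDirectoryInfo) // continue the enumeration
		off := uint32(0)
		for {
			e := (*fileIDBothDirInfo)(unsafe.Pointer(&buf[off]))
			name := windows.UTF16ToString(unsafe.Slice(&e.FileName[0], e.FileNameLength/2))
			if name != "." && name != ".." &&
				e.FileAttributes&windows.FILE_ATTRIBUTE_DIRECTORY == 0 {
				files++
				total += e.EndOfFile
			}
			if e.NextEntryOffset == 0 {
				break
			}
			off += e.NextEntryOffset
		}
	}
}

func main() {
	files, total, err := scanDir(`D:\storagenode\blobs`) // hypothetical path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d files, %d bytes\n", files, total)
}
```

As far as I understand NTFS, the sizes and attributes returned this way come straight from the directory index entries, which is exactly why the per-file $MFT reads can be skipped for used-space accounting.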
Even when/if you develop and implement the metadata-caching database, most likely you will NOT need to use it on NTFS (and possibly on some other file systems).
Maybe only for the sake of a more universal/cross-platform approach, but even then it may turn out to be slower and more resource-intensive than the native mechanism.
And in any case it will not arrive any time soon, because it requires a rather long development/testing/rollout process, whereas a fix for NTFS (and for any other FS where directories already contain a sufficient set of metadata) could be simple and fast to implement.