It seems that the new feature “save-state-resume GC filewalker” isn’t functioning as expected

I'm experimenting with vfs_cache_pressure=1 on one of my hosts. After leaving it overnight, almost all RAM reported as buffers had been converted into cache, and slabtop shows dentry at the top of the sorted list, so there's that.

Edit: had a memory panic, so back to defaults.

@elek
Hello.

I didn't know that Ext4 directories contain ONLY file names. Is this really the case - can any other metadata (beyond names) for files stored on nodes with Ext3/Ext4 partitions ONLY be obtained by reading inodes?
If so, it explains the current implementation of the Storj filewalkers and the extremely slow operation of all filewalkers on nodes with a really large number of small files (tens of millions).

And your suggestion of creating and maintaining a custom separate database that caches the needed part of the file system metadata obtained from inodes is reasonable in this case and may be the best option for future development of the storagenode on Linux.

But I wanted to discuss that for other file systems, and in particular NTFS (which I think ranks 2nd after Ext4 in popularity and number of storage nodes running in the Storj network), it's NOT the case! It may not be the case for some other filesystems either, but I don't have enough knowledge of them to speak here, so for now I'll stick to NTFS only.

On NTFS, directory files ALREADY contain all the basic metadata info necessary for the filewalkers to work. At least its true to used-space filewalker and trash-deleter filewalker, I’m not sure right now about the garbage collector filewalker yet, because I don’t know its exact needs(what set of metadata does it really need).

This information stored in NTFS directory files is already structured and sorted (as far as I understand it, these are sorted B-trees), and NTFS has some simple mechanics to minimize fragmentation of these files (one of them: when a directory grows, additional space is allocated/reserved in relatively large chunks, it seems in +200 KB steps by default, which is enough for metadata on almost 1000 files).

Thanks to that, this information can be read and processed VERY quickly without any additional metadata caching. In fact, directories on NTFS are ready-made metadata caches (covering all the files stored in that directory) created and maintained by the file system (OS) itself. It was designed this way exactly to avoid reading inodes (in NTFS their equivalent is "file records" or "MFT entries", stored in the $MFT, the master file table) for most simple tasks, and to access the inodes ($MFT) only for more complex tasks (such as getting access rights/security descriptors, or reading actual file content, which needs the list of extents/cluster chains where the file data is stored), as well as when data is modified.

I did some performance tests (plus a profile recording of all IO requests) on a Windows node with an NTFS disk, and the results show that native Windows tools (such as dir, robocopy or just the plain GUI file explorer) can beat the current Storj used-space filewalker at calculating the space occupied by the /blobs/ structure of a large Storj node by 100x or even more, if we are talking about a really large number of files. In the tests I used my currently largest node on NTFS, which now contains about 90 million files. Not that much in terms of total volume - it's 13.8 TB - so 90 million files is not the worst-case scenario; soon there will be nodes with a few hundred million files on the network.

I posted all the results on the GitHub issues page: Storage node performance for filewalker is still extremely slow on large nodes · Issue #6998 · storj/storj · GitHub

Please check and evaluate it, because the first reaction I got was that supposedly your internal tests don't confirm anything like that.

For me, judging by the profiling results, this is achieved by only two simple things:
1 - native NTFS utilities read ONLY directories - they do not access the $MFT (inodes) at all
2 - they read directories in relatively large blocks (64 KB per IO request), whereas the Storj filewalkers read them in small blocks (4 KB per IO request)

Combined with the fact that directory files are usually only slightly fragmented even on a heavily fragmented FS, and the fact that HDDs are really bad only at random reads but cope MUCH better with roughly linear reads of large blocks, this alone gives a speed difference of several times on reading the directories themselves.

But the main problem is that the Storj filewalkers read the $MFT entry of EACH and every file one at a time, one request per file. That is EXTREMELY slow on HDDs, and on large nodes a single full pass can take WEEKS of real time. This is unacceptable and will cause more and more problems soon - at some stage not only will the used space cease to be evaluated adequately (that is already the case), but even the GC and trash deleters will stop coping with the work.

The good news is that, at least for NTFS, you don't need to "reinvent the wheel" with a database of cached metadata. The solution is much simpler and more efficient: read the metadata from directories (and in large blocks - maybe even push it further than the standard Windows utilities do and read directories in blocks of, say, 1 MB instead of the 64 KB that native NTFS utilities use by default) and skip access to the $MFT completely.
Even when/if you develop and implement such a database solution, most likely you will NOT need to use it on NTFS (and possibly some other filesystems).

Maybe keep it for a more universal/cross-platform approach, though it may still turn out to be slower and more resource-intensive than the native one. And in any case it will not arrive any time soon, because it requires a rather long development/testing/rollout process, whereas a fix for NTFS (and any other FS where directories contain a sufficient set of metadata) can be simple and fast to implement.

3 Likes

Voidtools' Everything is probably the gold standard for NTFS-tailored file traversal.

It just did a full re-index of 8 TB of Storj files in 3-4 minutes, plus 3 minutes for quick sorting. I think my MFT (the NTFS metadata database) is around 60 GB.

I believe NTFS tries to limit MFT fragmentation by reserving free blocks for it as it grows. The tool appears to read the whole thing sequentially at 150-200 MB/s+, grabbing all the paths, sizes, and created/modified dates along the way.

Could modern Linux NTFS drivers be viable? This is such an enormous difference compared to the several days a filewalker takes, causing all kinds of problems, that it might warrant exploring an alternate implementation. Maybe "File Sprinter".

2 Likes

Ext4's ext4_dir_entry_2 is the data structure that stores directory entries; ext4_inode is the data structure that stores file metadata. You can see that the best you can expect to find in a directory entry is the file type. Why is this so? Because of hardlinks: all hardlinks to the same file must share the file metadata. If it were stored in directory entries, it would have to be updated separately in each directory that has a hardlink to the file. File type is special in the sense that it cannot be modified, which is why it's safe to keep a copy of it in the directory entry.
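
For illustration (my addition, not from the kernel sources verbatim): the fixed part of ext4's on-disk directory entry, struct ext4_dir_entry_2, expressed as a Go struct. Note there is no size or timestamp field - those live in struct ext4_inode, so listing file sizes costs one inode read per file:

```go
// Illustrative only: the fixed part of ext4's on-disk directory entry
// (struct ext4_dir_entry_2). Sizes and timestamps live in struct ext4_inode,
// so listing file sizes means one inode read per file.
package ext4layout

type ext4DirEntry2 struct {
	Inode    uint32 // inode number the name points to
	RecLen   uint16 // total length of this entry record
	NameLen  uint8  // length of the name that follows
	FileType uint8  // regular file, directory, symlink, ... (immutable, so safe to copy here)
	// Name [NameLen]byte follows inline: the file name itself.
}
```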

NTFS also has hardlinks. I don't know the data structures used in NTFS (I tried finding some public documentation, but failed), but given that it also supports hardlinks, I assume it has the same problem. Though if you can show documentation stating otherwise, I'd love to learn about it. Maybe it has some special-case code for files that are only linked once - I could believe that.

1 Like

Ha, it seems they do keep a copy of the file size in the directory entry, with exactly the drawback I mentioned:

NTFS duplicates the time stamp and file size information from the file’s MFT record. This technique, which is used by FAT and NTFS, requires updated information to be written in two places. Even so, it’s a significant speed optimization for directory browsing because it enables the file system to display each file’s time stamps and size without opening every file in the directory.

(source: NTFS On-Disk Structure | Microsoft Windows Internals (4th Edition): Microsoft Windows Server 2003, Windows XP, and Windows 2000)

Golang's API, based on the usual POSIX approach, does not give the file size when listing directories, so this would mean Storj would have to implement a custom directory listing procedure just for Windows. Probably not that much work.
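
A rough sketch of what such a Windows-only listing could look like (my illustration, not Storj code; it assumes golang.org/x/sys/windows and declares the documented FILE_FULL_DIR_INFO layout and information-class constant locally). It sums file sizes for one directory by reading the directory stream in 1 MB batches, never touching the MFT per file:

```go
// Sketch only: sum file sizes in one directory on Windows without touching
// the MFT per file, by reading the directory stream in large batches.
package main

import (
	"fmt"
	"os"
	"unsafe"

	"golang.org/x/sys/windows"
)

// fileFullDirInfo mirrors the documented FILE_FULL_DIR_INFO layout.
type fileFullDirInfo struct {
	NextEntryOffset uint32
	FileIndex       uint32
	CreationTime    int64
	LastAccessTime  int64
	LastWriteTime   int64
	ChangeTime      int64
	EndOfFile       int64 // logical file size (0 for subdirectories)
	AllocationSize  int64
	FileAttributes  uint32
	FileNameLength  uint32
	// FileName (UTF-16) follows inline.
}

const fileFullDirectoryInfo = 14 // FILE_INFO_BY_HANDLE_CLASS: FileFullDirectoryInfo

func sumDirSizes(dir string) (total int64, err error) {
	p, err := windows.UTF16PtrFromString(dir)
	if err != nil {
		return 0, err
	}
	// FILE_FLAG_BACKUP_SEMANTICS is required to open a directory handle.
	h, err := windows.CreateFile(p, windows.GENERIC_READ,
		windows.FILE_SHARE_READ|windows.FILE_SHARE_WRITE|windows.FILE_SHARE_DELETE,
		nil, windows.OPEN_EXISTING, windows.FILE_FLAG_BACKUP_SEMANTICS, 0)
	if err != nil {
		return 0, err
	}
	defer windows.CloseHandle(h)

	buf := make([]byte, 1<<20) // ask for directory entries 1 MB at a time
	for {
		err = windows.GetFileInformationByHandleEx(h, fileFullDirectoryInfo, &buf[0], uint32(len(buf)))
		if err == windows.ERROR_NO_MORE_FILES {
			return total, nil // finished enumerating the directory
		}
		if err != nil {
			return total, err
		}
		off := uint32(0)
		for {
			e := (*fileFullDirInfo)(unsafe.Pointer(&buf[off]))
			total += e.EndOfFile
			if e.NextEntryOffset == 0 {
				break
			}
			off += e.NextEntryOffset
		}
	}
}

func main() {
	total, err := sumDirSizes(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("bytes used:", total)
}
```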

1 Like

There is an ext4 feature, dir_index, that seems to store (or help store) metadata in the directory entries:
ext4(5) - Linux manual page

dir_index
Use hashed b-trees to speed up name lookups in large
directories. This feature is supported by ext3 and ext4
file systems, and is ignored by ext2 file systems.
But at least here I see that it is already enabled by default. I don't know whether it is used automatically when getting file information or whether it requires some special way to access the file size from there.

However, reading this answer to exactly this problem (which sounds like it was written by an SNO), the best solution could indeed be to use a separate database:

linux - Super slow Ext4 directory traverse - Super User

If you are trying to store a lot of small blobs, it might be that a database with the appropriate indexes, might be a better choice for your application.

1 Like

Maybe this gives some more ideas or background to improve performance on Linux:

3 Likes

which can be easily solved by adding more RAM or using an SSD for metadata (LVM).

Yes, but Windows is not capable of using any amount of RAM to cache this, unfortunately; in this comparison it's a loser. (Please do not try to use NTFS under Linux, it will be even worse!) To fix this you need third-party software like PrimoCache, and perhaps an SSD layer.
However, if you have an SSD, you can also configure tiered storage via PowerShell without needing PrimoCache (but yes, it can be configured only from PowerShell; there is no GUI, unlike PrimoCache - but it's your money, after all).

It still does not store file sizes.

2 Likes

I collected some more ideas. Please note, I am not a programmer, just trying to be creative:

  1. Let me refine my other suggestion here:

The idea was to eliminate the need to query the inode in addition to the filename we have already obtained. In case that would slow down normal operations, the idea of links (hard or soft links) came to my mind.
For a file like xwx3njdry2vhdthddq23773yfwcm4tv73qdpv65o7topw3pepq.sj1, store it as before under its normal name for all regular operations. But what if you also create a link to it, say in a "filewalker directory"? The name of the link could have all the required information appended to it, like xwx3njdry2vhdthddq23773yfwcm4tv73qdpv65o7topw3pepq.sj1.2024-02-07.2319872. For filewalking you would then traverse not the regular file directory but the directory with all the links, and get the information from the names only.
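
A minimal sketch of this shadow-link idea (all directory names and helpers here are hypothetical; piece files are immutable, which is what makes a size baked into a link name safe):

```go
// Sketch of the "shadow link" idea: keep a second hard link to every piece in a
// metadata directory, with mtime and size baked into the link name, so a walker
// only ever needs to read directory entries.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// addShadowLink creates <metaDir>/<name>.<date>.<size> as a hard link to the piece.
// It would be called once, right after the piece file is written.
func addShadowLink(pieceDir, metaDir, name string) error {
	fi, err := os.Stat(filepath.Join(pieceDir, name))
	if err != nil {
		return err
	}
	linkName := fmt.Sprintf("%s.%s.%d", name, fi.ModTime().UTC().Format("2006-01-02"), fi.Size())
	return os.Link(filepath.Join(pieceDir, name), filepath.Join(metaDir, linkName))
}

// usedSpace sums the sizes encoded in the link names; it never stats a file.
func usedSpace(metaDir string) (int64, error) {
	entries, err := os.ReadDir(metaDir)
	if err != nil {
		return 0, err
	}
	var total int64
	for _, e := range entries {
		parts := strings.Split(e.Name(), ".")
		size, err := strconv.ParseInt(parts[len(parts)-1], 10, 64)
		if err != nil {
			continue // not a shadow link in the expected format
		}
		total += size
	}
	return total, nil
}

func main() {
	// Hypothetical layout: pieces live in ./blobs, shadow links in ./blobs-meta.
	if len(os.Args) > 1 {
		if err := addShadowLink("blobs", "blobs-meta", os.Args[1]); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
	total, err := usedSpace("blobs-meta")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("used bytes according to link names:", total)
}
```

The obvious cost is that every piece create/delete now has to maintain two directory entries, and a delete that misses the shadow link would leave the totals stale.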

  2. From the link I gave you before (https://home.simula.no/~paalh/students/CarlHenrikLunde.pdf) I read this:

In section 2.3.5, we explained that directory entries are stored in a random order when there are many files in the same directory. One problem with this is that if one wants to read all the inodes which the directory holds, and read them in the order returned from the file system, the disk will have to do a lot of seeking. Inode readahead tries to mitigate this issue.

This sounds like it would be a good idea not to read a directory ordered by filenames but ordered by inode number. Then the potential of the inode readahead cache could be maximized. Also, from how I understand the inode readahead cache, maybe there is a way to first read the inodes that are already in the page cache? (A sketch of stat'ing in inode order follows after the listing below.)
Here is an output with ls -i that shows how the inode numbers in a directory differ from the ordering by name:

Output with inode numbers
ls -l --format="single-column" -i ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/a2
809571697 22ajsbruqfnwbzhbpekhd4wiemjbtmbu4iemtjdqr6dfho5ada.sj1
725132895 22f3pwdlzcexq2ywbn4qqrefuhyvahvukdt2zsqqsj7jh7pyna.sj1
445920003 24k6zbo4aifeuttqq7kmq5pccqfzhosujnrps7qjn6qsm7fr5a.sj1
725133261 25yimhhtne7pr7r6dhljr5o4wve5qaetveb2iocl2fwtcx525q.sj1
725131332 26njkh37qpxebt2sootlqut2bujl4errzjh2lhxqxbru6kaz2a.sj1
725128253 26nwqseag4fmlb4pmbup5t47tuhzg6ks7pukofnu2yc47tdu7a.sj1
730059875 27cnbf7znvfxaifhtdgumkf4e7zpaskyvk4dbipnin6gp3ghcq.sj1
725126924 27xlnwcx6bgqp4hoz7zibch5utyv4ohfv4njgzmnihkmbs5rgq.sj1
793169579 2ab6xc3a7qledkqxnej7zcljgsd5jum5kuhnege273yalrlefq.sj1
725134662 2c4nkibbqcn25l7x2jwuwes7lnwmec5ga64jv56viq45nlrixa.sj1
725132506 2cgfaflqtzcp6stfaibsckdc3bidetqpvvcsj25lvelagvvbma.sj1
725130285 2ckbtsz2v2vdvrwmvl2mufzcbmxd3s4j56c4htuieanqkneszq.sj1
659432763 2d6jkosfch34eiumjx3xerk4vlesvhbzxtslywhezdnfchephq.sj1
793154683 2ebk35d2nsvsf3jtpvjay2m47iebycd6chgoxy2uhzwmw2uyja.sj1
725132280 2el3ykwhqvbcqkpx4fslzhdebuxhijro6ekimalqm4juitmtea.sj1
793156054 2epdpkp7mtqgjlmim5s7mv6brc6yqqk4f6crbkc2lwq5ffoyyq.sj1
379920172 2exmgeje43mmsfikcvoyrge22oqspgy64olevwgoilet67bmmq.sj1
725134842 2fx37hl4ipyhm6eawn2y6wu274m6uh2mz54veay5jaydjbjbcq.sj1
809577593 2g2lbr6foczk5c2vudw77wc7zdtt7nflb6fu75dbnqgs65fugq.sj1
809566962 2g3vmbnompelsskx5tb3syvdwc5qotyhlomngytdysuxlzvypa.sj1
809567708 2gb6zb7erpszeadsg3eg5ooak7rxwrzdmly665mnjmpto6ycoa.sj1
725130333 2hgefgltyv6k4u5ze4gm4ivpqhauiykstsiibn56klvbxulp4q.sj1
793153222 2ho2rebqhrmakkbekomzbldjknl2nvoh4hmd742apxtfvttrca.sj1
450747057 2hrvvzxmb54j47rvmnhkjl5xzfl4gzfhviiy7uhsiipmhlp65a.sj1
734642937 2iqjcp326joln2f26tcvsnesud5lu324d7yhpuvfouk4azufpa.sj1
725134110 2j2xmveakms46luhhvs2xw4esowrbxfef5bgehduelbqlzvusa.sj1
793171946 2j4xjhr5dpctly65o66hu2jyxroc5ou34wpcqbtj2qukja55za.sj1
725127551 2jginh36turfehf47x76j2zwfkkiia6dk6avllbc2qfyfkss7a.sj1
793174779 2k2atdgugjbynxva364vlfyuab7zrubev2iwuww4ntu6o65pcq.sj1
809571921 2kf6qprzaomkzlte6ugjaq3fa7ztiz7ej2bhjz5otefiazslzq.sj1
809568758 2kj72k7j7bszcww6auzqafappzghowamok7yczd6q2cz57b63a.sj1
326331158 2lreqjp52bxhphxq3mgb5mg5zfhjjoiaptcmnz2d2zz5szoguq.sj1
725134345 2mas5aav6zxrwpys2my2k3bmjmkjd6ajhq5sljjeba6yfxqxja.sj1
809571877 2ncavv4jic36h2p3qjmpr367grmp57buq3crlqlmtp4lotogea.sj1
809577265 2o6tzkwgyjal4zhzjn3w23rhqmq3pxnbhom2adnatarv6kxfda.sj1
725133559 2omi7vgbt6cxrxayijz4yeevejmb4gwfbcmgxr2h4rmnvjgcma.sj1
809571731 2opqgow2vav6gsgqtn6fuxlp64lvduhwdzlswxvouwhk3dggsa.sj1
793162013 2pcgkwoff7pymh2veiwn2olorwzkonhlbd5fcvfve5g552ar3q.sj1
793175415 2pemsxbn77zdqko6gxzi2egasa64zl7kqnk6it5bcpzu6ziw4a.sj1
466790388 2pjoludmzm5yiiyumxbfadkmv7l5ykqkcstnyxrttls66wbmdq.sj1
480787471 2pngzg6va2n3pnc7lqokhyz72n6zzc4wwn7xdrhip4c4eeswva.sj1
809576378 2qfp25yeq5zzkcbp35l2fdo4q6i7sycitis5q4vbx4filgizea.sj1
809577804 2qmaf2eknt6jvfdrvvjfjds4euahlqrk6bgr24tgr4csfhrtiq.sj1
793158778 2r7g4qvi3xhenfk4smg5p7vqvwultv4nfl32qc7o7y7j53dplq.sj1
809577597 2rq4yrmnifz6f7qhymgoe6mifee7eed72u2dwoz6o5oamwbtia.sj1
793177421 2rvwcxcj53rhxqdl27jzxgufjjq6mgfvfvqjiaodmmjhlhkz4a.sj1
793172583 2s4thwbvktfcaxobqeopngemczoqbeccgk544tuk4dzfrrdk5a.sj1
725134298 2s7jxpwolz57wq4azz6drck4mgahybngz4ozoiyxb5bow3xa4q.sj1
544793935 2ses4chgjgiihlly7eovivi6ed4oc47ehfwjbnmh7rzshzygyq.sj1
793177204 2t5we5onyk77z46jnc5jyq5ykfp3rvlt7ip3jarzmwied6xd2a.sj1
793163602 2u4w3vuivpyvzkxk3wxbtuqfrnydga7bcaqc3a6nvaqwduii3a.sj1
809567338 2ua25irb7ocpklsi5yu6zwphaitugyyfqf45hhty6csb2pv2ca.sj1
360553632 2ub55yjtg27456qzd4khf2lvo2w64qtk7h7b6jyxtpboeaawea.sj1
725131420 2umcvivu7op4kpqp3hduztqvq24xz3wfgzim6yqnwzwyl6bz5a.sj1
725133939 2umpnowbpa273m3ilcbybzmhn4ewe3bpfbqcqo3gc222wb4lcq.sj1
625060160 2uo7gbhtaf437qoe3rbe25sudiknqepb7zqwrisxzb5npv2otq.sj1
809577110 2uqtocmejc2py6ksgbqxxbpybh7aykdlvghxmbn6tnx5jp46xq.sj1
725125973 2uwogd7r6byarlamdneisdotser37l5tpqeujpetfhyqldjwba.sj1
734647060 2uzhyoggm3xdievixxquwh5gx2jnngobzvsb2rwq7nox26hlma.sj1
725126947 2vg6gljexku736gly6qvlvnpgib5nw44z2wt3m6yagefdmoepq.sj1
725133256 2wmfsz3gjad3wilm6yzslg7nzs5nshgfpgsusouht7cjwqk2ma.sj1
809576663 2xifmrqfikdwiso6ehmew4nlds5lhey3pfrer4mbxcpq3gc3cq.sj1
655833144 2xl4bf7aej6httkl47mwx6rodlgiamax57dzpag53p7xgwljwa.sj1
725131233 2xm7eicf2vw3yzh4repzlc3tzo5a6dyrtnlp5v34q4jowqkydq.sj1
643385867 2xoxezq4g7gtdcbvj5qs42rf7gizfzd3b3pskq7rlu6sumbsmq.sj1
725132860 2xy2sw3tegxilex7wzcp32czgspb7m36u5ourpr2rvrwqxferq.sj1
793174874 2yfwtdvsrsagbwzqnducngpy7q4onl3wwtcqnuugzflj4aqcea.sj1
809574548 2ywwpnueylk45urynl47fybrdivcuo4ew3or4lfdztxtoisfha.sj1
793162957 2zaqqfcl6yrgdcegielrt6mha2xoxchwukutpucr6exmh7kjta.sj1
725134773 32tzoi7jgtr35xnkpvtmyjsfwudxofug2sx3pzaixkq55hi4kq.sj1
725134161 33tqnknvplo7jugo35opkrc53xauzg3hnrdgi735rcei2xtzka.sj1
725131790 34ls3mrltorv7tcqsrvunfe4lu3d4ui5kae6mtgbjoxpeb3soq.sj1
725125411 34nbcryrahhvzzijdsxcu4a4whb5nwremnvilfienumx4dguuq.sj1
793163850 34v6cfqdo2nqirgk22fphz3fi6mucbm7c4cqcp4b334mhxayea.sj1
725134546 35xlbjadzpoatjtqw6kwfaiwhlaylpuwp4h23ysc2wrp7ig6gq.sj1
445925619 36famr6fggzpfd7dwkpfaqluao3nez3qu55vb46xq3xltlt3aa.sj1
725133614 36izs26ypnpjflksn3kzehnhvgr6oz2gjjkbphjf55vn6ofdaa.sj1
725131048 36owg7mbdrisjymswgglk6janep4t4dn3ic3ls3bzzh5iybxha.sj1
649577305 37o44y4svot3adwcia4mybj7g2kdwscipzsdmqsvg2rqy6emya.sj1
793149588 3arqdoatsqjxtu4bd4w3bqelhwq72ygrypdjmmklfvzuusn3ua.sj1
725133734 3be44oj73dpsspfv53pm5zmpmn7kf4hceaazum7st6qf57zzbq.sj1
588511720 3d5zt7piejcid6u2bwjydgzg7au6ir6v5jmhaol3totsojq5qq.sj1
213790804 3erniqjhssteiesb732uc5f4znwlcookku2rw2pb2ea4nh2vpa.sj1
793180129 3f543mskhd6b2ryu7mdzpgfdrvi7nl57qxjfkta4xfiwnaexoa.sj1
725131636 3fejrqmcj3qdthsoaog7s2hfu2vvt5hxedzn34egm7uumqt62a.sj1
793174193 3fzfwlyu4jdzcygeifm2o56z4g4ze5v727ieqfvs63z3bal2ua.sj1
793179286 3gjq4que47sy5xczugjkftgboujhpqu6z47pqmih6g6rodoyra.sj1
679029977 3glm4rlxd4q4vhljmjqarukhprxk7syv4snm2zfaabyvm3cgcq.sj1
734645269 3ihar46ymdx3n3mexuiqp4rhucs6mol5adakevgwkiiwiaroiq.sj1
734644148 3ixg6svb4mty2tpvbhlwugpju2262zte52bpdnpyn4gyrmhtoa.sj1
185707206 3izyyqss6av3akkvslkt7uqyh3fatc2jxrtyjvpqssr2bizvba.sj1
497634557 3jajto4thrsnppsm5n4s34seo7omv3vfqsgjynpk7e7ybo5a2q.sj1
169375943 3nquwqav5q7shdufghewdhbqikm3rbqdxpmhrczhc256vjjwuq.sj1
725132719 3o6rpqxpd3b43bl44acpix5kzink34pegkk4pdghklvilrduna.sj1
793167514 3obxj2ejuxkrbmryed3wptcfkk5gm3jnx2dciiqwew5f3jzxma.sj1
725130951 3onk2yug6om75eewfs3dxj35an634xfs6omlbkhxxaoho5n7ka.sj1
151746369 3p3ejwmc2dpkubjg3bhzgjc2fqzogv4chhz3xfhy6wdo67v27q.sj1
793167264 3q5f5moavavvncp7z7wxncjimjbukcbfauac5o5duw4bzgxlxa.sj1
793179050 3qg64vliej4l2vqwcbj5f3vfzndo5ll7j6r3qfjiio3fsyo76a.sj1
809572183 3reh6ac3ndlipcd5tnq5ef2e6vsmnexc7rivagfpfs5brsmydq.sj1
211505854 3rlxjjcx6z3zwnlp2wxavi3nb3i77f4sgpzfbb4vsnim26znya.sj1
793156714 3rmewjbu4y544ysagowndftm4rsw5epypf4uck23pandsddlja.sj1
809574609 3sqnu6mqp4ne5ppyvrga74wh557ssghzdu4vccvoz2ed3dkhaa.sj1
656720670 3ss7lorqjacf3keljbr4mdizbdkvwomn4rlqfcvsy67l5zekta.sj1
725133091 3tmqtqijxol7k4pmrc6mn4ci4bsh5kdjpnpghkscgix3gcqsaa.sj1
793174900 3usep3xe5pe6y3vxer44xrna2jt3vjjwiy6mjiafndo2psb6oq.sj1
619202286 3uucvzejgljyaiedgi2ghwivievpftik3if6dvwujkryc5rieq.sj1
793156421 3vhc4vdmp4dnrnausavwutc6qsbikp55fsrcfe4xpvdanivyfa.sj1
793161986 3vt5jftme2oornatbfvwj4pjlmljtewbfpq2gbqnzlgf2oevva.sj1
725134765 3w6yqjoimyukbieio4frpyzmx2hll32i3yyyxxr5voqfbdr5wa.sj1
242811316 3wuaxtojorth5g2c7mz25nwqhgdlzboym55xdcg2p2nr5am5ta.sj1
725123140 3ww6dmukub2ya5gpvzknhu5h3alguc6znvobiqebyw4v73be3a.sj1
793177445 3xzcjqvgxsvmstz7o7cz42dg6dsanlwx26vb7lt3lxzdoczdya.sj1
793179172 3ydhpzq37efnlp2ryuesj6vyvk7qud67264sekypxdpdwnpn2a.sj1
809575168 3ydravilmuzjmqeh7u5xsemfgamqvrzt7d744k6jridnevlg4q.sj1
725128948 42o6hzhzuk53kdkxfzkjxi5izkivni5e3xjgr7kejqrvzg4ljq.sj1
361164186 43hfq7vapo5s5md7ywwumwuq42ys64jz3vjqf3ppybbpi3jova.sj1
242811348 43hqlngmafvf2yblm3gq5xc3zamg3xby6auc2ulofabtgydzaq.sj1
809576625 44wwl6j56hr6fzwazbzhk53ayuy47g5n5hjjyv74irroehj2ha.sj1
242811358 45wvbg47y5uhiadof7wgdkcllkcpk4y3rlgty6igwysskop6la.sj1
725131363 45xk5efddto2jh44r6kp4xdxh3tbyoew4wpepvlbuvf26zwshq.sj1
225713862 46z4zsjbn4mpttdh3xxboe75vhf7gjn7e5dirxsuuundvsio5q.sj1
685785875 4bzyyfflvskrgayuthjsjs4auqjwmpym2ch6amg6lecm6lzzxq.sj1
467033465 4e4ytmh3324pbtufhdtfnpk5lx7hoh7lkemvubejuesff5mhjq.sj1
793177742 4eq26xrevkyehlr2dkeh4lo6kjrygetq7xt7vmu2e35v7ndydq.sj1
793161460 4fdz6qcyah4unv5cxmxeyylfuhcqkpg2uy3zzs5lft3diq2mmq.sj1
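
Picking up the inode-order idea above, here is a rough Linux-only sketch (assuming golang.org/x/sys/unix and the common 64-bit dirent layout; not Storj code): list one directory with getdents64, then stat the files in ascending inode order instead of name order:

```go
// Sketch only (Linux, golang.org/x/sys/unix): list one directory with
// getdents64, then stat the files sorted by inode number, so the inode
// reads happen in roughly ascending on-disk order instead of name order.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"unsafe"

	"golang.org/x/sys/unix"
)

type dirEntry struct {
	ino  uint64
	name string
}

func listByInode(dir string) ([]dirEntry, error) {
	fd, err := unix.Open(dir, unix.O_RDONLY|unix.O_DIRECTORY, 0)
	if err != nil {
		return nil, err
	}
	defer unix.Close(fd)

	var out []dirEntry
	buf := make([]byte, 1<<20) // read directory entries in 1 MB batches
	for {
		n, err := unix.Getdents(fd, buf)
		if err != nil {
			return nil, err
		}
		if n == 0 {
			break
		}
		for off := 0; off < n; {
			d := (*unix.Dirent)(unsafe.Pointer(&buf[off]))
			if name := direntName(d); name != "." && name != ".." {
				out = append(out, dirEntry{ino: d.Ino, name: name})
			}
			off += int(d.Reclen)
		}
	}
	// Ascending inode order: neighbouring inodes usually share inode-table
	// blocks, so the following stat() calls seek far less.
	sort.Slice(out, func(i, j int) bool { return out[i].ino < out[j].ino })
	return out, nil
}

// direntName extracts the NUL-terminated name from a dirent record.
func direntName(d *unix.Dirent) string {
	b := (*[256]byte)(unsafe.Pointer(&d.Name[0]))
	for i, c := range b {
		if c == 0 {
			return string(b[:i])
		}
	}
	return string(b[:])
}

func main() {
	dir := os.Args[1]
	entries, err := listByInode(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var total int64
	for _, e := range entries {
		if fi, err := os.Lstat(filepath.Join(dir, e.name)); err == nil {
			total += fi.Size()
		}
	}
	fmt.Println("files:", len(entries), "bytes:", total)
}
```
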
  3. Here is something I found about directory sizes. I cannot assess the effect - maybe it does not have any - but maybe it is something that needs to be considered:
    https://blogs.oracle.com/linux/post/space-management-with-large-directories-in-ext4

Large directories in the ext4 filesystem face a unique problem when the majority of the files in the directory are deleted. This situation can cause functions like sys_getdents to experience a significant slowdown, which might even cause the system to hang. This blog post explains why this occurs and what you can do if you encounter this problem. It’s important to note that this isn’t a bug but rather how ext4 manages large directories.

The issue here is that, ext4 doesn’t free the directory’s empty blocks. Therefore, once a directory’s size is grown, it cannot be shrunk.

The reason behind this is that the ext4 file system does not release its empty blocks back to the kernel. Therefore, the size of the directory does not reduce even after the files have been deleted.

What I see is that the 2-char subfolders in the blobs folders are created once but never deleted. So maybe they have this performance issue too?

  4. Make use of caches or other running file operations
    Linux caches a lot. Maybe there are ways to get more information from caches, or from operations that happen anyway in the background, and use that for the calculation instead of doing extra directory traversals. The general idea is to try to get information from caches or other processes. A theoretical example to make clear what I mean: let's say some collector process has just iterated over the files of a directory. The file information is probably in the cache now, so it would be easy for any other process to get the required information while the files are hot. For example, a used-space calculator could kick in, read the file sizes of these files from the cache and store them somewhere for later use, instead of running some time later and re-reading the files from disk. So basically, whenever an inode is accessed, record the size no matter whether that specific operation needs it; over time you would then have nearly all file sizes at hand to calculate used space without running an additional filewalker.
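
A tiny sketch of what such an opportunistic size cache could look like (structure and names are purely illustrative, not Storj code):

```go
// Sketch of an opportunistic size cache: every code path that already has a
// FileInfo in hand reports it here, so a used-space total builds up as a side
// effect of normal operations instead of a dedicated walk.
package main

import (
	"fmt"
	"sync"
)

type SizeCache struct {
	mu    sync.Mutex
	sizes map[string]int64 // piece path -> last observed size
	total int64
}

func NewSizeCache() *SizeCache {
	return &SizeCache{sizes: make(map[string]int64)}
}

// Observe is called by any operation (upload, download, audit, GC check) that
// already knows a piece's size; it never touches the disk itself.
func (c *SizeCache) Observe(path string, size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if old, ok := c.sizes[path]; ok {
		c.total -= old
	}
	c.sizes[path] = size
	c.total += size
}

// Forget removes a deleted piece from the running total.
func (c *SizeCache) Forget(path string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if old, ok := c.sizes[path]; ok {
		c.total -= old
		delete(c.sizes, path)
	}
}

// UsedSpace is only as complete as what has been observed so far, so it
// converges over time rather than being exact from the start.
func (c *SizeCache) UsedSpace() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.total
}

func main() {
	c := NewSizeCache()
	c.Observe("blobs/aa/piece1.sj1", 2319872) // e.g. reported by an upload
	c.Observe("blobs/aa/piece2.sj1", 1536)    // e.g. reported by an audit read
	c.Forget("blobs/aa/piece2.sj1")           // e.g. reported by the trash deleter
	fmt.Println("used bytes observed so far:", c.UsedSpace())
}
```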

  5. Improve the directory traversal algorithm
    This is something I got from an AI: Breadth-First Search (BFS).
    I have no idea if this could be useful; maybe it is worth looking up:

Breadth-First Search (BFS) is a graph traversal algorithm that explores a graph or a tree level by level, starting from a given source node. It's a popular algorithm for searching or traversing graph-like data structures, such as directory trees.

Here’s how BFS works:

  1. Choose a starting node (also called the source node): This is the node where the search begins.
  2. Explore the neighbors: Visit all the nodes that are directly connected to the starting node. These nodes are said to be at level 1.
  3. Explore the next level: Visit all the nodes that are connected to the nodes at level 1. These nodes are said to be at level 2.
  4. Repeat step 3: Continue exploring nodes at each subsequent level, until all nodes have been visited.

The key characteristics of BFS are:

  • Level by level exploration: BFS explores the graph level by level, starting from the source node.
  • Queue-based implementation: BFS typically uses a queue data structure to keep track of nodes to visit. Nodes are added to the queue as they are discovered, and removed from the queue as they are visited.
  • Complete exploration: BFS guarantees that all nodes in the graph will be visited, as long as the graph is connected.

In the context of directory traversal, BFS can be used to traverse a directory tree by treating each directory as a node, and the subdirectories and files as edges. The algorithm starts at the root directory and explores all subdirectories and files at each level, before moving on to the next level.

BFS has several advantages, including:

  • Efficient: BFS can be more efficient than recursive algorithms, especially for large graphs or directory trees.
  • Easy to implement: BFS is a relatively simple algorithm to implement, especially when using a queue-based approach.
  • Complete coverage: BFS guarantees that all nodes in the graph will be visited, which is important for tasks like directory traversal.

However, BFS also has some limitations, such as:

  • Memory usage: BFS can require a significant amount of memory to store the queue of nodes to visit, especially for very large graphs or directory trees.
  • Slow for very deep graphs: BFS can be slow for graphs or directory trees with a very large depth, as it needs to explore all nodes at each level before moving on to the next level.

Overall, BFS is a powerful algorithm for graph traversal and directory traversal, and is often used in many applications, including file systems, web crawlers, and social network analysis.
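
For what it's worth, a queue-based BFS over a directory tree is only a few lines in Go (illustrative only; whether it actually beats a recursive walk depends on how the filesystem lays directories out):

```go
// Sketch: breadth-first traversal of a directory tree using an explicit queue,
// summing file sizes level by level.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func bfsUsedSpace(root string) (int64, error) {
	var total int64
	queue := []string{root} // FIFO queue of directories still to visit
	for len(queue) > 0 {
		dir := queue[0]
		queue = queue[1:]
		entries, err := os.ReadDir(dir)
		if err != nil {
			return total, err
		}
		for _, e := range entries {
			if e.IsDir() {
				queue = append(queue, filepath.Join(dir, e.Name())) // visit on a later level
				continue
			}
			if fi, err := e.Info(); err == nil {
				total += fi.Size()
			}
		}
	}
	return total, nil
}

func main() {
	total, err := bfsUsedSpace(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("bytes:", total)
}
```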

  6. Log parsing
    At least in my logs, the size of an uploaded file gets displayed:
    2024-07-10T04:54:27Z INFO piecestore uploaded { ... "Size": 1536}
    Can’t we parse the sizes from the logs as they are written?
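
Parsing the sizes out of the log stream would be straightforward; a sketch, assuming the log format shown above (the regex and matching criteria are inferred from that single sample line):

```go
// Sketch: keep a running total of uploaded bytes by scanning the node log for
// `piecestore uploaded` lines and extracting the "Size" field.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
	"strings"
)

var sizeRe = regexp.MustCompile(`"Size":\s*(\d+)`)

func main() {
	var total int64
	sc := bufio.NewScanner(os.Stdin) // e.g. pipe the node log in, or tail -f it
	sc.Buffer(make([]byte, 1024*1024), 1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "piecestore") || !strings.Contains(line, "uploaded") {
			continue
		}
		if m := sizeRe.FindStringSubmatch(line); m != nil {
			if n, err := strconv.ParseInt(m[1], 10, 64); err == nil {
				total += n
			}
		}
	}
	fmt.Println("uploaded bytes seen in log:", total)
}
```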

  7. And maybe optimize readahead:
    This is also from https://home.simula.no/~paalh/students/CarlHenrikLunde.pdf
    Of course I cannot tell whether you already make use of it or whether it would have any positive effect at all.

In each struct file there is accounting data for the readahead algorithm, as shown in listing 2.6. An application may influence the behaviour of the readahead algorithm by explicitly hinting what kind of I/O pattern it has. The system call posix_fadvise is used for declaring the file access pattern. Three of the flags are of interest for us here:
POSIX_FADV_NORMAL - The default behaviour. Heuristics will determine if readahead is to be done.
POSIX_FADV_SEQUENTIAL - The application notifies that accesses are expected to be sequential, so the Linux kernel doubles the maximum readahead size.
POSIX_FADV_RANDOM - Accesses are expected to be random, so readahead will be disabled.
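
In Go this hint is a single call via golang.org/x/sys/unix (Linux only); a sketch that declares a sequential access pattern before reading a file:

```go
// Sketch: advise the kernel that a file will be read sequentially before
// reading it, so the readahead window is enlarged (per the quote above).
package main

import (
	"fmt"
	"io"
	"os"

	"golang.org/x/sys/unix"
)

func readSequentially(path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	// POSIX_FADV_SEQUENTIAL: declare a sequential access pattern for the whole file.
	if err := unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_SEQUENTIAL); err != nil {
		return 0, err
	}
	return io.Copy(io.Discard, f)
}

func main() {
	n, err := readSequentially(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("read", n, "bytes")
}
```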

3 Likes

Is this still being worked on:

https://review.dev.storj.io/c/storj/storj/+/12806

There seems to be zero progress for almost 6 weeks.