No unfortunately this did not convince me either. I don’t think that analogy fits at least for 2 reasons:
First reason is that it does not reflect my suggestion. Second reason is, that based on my observation this is not what the code is doing.
- Despite my doubts, I have run the command, results are:
time ls -lRU /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa | wc -l
5613197
real 584m5.243s
user 0m31.431s
sys 10m56.406s
time ls -lRtr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa | wc -l
5613197
real 770m9.466s
user 0m37.127s
sys 12m55.709s
Yes, there is a difference between sorted and unsorted and I have said it above based on what I have read on the internet, that with a large number of files sorting seems to be slower than unsorted.
But the question is, does this reflect what I have suggested? And the answer is no, it does not. Because the sorting command you have suggested also sorts the files by their dates as it can be proved:
ls -lRtr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/:
ls -lRtr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/:
total 64
drwx--S--- 38 root root 4096 Jun 21 15:56 2024-06-19
drwx--S--- 51 root root 4096 Jun 28 17:27 2024-06-25
drwx--S--- 22 root root 4096 Jul 4 02:42 2024-06-28
drwx--S--- 9 root root 4096 Jul 5 00:49 2024-07-04
drwx--S--- 104 root root 4096 Jul 7 16:16 2024-07-06
drwx--S--- 40 root root 4096 Jul 8 18:00 2024-07-07
drwx--S--- 4 root root 4096 Jul 8 23:33 2024-07-08
/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-06-19:
total 32456
drwx--S--- 2 root root 860160 Jun 19 20:52 am
drwx--S--- 2 root root 851968 Jun 19 21:14 an
drwx--S--- 2 root root 835584 Jun 19 21:48 ao
drwx--S--- 2 root root 823296 Jun 19 22:28 ap
drwx--S--- 2 root root 847872 Jun 19 23:04 aq
drwx--S--- 2 root root 847872 Jun 20 00:00 ar
drwx--S--- 2 root root 851968 Jun 20 00:48 as
drwx--S--- 2 root root 847872 Jun 20 01:41 at
drwx--S--- 2 root root 851968 Jun 20 02:34 au
drwx--S--- 2 root root 851968 Jun 20 03:27 av
drwx--S--- 2 root root 843776 Jun 20 04:05 aw
drwx--S--- 2 root root 843776 Jun 20 05:04 ax
drwx--S--- 2 root root 856064 Jun 20 06:02 ay
drwx--S--- 2 root root 843776 Jun 20 07:08 az
drwx--S--- 2 root root 868352 Jun 20 08:27 a2
drwx--S--- 2 root root 856064 Jun 20 09:42 a3
drwx--S--- 2 root root 860160 Jun 20 11:03 a4
/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-06-19/am:
total 863280
-rw------- 1 root root 6400 May 15 2022 bafdu7sdqdtgit5w5a3mxfhscmwbgkdqpxeujooskye3bd6goq.sj1
-rw------- 1 root root 148224 Jun 6 2022 wlqqou3vptvlvcprhfnqrrifvau5gd4v6h23h7naub22fzbobq.sj1
-rw------- 1 root root 178944 Jun 15 2022 x2vpu4bhftpifk7mqzt6yffw2wxiwefxgzgn2gpik3efzc5roa.sj1
-rw------- 1 root root 4096 Jun 21 2022 rwsezjcfdmmwe74ufej27go6cfxze4usu3pbon53v2asssr5sq.sj1
-rw------- 1 root root 9472 Jul 3 2022 smvmvd6yo2e7shulco2ry73lglx7mm45lcsw2ozzymbeewlvpq.sj1
-rw------- 1 root root 65792 Jul 17 2022 7iimrn3xmhlvryolt5ljmiuriow5znzv4kg7k6sgmhmqm5iesa.sj1
-rw------- 1 root root 74496 Aug 5 2022 yqmnkk4gdtutopzygjj7lunuhmso63b3afqixg23pktduv6n6q.sj1
-rw------- 1 root root 74240 Aug 5 2022 smjupeyewoy55jxsbpxtup2q2zib5t7rvnxrdblwn2zgr6thmq.sj1
-rw------- 1 root root 74240 Aug 5 2022 q3v7utalcskoad76yt6refplszu7glkn7gwktgl37h6ed72dja.sj1
-rw------- 1 root root 74240 Aug 6 2022 lotuakuok3rhlyas2rwetlqbp7m3fjknfohzazxfjny7ucmyzq.sj1
-rw------- 1 root root 74496 Aug 13 2022 cahanqsuzk4lhehmeo2jsgnf4udbjagr3qgp3eeqsgqcc6heuq.sj1
-rw------- 1 root root 74240 Aug 18 2022 axkvpyjaj2e3pcfinfosoumzckccnxtlyyb44fao6ob55txfaa.sj1
-rw------- 1 root root 74240 Aug 23 2022 e3rwrzakhwlahvbjp5p473nc32dgswpst4zslkssirmmrs4pya.sj1
-rw------- 1 root root 77056 Aug 23 2022 g2ng24je2eaf4nmq54sspafbgo4gejjwvw2dxecqbvoqjswc2q.sj1
-rw------- 1 root root 73984 Aug 23 2022 un4xy6qfuegewxyaxeicn3sowz7ra4hh5drnyjt7bvjjlgt7eq.sj1
-rw------- 1 root root 74496 Aug 24 2022 msb2o24cotfqpqubpifwnmtnwhppnmsmfsqn5bpm5tfnhoa44a.sj1
-rw------- 1 root root 37376 Aug 24 2022 ffrdvmiyzjzv4gflfzvr3xtmylzoopwi2hg5zwpclomz67i3gq.sj1
-rw------- 1 root root 399616 Aug 25 2022 aukvwiindjrwrhtfnczndtpfyfa42gl57jiwa4ugap2nicqbja.sj1
-rw------- 1 root root 399616 Aug 26 2022 qznqozqp2b23rdumxc34dyrkjgjdzrsbz2z4z6bkb5v47p5wpq.sj1
-rw------- 1 root root 74240 Aug 26 2022 ancenmhkvdf6jtu52tiatowhvrmhu45e4kytqe5jatrlyxcguq.sj1
-rw------- 1 root root 74240 Aug 27 2022 ljgvfa6hxo5caut5nz76l4mlnx4r2j2jsbhv76vofzna7yv3xa.sj1
-rw------- 1 root root 27392 Aug 27 2022 5k65usvu43lf6rrnqe3c34p3l4diq6jc4ylz2npyv4yesiwz3q.sj1
-rw------- 1 root root 74240 Aug 27 2022 d5iq6keurs2pmyqmng3ptrciiza4jxofprnmm462j634ssysbq.sj1
-rw------- 1 root root 74240 Aug 29 2022 pfyqu5hawoh7creiajd7ft4qspemcj4ot4vfkwegirqfllutja.sj1
-rw------- 1 root root 2319872 Sep 8 2022 oh4mfjfovat22o62mcoylpjh6rzbvbx5jcuill7bstzebfjjvq.sj1
-rw------- 1 root root 2319872 Sep 9 2022 omijm6pn7qos7z2bodhrggk5j5vjnpivzftejx7vumzdv6d4nq.sj1
-rw------- 1 root root 5120 Sep 11 2022 4kwcsbzpl4mmqzps2sucpfnohm63cpsm26u2h7ikalna35ffka.sj1
-rw------- 1 root root 2319872 Sep 16 2022 afakiiinkfc6myt6oxme4h4a2rsstiftjgykoopgbqwb3xz5wa.sj1
-rw------- 1 root root 158464 Sep 22 2022 spk2zlj7tw6yyfhw22ibbeaejbktvuapjgwtanxl642thu4obq.sj1
I have to repeat it again, that my suggestion is not to sort the piece files by date. Not even the 2-letter subdirectories. Only the date folders. So I am still convinced that your analogy does not prove that sorting of date folders will have or would have a performance disadvantage.
There is also other evidence for that and that leads to my second reason:
- The code does not do what you are suggesting in 2 different ways:
- Your command processes the entire namespace path which includes all date folders, all 2-letter subfolders and all pieces. I did not find evidence that the code is doing that. This also does not makes sense from a logical standpoint. You cannot pass the entire namespace because you have to check the age of the date folder if it is old enough for deletion. So your command would only apply inside of the date folders but this is already after the sorting and selecting the date folder as I have suggested.
- I have observed what the processing order is and it is unsorted on the date folders, then inside the date folders the 2-letter subfolders get sorted by modified date and lastly the pieces inside the 2-letter subfolders get deleted by unsorted order again. So your suggested command that sorts by modified time and that you claim to slow down everything is partly exactly what the storagenode code is doing when it is processing the 2-letter subfolders: It sorts them by date. This proves that there is no performance penalty in doing so and it is also strong evidence that my suggestion to sort only the date folders would have no negative performance impact as well.
This is the unsorted output of the date folders which is the order they get processed
ls -1lU /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/
total 204
drwx--S--- 32 root root 4096 Aug 30 04:23 2024-08-29
drwx--S--- 21 root root 4096 Aug 29 01:12 2024-08-28
drwx--S--- 196 root root 4096 Aug 1 02:59 2024-07-31
drwx--S--- 42 root root 4096 Jul 9 03:04 2024-07-08
drwx--S--- 21 root root 4096 Jul 25 03:23 2024-07-24
drwx--S--- 52 root root 4096 Jul 7 03:11 2024-07-06
drwx--S--- 475 root root 12288 Aug 4 02:08 2024-08-03
drwx--S--- 554 root root 12288 Aug 17 01:25 2024-08-16
drwx--S--- 311 root root 4096 Aug 22 02:52 2024-08-21
drwx--S--- 63 root root 4096 Jul 29 23:06 2024-07-29
drwx--S--- 354 root root 12288 Aug 5 03:12 2024-08-04
drwx--S--- 22 root root 4096 Aug 28 02:20 2024-08-27
drwx--S--- 450 root root 12288 Aug 26 02:59 2024-08-25
drwx--S--- 423 root root 12288 Aug 10 02:14 2024-08-09
drwx--S--- 198 root root 4096 Aug 21 02:54 2024-08-20
drwx--S--- 21 root root 4096 Jul 14 02:43 2024-07-13
drwx--S--- 146 root root 4096 Aug 6 02:50 2024-08-05
drwx--S--- 130 root root 4096 Aug 25 03:11 2024-08-24
drwx--S--- 9 root root 4096 Jul 23 13:22 2024-07-23
drwx--S--- 70 root root 4096 Jul 27 02:53 2024-07-26
drwx--S--- 44 root root 4096 Sep 1 09:07 2024-09-01
drwx--S--- 14 root root 4096 Aug 11 01:24 2024-08-10
drwx--S--- 56 root root 4096 Jul 6 03:15 2024-07-05
drwx--S--- 27 root root 4096 Jul 23 03:03 2024-07-22
drwx--S--- 12 root root 4096 Aug 24 02:55 2024-08-23
drwx--S--- 14 root root 4096 Aug 19 00:02 2024-08-18
drwx--S--- 417 root root 12288 Sep 1 02:54 2024-08-31
drwx--S--- 256 root root 4096 Aug 7 03:24 2024-08-06
drwx--S--- 222 root root 4096 Aug 31 03:00 2024-08-30
drwx--S--- 101 root root 4096 Aug 9 03:13 2024-08-08
drwx--S--- 37 root root 4096 Jul 8 02:43 2024-07-07
drwx--S--- 19 root root 4096 Aug 26 23:33 2024-08-26
drwx--S--- 36 root root 4096 Aug 15 02:14 2024-08-14
drwx--S--- 260 root root 4096 Aug 8 02:34 2024-08-07
drwx--S--- 22 root root 4096 Jul 16 02:20 2024-07-15
drwx--S--- 33 root root 4096 Jul 13 01:27 2024-07-12
drwx--S--- 22 root root 4096 Jul 22 02:22 2024-07-21
drwx--S--- 11 root root 4096 Aug 17 23:49 2024-08-17
drwx--S--- 34 root root 4096 Aug 23 01:06 2024-08-22
This is the unsorted output of the 2-letters subfolders of a date folder. This is NOT the order they get processed
ls -1lU /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-08-29/
total 64648
drwx--S--- 2 root root 1236992 Aug 29 19:38 zl
drwx--S--- 2 root root 1245184 Aug 29 12:42 ze
drwx--S--- 2 root root 1257472 Aug 29 04:20 yu
drwx--S--- 2 root root 1273856 Aug 30 04:23 zv
drwx--S--- 2 root root 1961984 Aug 29 07:03 yx
drwx--S--- 2 root root 1961984 Aug 29 22:59 zp
drwx--S--- 2 root root 1925120 Aug 29 09:47 za
drwx--S--- 2 root root 1937408 Aug 29 19:08 zk
drwx--S--- 2 root root 1929216 Aug 29 06:17 yw
drwx--S--- 2 root root 1925120 Aug 29 21:40 zn
drwx--S--- 2 root root 2490368 Aug 30 03:13 zt
drwx--S--- 2 root root 2560000 Aug 29 22:25 zo
drwx--S--- 2 root root 2523136 Aug 29 10:25 zb
drwx--S--- 2 root root 2543616 Aug 29 14:52 zg
drwx--S--- 2 root root 2555904 Aug 30 03:59 zu
drwx--S--- 2 root root 2547712 Aug 29 17:03 zi
drwx--S--- 2 root root 2551808 Aug 29 12:34 zd
drwx--S--- 2 root root 2551808 Aug 29 11:20 zc
drwx--S--- 2 root root 835584 Aug 29 03:26 yt
drwx--S--- 2 root root 2560000 Aug 30 02:33 zs
drwx--S--- 2 root root 2547712 Aug 29 07:59 yy
drwx--S--- 2 root root 2564096 Aug 29 13:59 zf
drwx--S--- 2 root root 2584576 Aug 29 05:28 yv
drwx--S--- 2 root root 2555904 Aug 30 01:12 zr
drwx--S--- 2 root root 2592768 Aug 29 15:59 zh
drwx--S--- 2 root root 2551808 Aug 29 20:44 zm
drwx--S--- 2 root root 2609152 Aug 30 00:16 zq
drwx--S--- 2 root root 2576384 Aug 29 18:19 zj
drwx--S--- 2 root root 2576384 Aug 29 09:25 yz
drwx--S--- 2 root root 2441216 Aug 30 05:10 zw
However this is the real order the 2-letter subfolders get processed
ls -1ltr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-08-29/
total 64648
drwx--S--- 2 root root 835584 Aug 29 03:26 yt
drwx--S--- 2 root root 1257472 Aug 29 04:20 yu
drwx--S--- 2 root root 2584576 Aug 29 05:28 yv
drwx--S--- 2 root root 1929216 Aug 29 06:17 yw
drwx--S--- 2 root root 1961984 Aug 29 07:03 yx
drwx--S--- 2 root root 2547712 Aug 29 07:59 yy
drwx--S--- 2 root root 2576384 Aug 29 09:25 yz
drwx--S--- 2 root root 1925120 Aug 29 09:47 za
drwx--S--- 2 root root 2523136 Aug 29 10:25 zb
drwx--S--- 2 root root 2551808 Aug 29 11:20 zc
drwx--S--- 2 root root 2551808 Aug 29 12:34 zd
drwx--S--- 2 root root 1245184 Aug 29 12:42 ze
drwx--S--- 2 root root 2564096 Aug 29 13:59 zf
drwx--S--- 2 root root 2543616 Aug 29 14:52 zg
drwx--S--- 2 root root 2592768 Aug 29 15:59 zh
drwx--S--- 2 root root 2547712 Aug 29 17:03 zi
drwx--S--- 2 root root 2576384 Aug 29 18:19 zj
drwx--S--- 2 root root 1937408 Aug 29 19:08 zk
drwx--S--- 2 root root 1236992 Aug 29 19:38 zl
drwx--S--- 2 root root 2551808 Aug 29 20:44 zm
drwx--S--- 2 root root 1925120 Aug 29 21:40 zn
drwx--S--- 2 root root 2560000 Aug 29 22:25 zo
drwx--S--- 2 root root 1961984 Aug 29 22:59 zp
drwx--S--- 2 root root 2609152 Aug 30 00:16 zq
drwx--S--- 2 root root 2555904 Aug 30 01:12 zr
drwx--S--- 2 root root 2560000 Aug 30 02:33 zs
drwx--S--- 2 root root 2490368 Aug 30 03:13 zt
drwx--S--- 2 root root 2555904 Aug 30 03:59 zu
drwx--S--- 2 root root 1273856 Aug 30 04:23 zv
drwx--S--- 2 root root 2441216 Aug 30 05:10 zw
Deletion of pieces follows again the unsorted order:
ls -1lU /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-07-06/dk | head -n10
total 3648
-rw-r--r-- 1 root root 3328 May 23 13:03 pwrzfdiixf6zm4huadjp7m42zvxfzkdb5kmqfcwfb7fzkaug7a.sj1
-rw-r--r-- 1 root root 291584 Jun 2 20:10 a32h7rb3sjewux34r67vwux5umdr6nduewncbbzwqlkv2lzs4a.sj1
-rw------- 1 root root 290816 May 1 01:08 4kpcijpzj3l34ayvowtmw2dbccnha6uiwktdmw2wxnmbxzg7ka.sj1
-rw-r--r-- 1 root root 320256 May 20 18:26 lkjc7b7bw6v6nrofvtgztqjwmeiyu45mqp5uqygp22wfbux7xq.sj1
-rw-r--r-- 1 root root 406272 Jun 6 07:46 t734vgpatxagruaha5nnb3dezpecr7s2pm2o73xhlpa6xoslsq.sj1
-rw-r--r-- 1 root root 3584 Jun 1 13:25 cpxqqjlsz4qaakfedprigps5o4klrqds7cxaidh5yjfvuqjy2q.sj1
-rw-r--r-- 1 root root 290560 May 23 02:37 3eajfiypiahq5evpegofh5efakrxflr44gvnwl4t43e7i27zaa.sj1
-rw-r--r-- 1 root root 1792 Jun 7 03:36 wm3wixambmhp4slmgy5grqllces6l5socaxhgtyr4k2uetqnyq.sj1
-rw-r--r-- 1 root root 7168 Jun 1 01:36 oq4leijdoenu3vb5xt56kmas37acnom46kkyzgvsvwjnyafk4q.sj1
This proves that your claim

we unlikely request any sort order from the OS, it returns the list and we process it, it’s a fastest way.
cannot be entirely true. For the 2-letter subfolders there is a sorting by modified date in place.
Maybe this is still from the old deletion process before the date folders were introduced when there were only 2-letter subfolders in the namespace and they might have been sorted by last modified date for deletion? I don’t know.
But to make it clear once more: My suggestion does not touch the sort order of pieces or 2-letter subfolders. My suggestion is only about the order in which the date folders get processed. And I suggest to sort and process them by oldest first.
My suggested processing order of date folders
ls -1ltr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/
total 204
drwx--S--- 56 root root 4096 Jul 6 03:15 2024-07-05
drwx--S--- 52 root root 4096 Jul 7 03:11 2024-07-06
drwx--S--- 37 root root 4096 Jul 8 02:43 2024-07-07
drwx--S--- 42 root root 4096 Jul 9 03:04 2024-07-08
drwx--S--- 33 root root 4096 Jul 13 01:27 2024-07-12
drwx--S--- 21 root root 4096 Jul 14 02:43 2024-07-13
drwx--S--- 22 root root 4096 Jul 16 02:20 2024-07-15
drwx--S--- 22 root root 4096 Jul 22 02:22 2024-07-21
drwx--S--- 27 root root 4096 Jul 23 03:03 2024-07-22
drwx--S--- 9 root root 4096 Jul 23 13:22 2024-07-23
drwx--S--- 21 root root 4096 Jul 25 03:23 2024-07-24
drwx--S--- 70 root root 4096 Jul 27 02:53 2024-07-26
drwx--S--- 63 root root 4096 Jul 29 23:06 2024-07-29
drwx--S--- 196 root root 4096 Aug 1 02:59 2024-07-31
drwx--S--- 475 root root 12288 Aug 4 02:08 2024-08-03
drwx--S--- 354 root root 12288 Aug 5 03:12 2024-08-04
drwx--S--- 146 root root 4096 Aug 6 02:50 2024-08-05
drwx--S--- 256 root root 4096 Aug 7 03:24 2024-08-06
drwx--S--- 260 root root 4096 Aug 8 02:34 2024-08-07
drwx--S--- 101 root root 4096 Aug 9 03:13 2024-08-08
drwx--S--- 423 root root 12288 Aug 10 02:14 2024-08-09
drwx--S--- 14 root root 4096 Aug 11 01:24 2024-08-10
drwx--S--- 36 root root 4096 Aug 15 02:14 2024-08-14
drwx--S--- 554 root root 12288 Aug 17 01:25 2024-08-16
drwx--S--- 11 root root 4096 Aug 17 23:49 2024-08-17
drwx--S--- 14 root root 4096 Aug 19 00:02 2024-08-18
drwx--S--- 198 root root 4096 Aug 21 02:54 2024-08-20
drwx--S--- 311 root root 4096 Aug 22 02:52 2024-08-21
drwx--S--- 34 root root 4096 Aug 23 01:06 2024-08-22
drwx--S--- 12 root root 4096 Aug 24 02:55 2024-08-23
drwx--S--- 130 root root 4096 Aug 25 03:11 2024-08-24
drwx--S--- 450 root root 12288 Aug 26 02:59 2024-08-25
drwx--S--- 19 root root 4096 Aug 26 23:33 2024-08-26
drwx--S--- 22 root root 4096 Aug 28 02:20 2024-08-27
drwx--S--- 21 root root 4096 Aug 29 01:12 2024-08-28
drwx--S--- 32 root root 4096 Aug 30 04:23 2024-08-29
drwx--S--- 222 root root 4096 Aug 31 03:00 2024-08-30
drwx--S--- 417 root root 12288 Sep 1 02:54 2024-08-31
drwx--S--- 44 root root 4096 Sep 1 09:07 2024-09-01
So from my view a valid comparison would have been to check how long does it take to sort the date folders → sort the 2-letter subfolders → stat the pieces in unsorted order vs. don’t sort the date folders → sort the 2-letter subfolders → stat the pieces in unsorted order.
As after all of this I believe that most time was spent for sorting pieces and subfolders, I believe my suggestion to sort date folders will have little to no impact on performance. Therefore I stand by my suggestion.
But as you have brought up the matter of performance I want to share what I have read through:
While I have agreed above, that sorting can have a negative performance impact on a large number of files vs. unsorted, this does not necessarily mean that unsorted is always the best and fastest choice.
The sources I have found even suggested, that sorting I/O requests in a specific way could lead to a better performance. So I think it is worth to have a look at it. (Maybe also something for @elek)
As inodes are involved in the deletion process, there are suggestions that sorting I/O requests by inodes would be faster. For ext4 I can say, this could be true as this would make use of the inode caching and read-ahead feature.
Here is an example ls output of the date folders sorted by inode
ls -1i /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ | sort -r
593521316 2024-09-01
593143993 2024-08-31
593041899 2024-08-30
592231096 2024-08-29
591685746 2024-08-28
591225174 2024-08-27
590823958 2024-08-26
590438915 2024-08-25
589998827 2024-08-24
588997086 2024-08-22
588341255 2024-08-21
587938075 2024-08-20
586996106 2024-08-18
586416324 2024-08-17
585826424 2024-08-16
584452324 2024-08-14
582040054 2024-08-10
581416965 2024-08-09
580610220 2024-08-08
579915994 2024-08-07
579395735 2024-08-06
578906606 2024-08-05
578451515 2024-08-04
578113589 2024-08-03
576301641 2024-07-31
575480158 2024-07-29
574257524 2024-07-26
572881041 2024-07-24
572319751 2024-07-23
571675244 2024-07-22
571054663 2024-07-21
568348944 2024-07-15
566890923 2024-07-13
566321630 2024-07-12
563407269 2024-07-08
562745426 2024-07-07
561930283 2024-07-06
561478332 2024-07-05
117237791 2024-08-23
I am not sure sorting folders by inodes would gain much. But where it could make a huge difference could be on the deletion of the pieces. There could be 2 approaches either recursively sorting all pieces in a date folder or by subfolder.
ls output with pieces sorted by inode number
ls -1i /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-07-06/do | sort | head -n10
100598232 iklr4tggcyvftsvj4skvfiswhpnobgf2ee3bowdazvu5zaqnhq.sj1
100629994 be56okexdkglhokj7ame6ipf7tuikw4zywjikq65wisvqebv2q.sj1
100630524 fxj3qkwzcmyz3wa3urivexppshm75cgluphknrmarums7ype7a.sj1
102506879 mfey6vdxfqr56uq75vrvskwuajhtl2vijsoozxy7t6oyuduqoa.sj1
102507044 3xnbdrxc7ysfxyrg5zz4uoukhglnlkj6jmnyayjf2lmdbwj5qq.sj1
102510611 ng7vnin6snnbcwieqb3iimyjidqsc5ofmjwf6lfltlkwwwcarq.sj1
102514718 wlfmyzlyynjrefroxvcthbhm3uzf6ejvieyfdlkingb5e3ghuq.sj1
102519379 tdp47osqmzinfhbmcbqbbfkyvrv5docg37gwoez5jegcfrfz4q.sj1
102519706 cyihkwa5eam2uyx7q3fd7biouxscdgarf3mbz36gqkbr7m6pva.sj1
102522728 iwwi6n3oxhivwnnowpvfpo7zie5fiyz7cbpvnandlydkjscica.sj1
This might maximize the use of inode caching capabilitites and prefetching of the ext4 file system and reduce seek times of the underlying hardware.
These are my references which suggest exactly that, that sorting of IO requests can speed things up a lot. They go even beyond inode sorting and suggest sorting by physical proximity:
Questions and comments: performance - Read files by device/inode order? - Stack Overflow
Comment: performance - Read files by device/inode order? - Stack Overflow
it has been shown that userspace sorting of files by inode and physical location does improve performance by large integer factors (of course dependent on file size, more so for small files). See this paper and presentation, which include useful graphs. They report that their
qtar
tar implementation usingFIEMAP
is 4x faster in tarring the Linux kernel source tree.
FIEMAP:
FIEMAP: FIEMAP is a kernel feature that can be used by applications through the
fiemap
system call. To use FIEMAP, an application must explicitly request it by calling thefiemap
system call and passing the necessary parameters.The
fiemap
system call is used to retrieve the file information, including the inode information, for a given file. The call takes several parameters, including the file descriptor of the file, the offset and length of the file range to retrieve, and a pointer to astruct fiemap
structure that will contain the retrieved file information.
Ordering Requests to Accelerate Disk I/O | Piotr Kołaczkowski
In the earlier post I showed how accessing data on an SSD in parallel can greatly improve read performance. However, that technique is not very effective for...
We can do better. Some file systems, like Linux EXT4, offer an API for fetching information about file extents: FIEMAP ioctl. We can use this API to get a data structure that contains information on the physical placement of the file data. Then, the physical placement of the beginning of the data can be used to sort the files so that we can process all files in a single sweep. A great news is that this API is also available for non-root users.
The order of file I/O requests has a tremendous impact on I/O performance on spinning drives. If your application needs to process a batch of small files, make sure you request them in the same order as their physical placement on disk. If you can’t do it because your file system or your operating system does not provide physical block placement information, at least sort the files by their identifiers.
And here is a general paper that was referenced: Improving File Tree Traversal Performance by Scheduling I/O Operations in User Space
(I had already linked to another paper of the same author in the past: Improving Disk I/O Performance on Linux)
I don’t know, but maybe the badger cache could be used to cache inode numbers additionally to file sizes when it stats the pieces and maybe then this could be used to do sorting on folders and pieces in a more efficient way so it makes sense and gains performance.
To sum it up: I do believe the processing order of the date folders should follow FIFO and not leave it randomly on whatever OS is in use. The oldest folder should get processed first. This is straightforward, logical, easy to follow and does not leave clutter or half emptied folders behind. Inside the date folders it should be checked, if orderdering by Inode, physical location or use of FIEMAP could speed up the deletion process even beyond the current use of unsorted I/O requests.