Why trash not deleted

I still do not see a difference, sorry. 300GB removed trash in a sorted reverse order is still 300GB removed trash in unsorted order.

I would give you a simple example to experiment:

time ls -lRU /mnt/storj/storagenode/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa | wc -l

then purge the cache (to get a comparable results)

echo 3 | sudo tee /proc/sys/vm/drop_caches

then the next run (by modification time in a reverse order):

time ls -lRtr /mnt/storj/storagenode/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa | wc -l

P.S. however, it will slow down your system because of dropped caches and it would start to build them again.

1 Like

I still don’t understand why you are insisting on the modification time when the name of the folder is the date I am talking about?
As said, I don’t understand the current logic which date folder gets picked first and I have no idea how the OS provides that list. But there must be some logic in place to prevent to select a folder that is not yet old enough for deletion. So there must be some kind of name sorting, checking or testing to prevent that.
When I do a normal ls then I get the name folders sorted in the exact order that I am talking about:

So I don’t understand why it would be complicated, slow or expensive to start with date folder 2024-07-05 first. This is the order a simple ls command provides the folder list to me.

1 Like

The folder is picked randomly or maybe in a reverse order, I do not know what order is returned by your OS, if there is no request to sort it.
I guess the modification time, from the newest to the oldest per satellite I would assume.

But it’s still doesn’t matter at all, the deletion speed is roughly the same. So it will not change the fact that your node is able to delete it in the same speed independently of the sort order.
So, sorry, you didn’t convince me. The logic is against this proposal, especially if it will slow down the deletion.

If you do not believe me - just run commands above an report the time, which is needed to just stat these pieces without any sort order or with sort order by time. You may remove -t from the second command to get an alphabetical order if you wish.

At no point did I suggest it was faster. What I am saying is that it leaves clutter, is harder to monitor, breaks ethics and governance and expectations by customers. I doubt that any customer would expect his data still on nodes 8 weeks after he requested deletion.

Both commands are slow for me, so I cannot run them now. But I am not sure if that is useful at all, because I doubt this is how the deletion is implemented. I don’t think it is recursively going through all date folders and subfolders, at least it sounds that it does not make sense to do so. I would expect an implementation that lists the date folders and take the first one from the list, checks if the name is ready for deletion and then processes the subfolders inside of that folder. Listing only the date folders one way or the other does not show any difference for me in performance. So for the order of the date folders it would not matter.
But from what I read, I agree that ls -U is faster on folders with huge number of files compared to regular ls that seems to sort by name.
What I am seeing is that it seems that the processing order of the date folders is following the unsorted order by ls -U. From what I read it is not at all random but also “some” order ( I could not figure out which order it follows). But: For deletions inside of the date folder you are not following the unsorted order by ls -U, but processing strictly alphabetically from a-z, then numbers. I have just compared the different outputs from subfolders in a date folder with ls and ls -U.
So even if I do agree given the fact is correct, that ls -U has a performance advantage, this leaves the question, why it is not done so inside the date folders and if the advantage is so huge for the order of the date folders which are only a few. So I doubt this. BTW, my tests show that the deletion of the pieces inside the subfolders are again unsorted and this is probably where the biggest performance gain is over sorting them first.
So the question would be: If sorting is so much slower, why do you process the subfolders in alphabetically order? I believe this would have more impact than sorting a small number of date folders. When I do an ls in a trash folder, it has no performance impact whether the date folders get displayed sorted or unsorted.

I think what is proposed here is not to sort the listing of all files, just the listing of the root trash folders with dates, which would be instantaneous. Then go in and process what’s inside in any order. Pseudo-code:

var trashQueue = "storage/trash/satid/*" non-recursive folders sorted by name ascending
while(trashQueue is not empty){
     var folder = trashQueue.next();

    // Cleanup as necessary
}

A benefit I could see is to assess whether trash is accumulating faster than it gets deleted with current setup/settings, or if older trash is never being deleted due to a bug.

3 Likes

Yes, all he’s ever wanted was the top level folders to delete by oldest stated date first. After much head banging in this thread, I’m hoping this is finally interpreted correctly.
I gave up, such trivial miscommunication isn’t worth my time.
Thanks for your input mtone.

2 cents

1 Like

Unfortunately there is no other way to calculate the size of the deleted pieces to subtract this amount from the used space by the trash in the databases.
There is only two way to implement it using the filesystem only:

  1. Stat recursively the folder which should be deleted, then call to delete each file recursively (two walks);
  2. Stat each pieces recursively inside the folder which should be deleted and delete each piece right away (one walk).

As I said, we unlikely request any sort order from the OS, it returns the list and we process it, it’s a fastest way. What is you suggest is to order this list after, which is mean an additional time, CPU cycles and more RAM usage (or even worse, something could be pushed to the swap to allow to sort the list in the RAM) to then delete these files.
Waste of time. These pieces are 1/80 of the segment of the encrypted file, it’s not identificable customers’ data to meet your concerns, after it’s unregistered on the satellite it become a … trash.

I looked into the code, not pretending that I understand much.

But it seems to be this part I am talking about:

At least the function listTrashDayDirs sounds like it is the one responsible for listing the date folders.
It executed through the EmptyTrash function

I am not sure but it looks like the crucial part is this:

for _, subdirName := range subdirNames {
			subdirTime, err := time.Parse("2006-01-02", subdirName)
			if err != nil {
				// just an invalid subdir; could be garbage of many kinds. probably
				// don't need to pass on this error
				continue
			}
			dirTimes = append(dirTimes, subdirTime)
		}

As I am not a coder I asked an AI to sort it by date oldest first:

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"sort"
	"time"
)

// ... (rest of the code remains the same)

func (dir *Dir) listTrashDayDirs(ctx context.Context, namespace []byte) (dirTimes []time.Time, err error) {
	// ... (rest of the code remains the same)

	for _, subdirName := range subdirNames {
		subdirTime, err := time.Parse("2006-01-02", subdirName)
		if err != nil {
			// just an invalid subdir; could be garbage of many kinds. probably
			// don't need to pass on this error
			continue
		}
		dirTimes = append(dirTimes, subdirTime)
	}

	// Sort the dirTimes slice in ascending order (oldest first)
	sort.Slice(dirTimes, func(i, j int) bool {
		return dirTimes[i].Before(dirTimes[j])
	})

	return dirTimes, nil
}

Of course I have no idea if this would be the correct thing to get the desired result

I think you are misunderstanding me. The ls command that you wanted me to execute displays lists all pieces from all date subfolders. This is what I doubt the code is doing. At least it is my thinking, that it does not make sense for the deletion function to list all pieces that are in the trash. But it makes sense to list the date folders → select a date folder → select first subfolder → list pieces.
This is what I would be doing. But as said, I am not a codes and the way I think is probably very much different to yours.

This is exactly what I mean when answering with

This is a closest analogue. Did you notice the -l flag? It’s a stat, it would take the creation date, the size and the name (also other irrelevant information like an owner, I may exclude it, but for comparison between an unsorted and sorted output it’s enough). Actually I must use an unsorted output in both commands and then pass thru a sort command for the sorted way, then to the rm command for the deletion. For the unsorted way pass thru to the rm command only.
But I didn’t this, because the main difference only in sorting or not sorting. And if I would give you the whole command, you wouldn’t be able to compare, because the first run would remove all the pieces from the trash (and screwed your dashboard and pushed to the risk to be disqualified).

No unfortunately this did not convince me either. I don’t think that analogy fits at least for 2 reasons:
First reason is that it does not reflect my suggestion. Second reason is, that based on my observation this is not what the code is doing.

  1. Despite my doubts, I have run the command, results are:
time ls -lRU /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa | wc -l
5613197

real    584m5.243s
user    0m31.431s
sys     10m56.406s


time ls -lRtr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa | wc -l
5613197

real    770m9.466s
user    0m37.127s
sys     12m55.709s

Yes, there is a difference between sorted and unsorted and I have said it above based on what I have read on the internet, that with a large number of files sorting seems to be slower than unsorted.
But the question is, does this reflect what I have suggested? And the answer is no, it does not. Because the sorting command you have suggested also sorts the files by their dates as it can be proved:

ls -lRtr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/:
ls -lRtr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/:
total 64
drwx--S---  38 root root 4096 Jun 21 15:56 2024-06-19
drwx--S---  51 root root 4096 Jun 28 17:27 2024-06-25
drwx--S---  22 root root 4096 Jul  4 02:42 2024-06-28
drwx--S---   9 root root 4096 Jul  5 00:49 2024-07-04
drwx--S--- 104 root root 4096 Jul  7 16:16 2024-07-06
drwx--S---  40 root root 4096 Jul  8 18:00 2024-07-07
drwx--S---   4 root root 4096 Jul  8 23:33 2024-07-08

/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-06-19:
total 32456
drwx--S--- 2 root root  860160 Jun 19 20:52 am
drwx--S--- 2 root root  851968 Jun 19 21:14 an
drwx--S--- 2 root root  835584 Jun 19 21:48 ao
drwx--S--- 2 root root  823296 Jun 19 22:28 ap
drwx--S--- 2 root root  847872 Jun 19 23:04 aq
drwx--S--- 2 root root  847872 Jun 20 00:00 ar
drwx--S--- 2 root root  851968 Jun 20 00:48 as
drwx--S--- 2 root root  847872 Jun 20 01:41 at
drwx--S--- 2 root root  851968 Jun 20 02:34 au
drwx--S--- 2 root root  851968 Jun 20 03:27 av
drwx--S--- 2 root root  843776 Jun 20 04:05 aw
drwx--S--- 2 root root  843776 Jun 20 05:04 ax
drwx--S--- 2 root root  856064 Jun 20 06:02 ay
drwx--S--- 2 root root  843776 Jun 20 07:08 az
drwx--S--- 2 root root  868352 Jun 20 08:27 a2
drwx--S--- 2 root root  856064 Jun 20 09:42 a3
drwx--S--- 2 root root  860160 Jun 20 11:03 a4

/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-06-19/am:
total 863280
-rw------- 1 root root    6400 May 15  2022 bafdu7sdqdtgit5w5a3mxfhscmwbgkdqpxeujooskye3bd6goq.sj1
-rw------- 1 root root  148224 Jun  6  2022 wlqqou3vptvlvcprhfnqrrifvau5gd4v6h23h7naub22fzbobq.sj1
-rw------- 1 root root  178944 Jun 15  2022 x2vpu4bhftpifk7mqzt6yffw2wxiwefxgzgn2gpik3efzc5roa.sj1
-rw------- 1 root root    4096 Jun 21  2022 rwsezjcfdmmwe74ufej27go6cfxze4usu3pbon53v2asssr5sq.sj1
-rw------- 1 root root    9472 Jul  3  2022 smvmvd6yo2e7shulco2ry73lglx7mm45lcsw2ozzymbeewlvpq.sj1
-rw------- 1 root root   65792 Jul 17  2022 7iimrn3xmhlvryolt5ljmiuriow5znzv4kg7k6sgmhmqm5iesa.sj1
-rw------- 1 root root   74496 Aug  5  2022 yqmnkk4gdtutopzygjj7lunuhmso63b3afqixg23pktduv6n6q.sj1
-rw------- 1 root root   74240 Aug  5  2022 smjupeyewoy55jxsbpxtup2q2zib5t7rvnxrdblwn2zgr6thmq.sj1
-rw------- 1 root root   74240 Aug  5  2022 q3v7utalcskoad76yt6refplszu7glkn7gwktgl37h6ed72dja.sj1
-rw------- 1 root root   74240 Aug  6  2022 lotuakuok3rhlyas2rwetlqbp7m3fjknfohzazxfjny7ucmyzq.sj1
-rw------- 1 root root   74496 Aug 13  2022 cahanqsuzk4lhehmeo2jsgnf4udbjagr3qgp3eeqsgqcc6heuq.sj1
-rw------- 1 root root   74240 Aug 18  2022 axkvpyjaj2e3pcfinfosoumzckccnxtlyyb44fao6ob55txfaa.sj1
-rw------- 1 root root   74240 Aug 23  2022 e3rwrzakhwlahvbjp5p473nc32dgswpst4zslkssirmmrs4pya.sj1
-rw------- 1 root root   77056 Aug 23  2022 g2ng24je2eaf4nmq54sspafbgo4gejjwvw2dxecqbvoqjswc2q.sj1
-rw------- 1 root root   73984 Aug 23  2022 un4xy6qfuegewxyaxeicn3sowz7ra4hh5drnyjt7bvjjlgt7eq.sj1
-rw------- 1 root root   74496 Aug 24  2022 msb2o24cotfqpqubpifwnmtnwhppnmsmfsqn5bpm5tfnhoa44a.sj1
-rw------- 1 root root   37376 Aug 24  2022 ffrdvmiyzjzv4gflfzvr3xtmylzoopwi2hg5zwpclomz67i3gq.sj1
-rw------- 1 root root  399616 Aug 25  2022 aukvwiindjrwrhtfnczndtpfyfa42gl57jiwa4ugap2nicqbja.sj1
-rw------- 1 root root  399616 Aug 26  2022 qznqozqp2b23rdumxc34dyrkjgjdzrsbz2z4z6bkb5v47p5wpq.sj1
-rw------- 1 root root   74240 Aug 26  2022 ancenmhkvdf6jtu52tiatowhvrmhu45e4kytqe5jatrlyxcguq.sj1
-rw------- 1 root root   74240 Aug 27  2022 ljgvfa6hxo5caut5nz76l4mlnx4r2j2jsbhv76vofzna7yv3xa.sj1
-rw------- 1 root root   27392 Aug 27  2022 5k65usvu43lf6rrnqe3c34p3l4diq6jc4ylz2npyv4yesiwz3q.sj1
-rw------- 1 root root   74240 Aug 27  2022 d5iq6keurs2pmyqmng3ptrciiza4jxofprnmm462j634ssysbq.sj1
-rw------- 1 root root   74240 Aug 29  2022 pfyqu5hawoh7creiajd7ft4qspemcj4ot4vfkwegirqfllutja.sj1
-rw------- 1 root root 2319872 Sep  8  2022 oh4mfjfovat22o62mcoylpjh6rzbvbx5jcuill7bstzebfjjvq.sj1
-rw------- 1 root root 2319872 Sep  9  2022 omijm6pn7qos7z2bodhrggk5j5vjnpivzftejx7vumzdv6d4nq.sj1
-rw------- 1 root root    5120 Sep 11  2022 4kwcsbzpl4mmqzps2sucpfnohm63cpsm26u2h7ikalna35ffka.sj1
-rw------- 1 root root 2319872 Sep 16  2022 afakiiinkfc6myt6oxme4h4a2rsstiftjgykoopgbqwb3xz5wa.sj1
-rw------- 1 root root  158464 Sep 22  2022 spk2zlj7tw6yyfhw22ibbeaejbktvuapjgwtanxl642thu4obq.sj1

I have to repeat it again, that my suggestion is not to sort the piece files by date. Not even the 2-letter subdirectories. Only the date folders. So I am still convinced that your analogy does not prove that sorting of date folders will have or would have a performance disadvantage.

There is also other evidence for that and that leads to my second reason:

  1. The code does not do what you are suggesting in 2 different ways:
  • Your command processes the entire namespace path which includes all date folders, all 2-letter subfolders and all pieces. I did not find evidence that the code is doing that. This also does not makes sense from a logical standpoint. You cannot pass the entire namespace because you have to check the age of the date folder if it is old enough for deletion. So your command would only apply inside of the date folders but this is already after the sorting and selecting the date folder as I have suggested.
  • I have observed what the processing order is and it is unsorted on the date folders, then inside the date folders the 2-letter subfolders get sorted by modified date and lastly the pieces inside the 2-letter subfolders get deleted by unsorted order again. So your suggested command that sorts by modified time and that you claim to slow down everything is partly exactly what the storagenode code is doing when it is processing the 2-letter subfolders: It sorts them by date. This proves that there is no performance penalty in doing so and it is also strong evidence that my suggestion to sort only the date folders would have no negative performance impact as well.
This is the unsorted output of the date folders which is the order they get processed
ls -1lU /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/
total 204
drwx--S---  32 root root  4096 Aug 30 04:23 2024-08-29
drwx--S---  21 root root  4096 Aug 29 01:12 2024-08-28
drwx--S--- 196 root root  4096 Aug  1 02:59 2024-07-31
drwx--S---  42 root root  4096 Jul  9 03:04 2024-07-08
drwx--S---  21 root root  4096 Jul 25 03:23 2024-07-24
drwx--S---  52 root root  4096 Jul  7 03:11 2024-07-06
drwx--S--- 475 root root 12288 Aug  4 02:08 2024-08-03
drwx--S--- 554 root root 12288 Aug 17 01:25 2024-08-16
drwx--S--- 311 root root  4096 Aug 22 02:52 2024-08-21
drwx--S---  63 root root  4096 Jul 29 23:06 2024-07-29
drwx--S--- 354 root root 12288 Aug  5 03:12 2024-08-04
drwx--S---  22 root root  4096 Aug 28 02:20 2024-08-27
drwx--S--- 450 root root 12288 Aug 26 02:59 2024-08-25
drwx--S--- 423 root root 12288 Aug 10 02:14 2024-08-09
drwx--S--- 198 root root  4096 Aug 21 02:54 2024-08-20
drwx--S---  21 root root  4096 Jul 14 02:43 2024-07-13
drwx--S--- 146 root root  4096 Aug  6 02:50 2024-08-05
drwx--S--- 130 root root  4096 Aug 25 03:11 2024-08-24
drwx--S---   9 root root  4096 Jul 23 13:22 2024-07-23
drwx--S---  70 root root  4096 Jul 27 02:53 2024-07-26
drwx--S---  44 root root  4096 Sep  1 09:07 2024-09-01
drwx--S---  14 root root  4096 Aug 11 01:24 2024-08-10
drwx--S---  56 root root  4096 Jul  6 03:15 2024-07-05
drwx--S---  27 root root  4096 Jul 23 03:03 2024-07-22
drwx--S---  12 root root  4096 Aug 24 02:55 2024-08-23
drwx--S---  14 root root  4096 Aug 19 00:02 2024-08-18
drwx--S--- 417 root root 12288 Sep  1 02:54 2024-08-31
drwx--S--- 256 root root  4096 Aug  7 03:24 2024-08-06
drwx--S--- 222 root root  4096 Aug 31 03:00 2024-08-30
drwx--S--- 101 root root  4096 Aug  9 03:13 2024-08-08
drwx--S---  37 root root  4096 Jul  8 02:43 2024-07-07
drwx--S---  19 root root  4096 Aug 26 23:33 2024-08-26
drwx--S---  36 root root  4096 Aug 15 02:14 2024-08-14
drwx--S--- 260 root root  4096 Aug  8 02:34 2024-08-07
drwx--S---  22 root root  4096 Jul 16 02:20 2024-07-15
drwx--S---  33 root root  4096 Jul 13 01:27 2024-07-12
drwx--S---  22 root root  4096 Jul 22 02:22 2024-07-21
drwx--S---  11 root root  4096 Aug 17 23:49 2024-08-17
drwx--S---  34 root root  4096 Aug 23 01:06 2024-08-22
This is the unsorted output of the 2-letters subfolders of a date folder. This is NOT the order they get processed
ls -1lU /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-08-29/
total 64648
drwx--S--- 2 root root 1236992 Aug 29 19:38 zl
drwx--S--- 2 root root 1245184 Aug 29 12:42 ze
drwx--S--- 2 root root 1257472 Aug 29 04:20 yu
drwx--S--- 2 root root 1273856 Aug 30 04:23 zv
drwx--S--- 2 root root 1961984 Aug 29 07:03 yx
drwx--S--- 2 root root 1961984 Aug 29 22:59 zp
drwx--S--- 2 root root 1925120 Aug 29 09:47 za
drwx--S--- 2 root root 1937408 Aug 29 19:08 zk
drwx--S--- 2 root root 1929216 Aug 29 06:17 yw
drwx--S--- 2 root root 1925120 Aug 29 21:40 zn
drwx--S--- 2 root root 2490368 Aug 30 03:13 zt
drwx--S--- 2 root root 2560000 Aug 29 22:25 zo
drwx--S--- 2 root root 2523136 Aug 29 10:25 zb
drwx--S--- 2 root root 2543616 Aug 29 14:52 zg
drwx--S--- 2 root root 2555904 Aug 30 03:59 zu
drwx--S--- 2 root root 2547712 Aug 29 17:03 zi
drwx--S--- 2 root root 2551808 Aug 29 12:34 zd
drwx--S--- 2 root root 2551808 Aug 29 11:20 zc
drwx--S--- 2 root root  835584 Aug 29 03:26 yt
drwx--S--- 2 root root 2560000 Aug 30 02:33 zs
drwx--S--- 2 root root 2547712 Aug 29 07:59 yy
drwx--S--- 2 root root 2564096 Aug 29 13:59 zf
drwx--S--- 2 root root 2584576 Aug 29 05:28 yv
drwx--S--- 2 root root 2555904 Aug 30 01:12 zr
drwx--S--- 2 root root 2592768 Aug 29 15:59 zh
drwx--S--- 2 root root 2551808 Aug 29 20:44 zm
drwx--S--- 2 root root 2609152 Aug 30 00:16 zq
drwx--S--- 2 root root 2576384 Aug 29 18:19 zj
drwx--S--- 2 root root 2576384 Aug 29 09:25 yz
drwx--S--- 2 root root 2441216 Aug 30 05:10 zw
However this is the real order the 2-letter subfolders get processed
ls -1ltr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-08-29/
total 64648
drwx--S--- 2 root root  835584 Aug 29 03:26 yt
drwx--S--- 2 root root 1257472 Aug 29 04:20 yu
drwx--S--- 2 root root 2584576 Aug 29 05:28 yv
drwx--S--- 2 root root 1929216 Aug 29 06:17 yw
drwx--S--- 2 root root 1961984 Aug 29 07:03 yx
drwx--S--- 2 root root 2547712 Aug 29 07:59 yy
drwx--S--- 2 root root 2576384 Aug 29 09:25 yz
drwx--S--- 2 root root 1925120 Aug 29 09:47 za
drwx--S--- 2 root root 2523136 Aug 29 10:25 zb
drwx--S--- 2 root root 2551808 Aug 29 11:20 zc
drwx--S--- 2 root root 2551808 Aug 29 12:34 zd
drwx--S--- 2 root root 1245184 Aug 29 12:42 ze
drwx--S--- 2 root root 2564096 Aug 29 13:59 zf
drwx--S--- 2 root root 2543616 Aug 29 14:52 zg
drwx--S--- 2 root root 2592768 Aug 29 15:59 zh
drwx--S--- 2 root root 2547712 Aug 29 17:03 zi
drwx--S--- 2 root root 2576384 Aug 29 18:19 zj
drwx--S--- 2 root root 1937408 Aug 29 19:08 zk
drwx--S--- 2 root root 1236992 Aug 29 19:38 zl
drwx--S--- 2 root root 2551808 Aug 29 20:44 zm
drwx--S--- 2 root root 1925120 Aug 29 21:40 zn
drwx--S--- 2 root root 2560000 Aug 29 22:25 zo
drwx--S--- 2 root root 1961984 Aug 29 22:59 zp
drwx--S--- 2 root root 2609152 Aug 30 00:16 zq
drwx--S--- 2 root root 2555904 Aug 30 01:12 zr
drwx--S--- 2 root root 2560000 Aug 30 02:33 zs
drwx--S--- 2 root root 2490368 Aug 30 03:13 zt
drwx--S--- 2 root root 2555904 Aug 30 03:59 zu
drwx--S--- 2 root root 1273856 Aug 30 04:23 zv
drwx--S--- 2 root root 2441216 Aug 30 05:10 zw
Deletion of pieces follows again the unsorted order:
ls -1lU /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-07-06/dk | head -n10
total 3648
-rw-r--r-- 1 root root   3328 May 23 13:03 pwrzfdiixf6zm4huadjp7m42zvxfzkdb5kmqfcwfb7fzkaug7a.sj1
-rw-r--r-- 1 root root 291584 Jun  2 20:10 a32h7rb3sjewux34r67vwux5umdr6nduewncbbzwqlkv2lzs4a.sj1
-rw------- 1 root root 290816 May  1 01:08 4kpcijpzj3l34ayvowtmw2dbccnha6uiwktdmw2wxnmbxzg7ka.sj1
-rw-r--r-- 1 root root 320256 May 20 18:26 lkjc7b7bw6v6nrofvtgztqjwmeiyu45mqp5uqygp22wfbux7xq.sj1
-rw-r--r-- 1 root root 406272 Jun  6 07:46 t734vgpatxagruaha5nnb3dezpecr7s2pm2o73xhlpa6xoslsq.sj1
-rw-r--r-- 1 root root   3584 Jun  1 13:25 cpxqqjlsz4qaakfedprigps5o4klrqds7cxaidh5yjfvuqjy2q.sj1
-rw-r--r-- 1 root root 290560 May 23 02:37 3eajfiypiahq5evpegofh5efakrxflr44gvnwl4t43e7i27zaa.sj1
-rw-r--r-- 1 root root   1792 Jun  7 03:36 wm3wixambmhp4slmgy5grqllces6l5socaxhgtyr4k2uetqnyq.sj1
-rw-r--r-- 1 root root   7168 Jun  1 01:36 oq4leijdoenu3vb5xt56kmas37acnom46kkyzgvsvwjnyafk4q.sj1

This proves that your claim

cannot be entirely true. For the 2-letter subfolders there is a sorting by modified date in place.

Maybe this is still from the old deletion process before the date folders were introduced when there were only 2-letter subfolders in the namespace and they might have been sorted by last modified date for deletion? I don’t know.

But to make it clear once more: My suggestion does not touch the sort order of pieces or 2-letter subfolders. My suggestion is only about the order in which the date folders get processed. And I suggest to sort and process them by oldest first.

My suggested processing order of date folders
ls -1ltr /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/
total 204
drwx--S---  56 root root  4096 Jul  6 03:15 2024-07-05
drwx--S---  52 root root  4096 Jul  7 03:11 2024-07-06
drwx--S---  37 root root  4096 Jul  8 02:43 2024-07-07
drwx--S---  42 root root  4096 Jul  9 03:04 2024-07-08
drwx--S---  33 root root  4096 Jul 13 01:27 2024-07-12
drwx--S---  21 root root  4096 Jul 14 02:43 2024-07-13
drwx--S---  22 root root  4096 Jul 16 02:20 2024-07-15
drwx--S---  22 root root  4096 Jul 22 02:22 2024-07-21
drwx--S---  27 root root  4096 Jul 23 03:03 2024-07-22
drwx--S---   9 root root  4096 Jul 23 13:22 2024-07-23
drwx--S---  21 root root  4096 Jul 25 03:23 2024-07-24
drwx--S---  70 root root  4096 Jul 27 02:53 2024-07-26
drwx--S---  63 root root  4096 Jul 29 23:06 2024-07-29
drwx--S--- 196 root root  4096 Aug  1 02:59 2024-07-31
drwx--S--- 475 root root 12288 Aug  4 02:08 2024-08-03
drwx--S--- 354 root root 12288 Aug  5 03:12 2024-08-04
drwx--S--- 146 root root  4096 Aug  6 02:50 2024-08-05
drwx--S--- 256 root root  4096 Aug  7 03:24 2024-08-06
drwx--S--- 260 root root  4096 Aug  8 02:34 2024-08-07
drwx--S--- 101 root root  4096 Aug  9 03:13 2024-08-08
drwx--S--- 423 root root 12288 Aug 10 02:14 2024-08-09
drwx--S---  14 root root  4096 Aug 11 01:24 2024-08-10
drwx--S---  36 root root  4096 Aug 15 02:14 2024-08-14
drwx--S--- 554 root root 12288 Aug 17 01:25 2024-08-16
drwx--S---  11 root root  4096 Aug 17 23:49 2024-08-17
drwx--S---  14 root root  4096 Aug 19 00:02 2024-08-18
drwx--S--- 198 root root  4096 Aug 21 02:54 2024-08-20
drwx--S--- 311 root root  4096 Aug 22 02:52 2024-08-21
drwx--S---  34 root root  4096 Aug 23 01:06 2024-08-22
drwx--S---  12 root root  4096 Aug 24 02:55 2024-08-23
drwx--S--- 130 root root  4096 Aug 25 03:11 2024-08-24
drwx--S--- 450 root root 12288 Aug 26 02:59 2024-08-25
drwx--S---  19 root root  4096 Aug 26 23:33 2024-08-26
drwx--S---  22 root root  4096 Aug 28 02:20 2024-08-27
drwx--S---  21 root root  4096 Aug 29 01:12 2024-08-28
drwx--S---  32 root root  4096 Aug 30 04:23 2024-08-29
drwx--S--- 222 root root  4096 Aug 31 03:00 2024-08-30
drwx--S--- 417 root root 12288 Sep  1 02:54 2024-08-31
drwx--S---  44 root root  4096 Sep  1 09:07 2024-09-01

So from my view a valid comparison would have been to check how long does it take to sort the date folders → sort the 2-letter subfolders → stat the pieces in unsorted order vs. don’t sort the date folders → sort the 2-letter subfolders → stat the pieces in unsorted order.
As after all of this I believe that most time was spent for sorting pieces and subfolders, I believe my suggestion to sort date folders will have little to no impact on performance. Therefore I stand by my suggestion.

But as you have brought up the matter of performance I want to share what I have read through:
While I have agreed above, that sorting can have a negative performance impact on a large number of files vs. unsorted, this does not necessarily mean that unsorted is always the best and fastest choice.
The sources I have found even suggested, that sorting I/O requests in a specific way could lead to a better performance. So I think it is worth to have a look at it. (Maybe also something for @elek)

As inodes are involved in the deletion process, there are suggestions that sorting I/O requests by inodes would be faster. For ext4 I can say, this could be true as this would make use of the inode caching and read-ahead feature.

Here is an example ls output of the date folders sorted by inode
ls -1i /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ | sort -r
593521316 2024-09-01
593143993 2024-08-31
593041899 2024-08-30
592231096 2024-08-29
591685746 2024-08-28
591225174 2024-08-27
590823958 2024-08-26
590438915 2024-08-25
589998827 2024-08-24
588997086 2024-08-22
588341255 2024-08-21
587938075 2024-08-20
586996106 2024-08-18
586416324 2024-08-17
585826424 2024-08-16
584452324 2024-08-14
582040054 2024-08-10
581416965 2024-08-09
580610220 2024-08-08
579915994 2024-08-07
579395735 2024-08-06
578906606 2024-08-05
578451515 2024-08-04
578113589 2024-08-03
576301641 2024-07-31
575480158 2024-07-29
574257524 2024-07-26
572881041 2024-07-24
572319751 2024-07-23
571675244 2024-07-22
571054663 2024-07-21
568348944 2024-07-15
566890923 2024-07-13
566321630 2024-07-12
563407269 2024-07-08
562745426 2024-07-07
561930283 2024-07-06
561478332 2024-07-05
117237791 2024-08-23

I am not sure sorting folders by inodes would gain much. But where it could make a huge difference could be on the deletion of the pieces. There could be 2 approaches either recursively sorting all pieces in a date folder or by subfolder.

ls output with pieces sorted by inode number
ls -1i /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-07-06/do | sort  | head -n10
 100598232 iklr4tggcyvftsvj4skvfiswhpnobgf2ee3bowdazvu5zaqnhq.sj1
 100629994 be56okexdkglhokj7ame6ipf7tuikw4zywjikq65wisvqebv2q.sj1
 100630524 fxj3qkwzcmyz3wa3urivexppshm75cgluphknrmarums7ype7a.sj1
 102506879 mfey6vdxfqr56uq75vrvskwuajhtl2vijsoozxy7t6oyuduqoa.sj1
 102507044 3xnbdrxc7ysfxyrg5zz4uoukhglnlkj6jmnyayjf2lmdbwj5qq.sj1
 102510611 ng7vnin6snnbcwieqb3iimyjidqsc5ofmjwf6lfltlkwwwcarq.sj1
 102514718 wlfmyzlyynjrefroxvcthbhm3uzf6ejvieyfdlkingb5e3ghuq.sj1
 102519379 tdp47osqmzinfhbmcbqbbfkyvrv5docg37gwoez5jegcfrfz4q.sj1
 102519706 cyihkwa5eam2uyx7q3fd7biouxscdgarf3mbz36gqkbr7m6pva.sj1
 102522728 iwwi6n3oxhivwnnowpvfpo7zie5fiyz7cbpvnandlydkjscica.sj1

This might maximize the use of inode caching capabilitites and prefetching of the ext4 file system and reduce seek times of the underlying hardware.

These are my references which suggest exactly that, that sorting of IO requests can speed things up a lot. They go even beyond inode sorting and suggest sorting by physical proximity:

Questions and comments: performance - Read files by device/inode order? - Stack Overflow

Comment: performance - Read files by device/inode order? - Stack Overflow

it has been shown that userspace sorting of files by inode and physical location does improve performance by large integer factors (of course dependent on file size, more so for small files). See this paper and presentation, which include useful graphs. They report that their qtar tar implementation using FIEMAP is 4x faster in tarring the Linux kernel source tree.

FIEMAP:

FIEMAP: FIEMAP is a kernel feature that can be used by applications through the fiemap system call. To use FIEMAP, an application must explicitly request it by calling the fiemap system call and passing the necessary parameters.

The fiemap system call is used to retrieve the file information, including the inode information, for a given file. The call takes several parameters, including the file descriptor of the file, the offset and length of the file range to retrieve, and a pointer to a struct fiemap structure that will contain the retrieved file information.

We can do better. Some file systems, like Linux EXT4, offer an API for fetching information about file extents: FIEMAP ioctl. We can use this API to get a data structure that contains information on the physical placement of the file data. Then, the physical placement of the beginning of the data can be used to sort the files so that we can process all files in a single sweep. A great news is that this API is also available for non-root users.

The order of file I/O requests has a tremendous impact on I/O performance on spinning drives. If your application needs to process a batch of small files, make sure you request them in the same order as their physical placement on disk. If you can’t do it because your file system or your operating system does not provide physical block placement information, at least sort the files by their identifiers.

And here is a general paper that was referenced: Improving File Tree Traversal Performance by Scheduling I/O Operations in User Space
(I had already linked to another paper of the same author in the past: Improving Disk I/O Performance on Linux)

I don’t know, but maybe the badger cache could be used to cache inode numbers additionally to file sizes when it stats the pieces and maybe then this could be used to do sorting on folders and pieces in a more efficient way so it makes sense and gains performance.

To sum it up: I do believe the processing order of the date folders should follow FIFO and not leave it randomly on whatever OS is in use. The oldest folder should get processed first. This is straightforward, logical, easy to follow and does not leave clutter or half emptied folders behind. Inside the date folders it should be checked, if orderdering by Inode, physical location or use of FIEMAP could speed up the deletion process even beyond the current use of unsorted I/O requests.

I didn’t find any sorting in the code (but I maybe wrong), so the retuned order depends on the OS and the filesystem. Like for example we already know, that Windows and NTFS sorts files automatically by default.
Linux and ext4 seems not. Sorted data inside the two letters folders likely a coincidence.
I’m against making code more complicated when it don’t gives any profit. The only profit which I can imagine - is to delete the expired data faster. The sorting will not help with that.

We shouldn’t spend developers time on what is not worth it. The amount of deleted data will not change with your suggestion to sort pieces and/or folders, sorry.

This is mean that we need to specialize storagnode specifically for Linux and ext4. What would you suggest to support that in BTRFS, NTFS, ZFS?

FILEMAP is too specific for some kernels, too.

I think the better approach would be to extend the badger to be used for blobs, but it should be stabilized first, since we have had a corrupted badger cache in some setups. However, it could be faster on deletions too.

By the way, does the node deletes the trash, or still cannot keep up?

It would be a strange coincidence as I have shown, there is a difference for the 2-letter subfolders between unsorted order presentation and processing them, which is sorted in some way.
Of course I can only observe and this is what I am seeing and I have never observed it different than that.

What could that mean? Maybe that there is no significant performance impact for sorting it as suggested. But for sure my suggestion if implemented would not change anything for the bad for Windows nodes.

I cannot say how much effort would be required to implement my suggestion. Only analogy I have is a database like MySQL or something. If it is similar trivial like adding an ORDER BY clause to a MySQL statement then I don’t see much time and effort wasted.

I am not a coder, I don’t know anything about BTRFS and NTFS and if they have features that could be utilized for improvement.

Judge for yourself. I am seeing similar situation on several nodes. This is just one example:

ls -1 /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa
2024-06-19
2024-06-25
2024-06-28
2024-07-04
2024-07-06
2024-07-07
2024-07-08
2024-07-14
2024-07-20
2024-07-23
2024-08-05
2024-08-06
2024-08-16
2024-08-17
2024-08-25
2024-08-26
2024-09-02

And this is the order they supposedly will get deleted (if ever):

ls -1U /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa
2024-08-17
2024-08-25
2024-08-16
2024-08-26
2024-07-06
2024-07-20
2024-06-19
2024-08-06
2024-06-25
2024-07-14
2024-08-05
2024-07-23
2024-09-02
2024-07-08
2024-06-28
2024-07-04
2024-07-07

Why “if ever”? Because if you leave it to the file system/OS there is no guarantee that it will not place a more recent folder before older ones as you can see where folder 2024-09-02 is in line before 2024-07-07.
With my suggestion, there would be guarantee that old folders will be deleted as they would be always the first ones that the deletion process would be working on. The oldest date folder on the node above is approaching 3 months old, which is not what it could be.

That it’s implemented not on the user layer, but likely on the system layer. The Linux kernel and ext4 implementation seems doesn’t have such defaults.

maybe but why, if it doesn’t make any difference to speed up a deletion? This is the only a measurable item here, all other are irrelevant. Expired data should be and will be removed eventually, but not faster that the underlaying setup can perform.

I get it a long time ago, but why is it so important? The deletion speed of the expired data will not change, so why to bother with the sorting either which will add a latency without a doubt? If the metadata is removed from the satellite’s database, there is no way to make any assumption what is that expired piece and to which segment it belongs, it’s a piece of encrypted data, from any point of view it’s just a random data, the digital trash, which should be deleted and will be, eventually.

1 Like