Next trash problem: Trash not deleting

I am seeing retain progress on the 2024-05-18 folder:

ls /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-05-18
aw  ax  ay  az

It has moved some data into the subfolders and created new subfolders, so retain seems to be working (slowly).
But trashing does not: in my opinion the folder dated 2024-04-22 should not be there anymore, and the file counts are still not decreasing.

No progress in removing pieces:

ls /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-04-22/vl |  wc -l
9518

Question:
As @Alexey has suggested waiting until retain is completed: is the node supposed to remove any trashed pieces while retain is running?
I could restart the node, and if the deletion of bloom filters on restart is not fixed yet, it should remove them and we could see whether trash removal resumes.
Or of course I could remove the bloom filters manually:

ls /config/retain
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1715968799993818000  ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1714499999998475000
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1716130689348472000  v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1715795999926765000
qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa-1715795999280959000

You could copy the filter files elsewhere and restart the node. They will get deleted, and you can check whether the trash gets removed. After that, shut down the node, copy the filters back, and start it again. It will continue the retain process where it left off.
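The procedure above can be sketched as a small script. The /config/retain path matches this thread, but the backup location and the container name are assumptions; the demo below therefore runs against a throwaway temp directory, with the docker commands left as comments for a real node.

```shell
#!/bin/sh
# Sketch of the "move filters aside, restart, restore" procedure.
# RETAIN_DIR defaults to a throwaway demo dir; on a real node it would be
# /config/retain, and the commented-out docker lines would restart your
# actual container (the name "storagenode" is an assumption).
RETAIN_DIR="${RETAIN_DIR:-$(mktemp -d)/retain}"
BACKUP_DIR="${BACKUP_DIR:-${RETAIN_DIR}-backup}"
mkdir -p "$RETAIN_DIR" "$BACKUP_DIR"
touch "$RETAIN_DIR/demo-bloomfilter-file"   # stand-in for a real filter file

# 1. Move the bloom filters out of the way and restart the node:
mv "$RETAIN_DIR"/* "$BACKUP_DIR"/
# docker restart storagenode
# ...watch whether trash cleanup resumes...

# 2. Stop the node, put the filters back, and start it again:
# docker stop storagenode
mv "$BACKUP_DIR"/* "$RETAIN_DIR"/
# docker start storagenode   # retain continues where it left off
ls "$RETAIN_DIR"
```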


Yes, I could try that. I just want to know first whether it is expected that trash deletion halts while retain is running. If that’s the case, then it would be perfectly normal that trash does not get deleted. I don’t think it should halt, but I am not sure.

It’s not expected, but on slow nodes they could probably have contention.

Ok then let’s move the bloomfilters out of the way, restart and see what will happen, ok?

What I’ve gathered over the past couple of weeks, since I’ve never monitored my nodes this closely before:

  1. BFs come in, get saved to disk and are processed one by one:
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1716130689348472000
qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa-1716141599172348000
ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1715795999994095000
ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1715968799997895000
v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1716141599968771000
  2. Trash cleanup doesn’t delete prefix directories until the entire run is done. It starts with the oldest date directory, going through the lettered prefix directories first, then the numbered ones (i.e. 2024-05-03/aa before 2024-05-03/22). Only when all of the prefix directories are cleared (contain no files) are they removed. I haven’t caught this while it’s happening, so I can’t tell whether it does the rmdir one by one or not.

  3. If the lazy filewalkers are enabled, then they can only work with “spare” IOPS; given the testing being done, that traffic takes higher priority. As a result, multiple filewalkers can be running at the same time (i.e. used-space + gc + trash-cleanup all at once). A side comment: it would be ideal if there were some detection that another filewalker is running (e.g. touch a .trash-cleanup file somewhere and check for it, removing it on node startup). IMNSHO, the filewalkers should run in this order: trash-cleanup first, then gc, and used-space only if neither of those is running. There is no point in iterating through an entire satellite’s directory if those files are about to be moved to trash or deleted outright. In short: node restart > .trash-cleanup touched > other filewalkers wait > .trash-cleanup removed > .gc-running touched > used-space waits > .gc-running removed > used-space starts. Regardless, I have not seen any issues with them running concurrently. They are all slowly but surely working through their directories (verified with lsof multiple times).
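The hypothetical lock-file handoff described in that last point could look something like this sketch. Nothing like it exists in storagenode today; the lock names and the run_walker helper are invented here, and the demo uses a temp directory.

```shell
#!/bin/sh
# Hypothetical coordination via lock files, as proposed above -- this is
# NOT how storagenode works today. Each walker waits for the previous
# walker's lock to disappear, then announces itself with its own lock.
LOCK_DIR="${LOCK_DIR:-$(mktemp -d)}"

run_walker() {   # usage: run_walker <name> <lock-file-to-wait-for>
    name=$1; wait_for=$2
    while [ -n "$wait_for" ] && [ -e "$LOCK_DIR/$wait_for" ]; do
        sleep 1                      # earlier walker still running
    done
    touch "$LOCK_DIR/.$name"         # announce that we are running
    echo "running $name"             # ...the actual walk would go here...
    rm "$LOCK_DIR/.$name"            # done; let the next walker start
}

# Proposed order: trash-cleanup first, then gc, then used-space.
run_walker trash-cleanup ""
run_walker gc            ".trash-cleanup"
run_walker used-space    ".gc"
```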

There are significant improvements to be made by not having the filewalkers trample on each other. Cached data is evicted if the host is low on memory, so it’s better to leave that cache for normal download/upload traffic instead of evicting it because trash-cleanup is asking for metadata (so cache that), then gc is also asking for metadata (so evict trash-cleanup’s metadata and cache gc’s), and then used-space comes along (so evict both trash-cleanup’s and gc’s). used-space can run last, to finalize the cache warm-up for normal downloading/uploading.


You mean the retain run must be finished before the trash starts deleting?

No. I mean all files in all 2024-05-03/* must be deleted before that directory (2024-05-03) or any of its subdirectories (2024-05-03/aa) are removed.

It doesn’t go: delete 2024-05-03/aa/all-files, rmdir 2024-05-03/aa, then delete 2024-05-03/ab/all-files, rmdir 2024-05-03/ab, and so on.
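That observed order can be reproduced with two find passes over a synthetic date directory: one pass deletes every file across all prefix directories, and only a final sweep removes the (by then empty) directories. The layout below is a stand-in, not real node data.

```shell
#!/bin/sh
# Demo of the observed cleanup order on a synthetic layout: all files are
# deleted first; the empty prefix dirs (and the date dir) go only at the end.
DATE_DIR=$(mktemp -d)   # stand-in for e.g. trash/<satellite>/2024-05-03
mkdir -p "$DATE_DIR/aa" "$DATE_DIR/ab" "$DATE_DIR/22"
touch "$DATE_DIR/aa/p1.sj1" "$DATE_DIR/ab/p2.sj1" "$DATE_DIR/22/p3.sj1"

find "$DATE_DIR" -type f -delete                 # pass 1: files only
ls "$DATE_DIR"                                   # prefix dirs still exist here
find "$DATE_DIR" -depth -type d -empty -delete   # pass 2: rmdir sweep
```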

So if lazy is so problematic and we got such speed improvements in 1.104, why keep it on? Turn lazy off! End of discussion.

I didn’t say it’s problematic; on the contrary, if we didn’t have lazy then we couldn’t run all three filewalkers at the same time and sustain this testing.

I run 2 nodes on an old Celeron with 1 GB RAM, without lazy. No trash problems, or any problems. But I disabled the startup filewalker, because it only needs to run once a year or so.

I only enabled used-space because I had 20 TB of unaccounted trash. That was solved with 1.104, but some nodes are still running it.

Anyways, I support lazy because normal usage should be priority and filewalkers can slowly go through at their own pace. If the node is full (and there is no more data coming in) then most of the IOPS would go to lazy anyway.


You make a compelling argument. I think I will enable lazy on my spinning rust nodes, which are more likely to suffer from IOPS starvation. Doesn’t seem to affect the SSD ones that much.


Ok, but in my case I have a trash date folder from 2024-04-22 while the current retain is working on 2024-05-18.
Currently I don’t see any trash chore in my logs trying to delete the old folder.
Two satellite folders have their oldest date folder at 16-05; US-1 and EU-1 have theirs at 22-04 and 11-05 respectively.
So the trash removal seems to work, just not for these two satellites.

And even with this

when we receive bloom filters more often, it should not stop the trashing of old data. That data is over a month overdue for deletion.

I was running 14 filewalkers on a Synology early during testing. Unfortunately the lazy filewalker doesn’t actually run at lower priority on Synology, so I turned it off, and it ran just fine with all of them.

Does du -s --si for this folder show a non-zero size?

As I have said in another reply on a different topic: lazy only works if the host sees the actual request for lower priority. If running in a VM, for example, that request never hits the host OS, so the host can’t properly schedule IOPS.

I don’t know how Synology does it; maybe it’s shielding Docker from the host or something.

It could be that the process is still busy with that directory since that got returned to it first?

The 22-04 directory, you mean?
Well, there is still no progress on it:

ls /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-04-22/vl |  wc -l
9518
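If the count for one prefix isn’t moving, a broader check is how many of the prefix directories under the date folder are empty so far. On a real node DATE_DIR would be a path like the ones in this thread; the demo below defaults to a synthetic layout so the commands are runnable anywhere.

```shell
#!/bin/sh
# Count empty vs. total prefix directories under one trash date folder.
# On a real node, DATE_DIR would be something like
# /storage/trash/<satellite>/2024-04-22; here it defaults to a demo dir.
DATE_DIR="${DATE_DIR:-$(mktemp -d)}"
mkdir -p "$DATE_DIR/aa" "$DATE_DIR/ab" "$DATE_DIR/vl"
touch "$DATE_DIR/vl/piece.sj1"          # one prefix still holds a piece

empty=$(find "$DATE_DIR" -mindepth 1 -maxdepth 1 -type d -empty | wc -l)
total=$(find "$DATE_DIR" -mindepth 1 -maxdepth 1 -type d | wc -l)
echo "$empty of $total prefix directories are empty"
```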

No, I meant that the 18-05 one may have been returned to it faster for some reason (i.e. it checked a different satellite and started working on its subdirectory).