According to Grafana I am moving pieces to trash at a peak effective rate of 2Gbps for the past 12 or so hours. Have accumulated ~14TB of trash so far, you love to see it.
Congratulations on solving the problem of SLC TTL data not being deleted! I currently have 3.61 TB of trash showing on a 16.81 TB node. This is a great start in cleaning things up to make room for some paying data. The only question I have is why this data didn't just get deleted directly? It is not like it needs to hang around a week in the trash for "safety" reasons like paid data. It was supposed to be automatically deleted TTL data anyway and should have been immediately eradicated. Now it is hanging around for another week before really getting deleted. Just a thought for further improvement in the process. Thank you!
Do you see high CPU usage? The new BF from Saltlake could be more efficient than before and can move almost all of the expired data that was missing from the TTL database to the trash.
Hello @NPrincen,
Welcome to the forum!
Because it was not registered in the storagenode TTL database when this data was uploaded to your node. This problem should be fixed with this change:
With other changes, this orphaned expired TTL data should now be collected by GC regardless of how quickly TTL records are removed from databases on the satellite side.
Does this mean that TTL collector deletions do not update the cache?
If so, this sounds like it might be a bug, as the cache would ideally be updated whenever a TTL piece is deleted, to avoid unnecessary records in the cache and the additional IOPS of trying to delete cached files that have already been deleted by the collector instead of retain.
What kind of cache? Your node executes a TTL collector every hour by default, which removes expired pieces directly.
When your node receives a Bloom Filter from the satellite, it starts collecting pieces that should not exist according to this BF, and then the retain process moves them to the trash. So, by the time retain tries to move a piece, the TTL collector may have already removed it.
You may also see the reverse situation, where GC is quicker than the TTL collector; then the TTL collector will complain about the missing piece. This usually happens if the customer deleted the TTL data before its expiration.
These two filewalkers are independent of each other and do not interact, to avoid slowing each other down. They also do two different things: the TTL collector removes pieces, while GC+retain moves pieces to the trash.
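A rough sketch of the retain side of this (hypothetical names and paths, not the actual storagenode code): a piece that has already been removed by the TTL collector is simply logged and skipped, not treated as a failure.

```go
// Minimal sketch: the retain step tolerates pieces the TTL collector
// already deleted. Names and layout here are made up for illustration.
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// moveToTrash stands in for the retain step of GC: it tries to move a piece
// into the trash folder and just logs if the piece is already gone.
func moveToTrash(blobsDir, trashDir, pieceID string) error {
	src := filepath.Join(blobsDir, pieceID)
	dst := filepath.Join(trashDir, pieceID)
	if err := os.Rename(src, dst); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			// Already removed (e.g. by the TTL collector): not a failure.
			fmt.Printf("WARN piece %s already deleted, skipping\n", pieceID)
			return nil
		}
		return err
	}
	return nil
}

func main() {
	blobs, trash := os.TempDir(), os.TempDir()
	// The piece does not exist, so this only prints the warning.
	if err := moveToTrash(blobs, trash, "example-piece-id"); err != nil {
		fmt.Println("unexpected error:", err)
	}
}
```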
Hello fellow Storlings,
I have 20.8 GB of trash now.
Now we need more customer data.
Greetings from Germany
Michael
I might be wrong here, since I'm just assuming, but this sounds like the kind of error the badger cache might produce if TTL deletions are not removed from it.
@edo In the node where you received those errors, did you have the badger cache enabled?
Edit: It appears that the badger cache was not enabled, so what I thought was broken would no longer be the case. I still wonder how retain can attempt to delete data that has already been deleted by TTL, since if the files are deleted the retain filewalker should not find them in the filesystem. One possible cause is that TTL is able to delete files from trash: GC may have moved a piece to trash, TTL deleted it from there, and then after a week the node could not delete the file from trash because TTL had already taken care of it. I am not familiar enough with the logs to know whether "failed to trash piece" refers to moving the file to trash, or deleting it from trash.
Hi @pasatmalo, no, I didn’t have badger cache enabled on that node.
At the moment I have 30 TB of trash on 200 TB allocated. It's OK and the nodes run perfectly. Thanks.
P.S. Now I'm waiting for data from clients.
It's working perfectly right now. The cleanup takes a really long time to collect the trash from the past months, but it gets there eventually. It's still increasing by about 20 GB every 10 minutes. Let's see how long this will take.
Thank you, Storj devs, for fixing this error. You are doing a really great job.
A round of applause for you
I can confirm it is working now, but due to the amount of uncollected garbage, some of my nodes are working slowly:
# this is about 18h of running; it has only gotten to the ck dir so far (it runs alphabetically)
# I estimate it would take a week or so to complete this bloom filter
saltlake.tardigrade.io:7777 (GC @ Folder:ck Date:2024-08-19)
But much appreciated! Thank you for working on this!
Pretty simple, these are two independent processes. The TTL collector deletes pieces from the filesystem and their records from the TTL database.
The garbage collector+retain consists of two parts: the garbage collector, which processes pieces according to the sent Bloom Filter and remembers which pieces should be moved to the trash (in a private RAM allocation, it's not the cache), and the retain process, which it runs after some number of collected pieces (1,000 by default, --collector.expiration-batch-size). But by that time a remembered piece could already have been deleted by the TTL collector.
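As a rough illustration of that batching idea (hypothetical names and numbers, not the real storagenode code): candidates are accumulated in memory and handed over once the batch size is reached.

```go
// Minimal sketch of batch-and-flush: collect candidate piece IDs in memory
// and pass them on in groups of batchSize. Purely illustrative.
package main

import "fmt"

const batchSize = 1000 // stand-in for the configured batch size

// retain would move the batch of pieces into the trash folder.
func retain(pieces []string) {
	fmt.Printf("moving %d pieces to trash\n", len(pieces))
}

func main() {
	var pending []string // in-memory list of collected piece IDs

	for i := 0; i < 2500; i++ { // pretend the GC walk found 2500 garbage pieces
		pending = append(pending, fmt.Sprintf("piece-%04d", i))
		if len(pending) >= batchSize {
			retain(pending)
			pending = pending[:0] // reuse the slice for the next batch
		}
	}
	if len(pending) > 0 { // flush the remainder at the end of the walk
		retain(pending)
	}
}
```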
No. If the TTL collector doesn't find a piece, it will print a WARN message that the file is not found and move on. This often happens if the customer deleted the TTL data before the expiration date. In that case the piece will be moved to the trash by GC, but the TTL database will not be updated (and we likely will not change that, otherwise we would get a "database is locked" error, because the database could already be opened by the TTL collector; SQLite does not allow simultaneous access from different processes). So when the record in the TTL database passes its expiration date, the TTL collector will try to delete the piece directly and then remove the record from the database. The WARN is not considered an error, so the record is removed in that case too (to avoid orphaned records).
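To make that behaviour concrete, here is a minimal sketch (a simplified model with made-up names, not the real collector code or database schema): the expired piece is deleted, a missing file only produces a WARN, and the expiration record is dropped in both cases.

```go
// Simplified model of the TTL collector pass: the "database" is just a map
// of pieceID -> expiration time. Illustrative only.
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

type expirations map[string]time.Time

func collectExpired(db expirations, blobsDir string, now time.Time) {
	for pieceID, expiresAt := range db {
		if expiresAt.After(now) {
			continue // not expired yet
		}
		err := os.Remove(filepath.Join(blobsDir, pieceID))
		if errors.Is(err, fs.ErrNotExist) {
			// Piece already gone (e.g. moved to trash by GC because the
			// customer deleted it early). WARN only, then fall through so
			// the record is still removed.
			fmt.Printf("WARN: file not found for expired piece %s\n", pieceID)
		} else if err != nil {
			fmt.Printf("ERROR: could not delete piece %s: %v\n", pieceID, err)
			continue // keep the record and retry on the next run
		}
		delete(db, pieceID) // drop the record in both the deleted and WARN cases
	}
}

func main() {
	db := expirations{"already-gone-piece": time.Now().Add(-time.Hour)}
	collectExpired(db, os.TempDir(), time.Now())
	fmt.Println("records left:", len(db)) // prints 0: no orphaned record remains
}
```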
This can happen when data has expired and was cleaned up locally after the last report from the satellite came in. Satellites only report about once a day, so that stat might be a little behind, causing this small mismatch. As long as the number is relatively low (positive or negative) you shouldn’t worry about it.
My nodes regained a lot of space from garbage as well. Garbage collection has finished on all of them, and I still have about 1 TB of uncollected garbage at the moment on one of my bigger nodes. Since the bigger node had 10 TB of trash added, this seems to be in line with the ~90% accuracy rate of the bloom filters. Is anyone else seeing a 90% garbage collection rate? I am wondering if this is the bloom filter missing some of the data or if the garbage is from something else prior to the TTL data; the node is over 2 years old. Now that this major issue has been resolved to a usable level, I am planning on continuing to grow my nodes as they fill. Thanks, devs, for fixing this major issue.
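As a quick sanity check on those numbers (assuming roughly 10 TB collected and about 1 TB still uncollected): the pass caught 10 / (10 + 1) ≈ 91% of the garbage, which would be consistent with Bloom filters built for roughly a 10% false-positive rate, where about one in ten garbage pieces can survive a single GC pass.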
How did your GC already finish? My 14TB node with an expected 10TB uncollected trash is at folder f6 (192 folders out of 1024) right now.
At that speed the trash cleanup will start before the GC finishes.
So far I have 55 TB cleaned out of 370 TB.