I have a full 14 TB node with 1.87 TB of trash and 7.82 TB of “uncollected garbage” according to the earnings script. Most of the data is from Saltlake, and I have the feeling it is mostly expired TTL data that doesn’t get deleted.
I got a bloom filter from that satellite today, but it only moved about 300 GB to trash. I thought a bloom filter would only leave about 10% behind? Something is seriously wrong here.
I have no “malformed database” errors and only 2 “bandwidthdb: database is locked” messages in my log this month.
A couple of notes. Uncollected garbage is calculated from the last reported disk usage from the satellites and the local blobs storage as counted by the node. Both of those numbers can be off for several reasons. Satellites sometimes report incomplete data when their calculation is running behind, and there are still unsolved issues with the local storage calculation because of long-running used-space file walkers; that last issue is expected to be resolved in v1.109. Furthermore, TTL deletes can start running behind, leading to a backlog of old pieces waiting to be deleted.
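To be concrete about what that number means, here is a minimal sketch in Python (not the actual script code, just how I understand the figure to be derived); if either input is stale, the result is off by the same amount:

```python
# Minimal sketch, not the earnings script itself: "uncollected garbage" is
# roughly the space the node counts in its blobs folder minus whatever the
# satellites last reported as stored. Trash lives in a separate folder, so it
# is part of neither number here.

def uncollected_garbage(local_blobs_bytes: int, satellite_reported_bytes: int) -> int:
    """Estimate uncollected garbage; clamped at zero because either input can lag."""
    return max(local_blobs_bytes - satellite_reported_bytes, 0)

# Made-up round numbers for illustration only:
TB = 10**12
print(uncollected_garbage(14 * TB, 5 * TB) / TB)  # -> 9.0 (TB)
```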
To be clear, I think all of this still contributes to an unsatisfactory experience for node operators at the moment, though things are getting better with every new version. Clearly the heavy load plus short TTL tests were a bit of a shock to the system and the kinks are still being worked out.
How is this backlog handled after a node restart? It seems there is no backlog on my node anymore after I restarted it several times this month. And shouldn’t the bloom filter have taken care of this today regardless?
If a node has files recorded in its local TTL database… perhaps that protects them from any action from bloom filters? I mean even if the node should have deleted them already… but didn’t… perhaps those files are still bloom-filter-proof?
If so, we need the node to either “catch up”… or delete those expired TTL DB entries so the filters can do their job?
(I don’t know how it works either: just talking out loud…)
The piece_expiration.db only has entries from today onwards. It seems older dates get deleted even though the pieces themselves didn’t get deleted. In my opinion the entries should stay in the database until the pieces are actually gone. This all seems to be a real mess right now.
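For reference, this is roughly how I checked the date range (it assumes the piece_expirations table layout with a piece_expiration timestamp column, and it should be run against a copy of the database, not the live file of a running node):

```python
# Count TTL database entries per expiration day.
import sqlite3

con = sqlite3.connect("piece_expiration.db")  # use a copy of the DB
rows = con.execute(
    "SELECT date(piece_expiration) AS day, count(*) "
    "FROM piece_expirations GROUP BY day ORDER BY day"
).fetchall()
con.close()

for day, count in rows:
    print(day, count)
```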
TTL removal should just resume after a restart as far as I know. From what I’ve seen, bloom filters are still based on week-old snapshots, meaning they won’t touch data deleted or expired in the past week. With the 30-day TTL, that means about a quarter of your data from Saltlake isn’t even being considered for deletion yet. The filter might also still be too small; you could check what size bloom filter you received.
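For what it’s worth, the “quarter” is just this back-of-the-envelope calculation, assuming uploads were roughly uniform over the TTL window:

```python
# With a 30-day TTL and a bloom filter built from a 7-day-old snapshot,
# anything that expired within the last 7 days cannot be in that filter yet.
snapshot_age_days = 7
ttl_days = 30
print(f"{snapshot_age_days / ttl_days:.0%} of TTL data not yet considered")  # 23%
```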
Ah, that would make sense. The bloom filter from today is from the 2024-07-22 snapshot.
But it seems that virtually no TTL data actually got deleted. The node had ~13 TB on 10/07 and is reported at ~5 TB now. That’s exactly the 8 TB which should have been deleted but is still here as “uncollected garbage”.
TTL data is not protected from GC; moreover, GC can often move TTL data to the trash before the expiration time, provided the customers deleted that data before it expired.
Your TTL DB has entries covering the entire 30-day range. That looks good so far. The files that should expire in the next 2 days were uploaded 28-30 days ago. All fine.
I love this. Now I am the owner of the software. I mean, it is open source and you could fork it at any time, so that statement doesn’t make too much sense, but for the record I guess I can now call myself the owner of this project.
Why do you think that you have expired TTL data?
Only based on the script’s output? That’s not necessarily expired data. According to the records in your TTL database, all expired data except today’s has already been deleted.
But your node may still contain uncollected garbage.
The satellites report that I should have ~4.75 TB, yet there is ~14 TB in the entire blobs folder. This node was on a half-filled 6 TB disk when I moved it to a 16 TB disk, just before the Saltlake tests started. It filled up within 2 weeks of the tests, so it’s safe to assume that most of the data is from the Saltlake tests.
Wasn’t only TTL data uploaded during these tests? That’s how I come to the conclusion that most of the uncollected garbage consists of expired TTL pieces.
I mean, technically expired data is also uncollected garbage. It’s a shame the piece_expiration.db doesn’t list the size of pieces. If it did, I could list that separately to provide more insight.
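The sizes can be approximated by mapping piece IDs to the blob files and stat’ing them, though. A rough sketch, assuming the piece_expirations table layout and the blobs/&lt;satellite&gt;/&lt;2-char prefix&gt;/&lt;rest&gt;.sj1 naming with lowercase unpadded base32 (both assumptions on my part, so treat it as a starting point, not a verified tool), which would only show something on nodes where expired entries are still in the DB:

```python
# Sum the on-disk size of pieces whose TTL has already passed but whose blob
# file still exists. Run against a copy of piece_expiration.db.
import base64
import os
import sqlite3

BLOBS = "/storage/blobs"  # adjust to your node's storage location

def b32(raw: bytes) -> str:
    # Assumed naming convention: lowercase base32 without padding.
    return base64.b32encode(raw).decode().rstrip("=").lower()

con = sqlite3.connect("piece_expiration.db")
expired_bytes = 0
for sat_id, piece_id in con.execute(
    "SELECT satellite_id, piece_id FROM piece_expirations "
    "WHERE piece_expiration < datetime('now')"
):
    name = b32(piece_id)
    path = os.path.join(BLOBS, b32(sat_id), name[:2], name[2:] + ".sj1")
    if os.path.exists(path):
        expired_bytes += os.path.getsize(path)
con.close()

print(f"expired but still on disk: {expired_bytes / 1e12:.2f} TB")
```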
However, I think it’s quite unlikely for such amounts of uncollected garbage to not be TTL data from the tests at this moment.
I wonder if there are scenarios where the TTL metadata isn’t there but the pieces are. Like perhaps (incomplete) transfers that do get committed to disk but never add the TTL record under certain circumstances. Or something in the TTL cleanup process that removes the TTL metadata without removing the files.
It might just be that TTL cleanup takes a long time, but based on my own nodes and user reports, I suspect something more is going on. Especially since @donald.m.motsinger reports that there are no old TTL records, while still having huge amounts of uncollected garbage that could only have been generated by these recent tests.
GC with insufficiently large BFs then only compounds the problem if that backup mechanism can’t efficiently clean up the remaining pieces.