I have a full 14 TB node with 1.87 TB of trash and 7.82 TB of “uncollected garbage” according to the earnings script. Most of the data is from Saltlake, and I have the feeling it is mostly expired TTL data that doesn’t get deleted.
I got a bloom filter from that satellite today, but it only moved about 300 GB to trash. I thought a bloom filter would only leave about 10% behind? Something is seriously wrong here.
I have no “malformed database” errors and only 2 “bandwidthdb: database is locked” messages in my log this month.
A couple of notes. Uncollected garbage is calculated from the last reported disk usage from the satellites and the local blobs storage as counted by the node. Both of those numbers can be off for several reasons. Satellites sometimes report incomplete data when their calculation is running behind, and there are still unsolved issues with the local storage calculation because of long-running used-space file walkers; that last issue is expected to be resolved in v1.109. Furthermore, TTL deletes can start running behind, leading to a backlog of old pieces waiting to be deleted.
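To be concrete about what that number means, here is a minimal sketch in Python (not the actual script code, just how I understand the figure to be derived); if either input is stale, the result is off by the same amount:

```python
# Minimal sketch, not the earnings script itself: "uncollected garbage" is
# roughly the space the node counts in its blobs folder minus whatever the
# satellites last reported as stored. Trash lives in a separate folder, so it
# is part of neither number here.

def uncollected_garbage(local_blobs_bytes: int, satellite_reported_bytes: int) -> int:
    """Estimate uncollected garbage; clamped at zero because either input can lag."""
    return max(local_blobs_bytes - satellite_reported_bytes, 0)

# Made-up round numbers for illustration only:
TB = 10**12
print(uncollected_garbage(14 * TB, 5 * TB) / TB)  # -> 9.0 (TB)
```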
To be clear, I think all of this still contributes to an unsatisfactory experience for node operators at the moment, though things are getting better with every new version. Clearly the heavy load plus short TTL tests were a bit of a shock to the system and the kinks are still being worked out.
How is this backlog handled after a node restart? It seems there is no backlog on my node anymore after I restarted it several times this month. And shouldn’t the bloom filter have taken care of this today regardless?
If a node has files recorded in its local TTL database… perhaps that protects them from any action from bloom filters? I mean even if the node should have deleted them already… but didn’t… perhaps those files are still bloom-filter-proof?
If so, we need the node to either “catch up”… or delete those expired TTL DB entries so the filters can do their job?
(I don’t know how it works either: just talking out loud…)
The piece_expiration.db only has entries from today onwards. It seems older dates get deleted even though the pieces themselves didn’t get deleted. In my opinion the entries should stay in the database until the pieces are actually gone. This all seems to be a real mess right now.
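For reference, this is roughly how I checked the date range (it assumes the piece_expirations table layout with a piece_expiration timestamp column, and it should be run against a copy of the database, not the live file of a running node):

```python
# Count TTL database entries per expiration day.
import sqlite3

con = sqlite3.connect("piece_expiration.db")  # use a copy of the DB
rows = con.execute(
    "SELECT date(piece_expiration) AS day, count(*) "
    "FROM piece_expirations GROUP BY day ORDER BY day"
).fetchall()
con.close()

for day, count in rows:
    print(day, count)
```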
TTL removal should just resume after a restart as far as I know. From what I’ve seen, bloom filters are still based on week-old snapshots, meaning they won’t touch data deleted or expired in the past week. With the 30-day TTL, that means about a quarter of your data from Saltlake isn’t even being considered for deletion yet. The filter might also still be too small; you could check what size bloom filter you received.
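For what it’s worth, the “quarter” is just this back-of-the-envelope calculation, assuming uploads were roughly uniform over the TTL window:

```python
# With a 30-day TTL and a bloom filter built from a 7-day-old snapshot,
# anything that expired within the last 7 days cannot be in that filter yet.
snapshot_age_days = 7
ttl_days = 30
print(f"{snapshot_age_days / ttl_days:.0%} of TTL data not yet considered")  # 23%
```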
Ah, that would make sense. The bloom filter from today is from the 2024-07-22 snapshot.
But it seems that virtually no TTL data actually got deleted. The node had ~13 TB on 10/07 and is reported at ~5 TB now. That’s exactly the 8 TB which should have been deleted but is still here as “uncollected garbage”.
TTL data is not protected from GC; moreover, GC can often move TTL data to the trash before the expiration time, provided the customers deleted that data before it expired.
Your TTL DB has entries covering the entire 30-day range. That looks good so far. The files that should expire in the next 2 days were uploaded 28-30 days ago. All fine.
I love this. Now I am the owner of the software. I mean, it is open source and you could fork it at any time, so that statement doesn’t make too much sense, but for the record I guess I can now call myself the owner of this project.
Why do you think that you have expired TTL data?
Only based on the script’s output? That’s not necessarily expired data. According to the records in your TTL database, all expired data except today’s has already been deleted.
But your node may still contain uncollected garbage.
The satellites report that I should have ~4.75 TB, yet there is ~14 TB in the entire blobs folder. This node was on a half-filled 6 TB disk when I moved it to a 16 TB disk, just before the Saltlake tests started. It filled up within 2 weeks of the tests, so it’s safe to assume that most of the data is from the Saltlake tests.
Wasn’t only TTL data uploaded during these tests? That’s how I come to the conclusion that most of the uncollected garbage consists of expired TTL pieces.
I mean, technically expired data is also uncollected garbage. It’s a shame the piece_expiration.db doesn’t list the size of pieces. If it did, I could list that separately to provide more insight.
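The sizes can be approximated by mapping piece IDs to the blob files and stat’ing them, though. A rough sketch, assuming the piece_expirations table layout and the blobs/&lt;satellite&gt;/&lt;2-char prefix&gt;/&lt;rest&gt;.sj1 naming with lowercase unpadded base32 (both assumptions on my part, so treat it as a starting point, not a verified tool), which would only show something on nodes where expired entries are still in the DB:

```python
# Sum the on-disk size of pieces whose TTL has already passed but whose blob
# file still exists. Run against a copy of piece_expiration.db.
import base64
import os
import sqlite3

BLOBS = "/storage/blobs"  # adjust to your node's storage location

def b32(raw: bytes) -> str:
    # Assumed naming convention: lowercase base32 without padding.
    return base64.b32encode(raw).decode().rstrip("=").lower()

con = sqlite3.connect("piece_expiration.db")
expired_bytes = 0
for sat_id, piece_id in con.execute(
    "SELECT satellite_id, piece_id FROM piece_expirations "
    "WHERE piece_expiration < datetime('now')"
):
    name = b32(piece_id)
    path = os.path.join(BLOBS, b32(sat_id), name[:2], name[2:] + ".sj1")
    if os.path.exists(path):
        expired_bytes += os.path.getsize(path)
con.close()

print(f"expired but still on disk: {expired_bytes / 1e12:.2f} TB")
```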
However, I think it’s quite unlikely for such amounts of uncollected garbage to not be TTL data from the tests at this moment.
I wonder if there are scenarios where the TTL metadata isn’t there but the pieces are. Like perhaps (incomplete) transfers that do get committed to disk but never add the TTL record under certain circumstances. Or something in the TTL cleanup process that removes the TTL metadata without removing the files.
It might just be that TTL cleanup takes a long time, but based on my own nodes and user reports, I suspect something more is going on. Especially since @donald.m.motsinger reports that there are no old TTL records, while still having huge amounts of uncollected garbage that could only have been generated by these recent tests.
GC with insufficiently large BFs then only compounds the problem if that backup mechanism can’t efficiently clean up the remaining pieces.