When will "Uncollected Garbage" be deleted?

Your TTL DB doesn’t contain any old entries. So we know the TTL cleanup is working as expected and not falling behind.

This leaves two other options on the table: either the used-space numbers on the dashboard are off, or the TTL DB didn’t persist all the uploads in the first place.

Something I find strange is the low numbers you have in that DB. For comparison, here are the per-day counts from my node (a sketch of the query that produces this output follows the numbers):
2024-08-21|777680
2024-08-22|1622570
2024-08-23|103405
2024-08-24|93795
2024-08-25|98219
2024-08-26|104809
2024-08-27|102702
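
A query along these lines produces that output. This is only a sketch: the file name piece_expiration.db and the table/column names piece_expirations / piece_expiration are what recent storagenode versions use, so adjust the path and schema if yours differs:

sqlite3 /path/to/storage/piece_expiration.db "SELECT date(piece_expiration), count(*) FROM piece_expirations GROUP BY 1 ORDER BY 1;"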

The numbers in the left graph are what the satellite(s) report, and you keep saying that this is always right.

That would be a bug then?

Not that strange if you consider that the node was nearly full 30 days prior.

I would suggest calculating the size of the subfolder pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa in your storage/blobs folder.
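
Something along these lines should give the number (adjust the path to your data location; --si reports decimal units like the dashboard does):

du -s --si /path/to/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa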

Yes, but in the beginning the randomization of piece names was not sufficient, so part of the data was overwritten; that means it should be collected by the garbage collector.

Or your node might have had a "database is locked" issue at the time this data was uploaded. However, I would guess that this data should be collected by the GC anyway.
Do you have BFs in the retain directory in the data location?
My nodes are still processing them one by one, and there are four more waiting:

$ ls -l /mnt/x/storagenode2/retain/
total 19412
-rw-r--r-- 1 root root 6873494 Jul 28 09:30 pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1721930399960810000.pb
-rw-r--r-- 1 root root 6057592 Jul 28 04:23 ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1721843999999906000.pb
-rw-r--r-- 1 root root 6091593 Jul 30 02:06 ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1722016799998993000.pb
-rw-r--r-- 1 root root  847516 Jul 30 08:18 v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1722103199919940000.pb

12TB

zgrep "pieceexpirationdb: database is locked" /mnt/storagenode4/node/node.log-2024-06.gz |wc -l
1791

These 1791 pieces would not account for the 8 TB of garbage. No such errors this month.
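
To put a rough upper bound on it: even assuming every one of those uploads was a maximum-size piece of about 2.3 MiB (a 64 MiB segment split into 29 erasure shares — my assumption for the worst case), 1791 pieces add up to only a few GiB:

echo "1791 * 2.3 / 1024" | bc -l    # ~4.0 GiB, nowhere near 8 TB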

No files in the retain folder. The last GC for Saltlake ran on 2024-07-28 and finished after 7 h.
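
For anyone who wants to check this on their own node: the retain (GC) runs show up in the node log, so grepping for them is the easiest way to see when the last run started and finished (exact message wording differs between versions, so treat this as a starting point):

grep -i retain /path/to/node.log | tail -n 20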

There can be more than 1791 pieces. The expiration records are cached in memory first and then flushed to the database, so the number of missing records is unknown, unfortunately.
So it seems the size of the BF is not enough for your node to collect all the garbage…

Are you sure about this? Each error is for exactly 1 piece. Or do you mean that there could be more errors which did not get written to the logs?

2024-06-09T07:27:25Z    ERROR   piecestore      upload failed   {"Process": "storagenode", "Piece ID": "SBA2IZMCY57MM3KSR4OF6VBHAMF4BHMSHQIZ7JSXXRCH4J5N4NDA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "109.61.92.82:54458", "Size": 3328, "error": "pieceexpirationdb: database is locked", "errorVerbose": "pieceexpirationdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).SetExpiration:111\n\tstorj.io/storj/storagenode/pieces.(*Store).SetExpiration:584\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:486\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:519\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:167\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:109\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:157\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}

The main question remains: why did this huge amount of TTL data not get deleted upon expiration?

Are you sure the report is valid? Do you really have the files? If the database includes space usage for a deprecated satellite, the tool may not give reliable numbers…

Yes, du -hd 1 --si gave me 12TB for the pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa folder.

This isn’t really possible. The earnings calculator only includes satellites with activity for that month, and everything is then joined to those active satellites. Even if the data for inactive satellites is still in the DBs, it wouldn’t be included in the calculations. And if for some reason it were included, you would also see a record for that satellite in the bottom overview, which is not the case here.

Hi there,

It seems I have the exact same issue on my nodes: TTL data did not get deleted, while there are no apparent errors.

In the screenshot below, you can see that I store about 8 TB of data (right side, reported by satellites and with no gaps), while I have 15.1 TB occupied on the hard disk (left side). So it seems half of my capacity is occupied by trash.

I also wonder when trash will be released after the BF runs. It looks like the BFs from SL have not been collecting much garbage lately:
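
Trash is normally kept for about seven days after a GC run before it is removed. To see how much is sitting there and how old it is, you can look at the SL trash folder directly (path is an example; newer versions keep per-day subfolders under each satellite’s trash folder):

du -s --si /path/to/storage/trash/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/*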

Thank you.

I’m here just to inform you that I have the same issue. If you need data, please give us a simple list of commands and I will post the results.
If not… just letting you know that donald is not alone.

Can we please stay on topic and focus on how to move forward in solving this issue? It’s important to address the problem at hand and explore potential solutions. Let’s work together to find ways to improve and optimize the system. Your constructive input is appreciated!

Every file should be accounted for; my original node has had many files since the start that were last accessed in 2019. This seems to be forgotten data taking up 9 TB that I could be earning money on, but instead my disk is full.

The bloom filter tells your node what data is alive from the satellite’s point of view. Although this comes with a delay and a false-positive rate of 10%, no ‘forgotten’ data should survive forever.

My node currently looks like this.

The satellite-reported usage just dropped to ~250 GB on 29-07-2024. I still have 3.3 TB used on disk. I’ll update if that number goes back up, but 250 GB of paid data out of 3.3 TB is crazy! ZFS agrees that 3.29 TB is used in that dataset.

How exactly does the Bloom filter work? Is it sending a filter that includes all the pieces that should be stored and deleting everything that isn’t in the filter? Or does it work the other way around? I know BFs can have false positives, but not false negatives.

The Average Disk Space Used graph isn’t always reliable because some satellites can lag in reporting the used space back to the node. In my case, as you can see from the screenshot, the SL and US1 satellites haven’t reported any data in the last few days. You can verify this on your dashboard by selecting the SL and/or US1 satellites, which will likely show 0 space used.

Based on your screenshot, a more plausible situation would be that you have around 1.7 TB of stored data. This suggests that there may be approximately 1.8 TB of uncollected garbage stored, indicating a 50% loss in available resources, similar to what I’m experiencing.
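
If you want the raw numbers behind that graph instead of eyeballing the chart, the node’s dashboard API exposes the per-satellite report (assuming the default dashboard address; the exact JSON layout varies by version):

curl -s http://localhost:14002/api/sno/satellites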

This is around what I am seeing too: The value for monthly average is around half of the value that the node is showing as used space.

Same here with many nodes. Around 50% being paid data seems to be the realistic figure.

As more Storage Node Operators (SNOs) report similar issues, it’s becoming clear that there’s a challenge with cleaning up used space, possibly related to the cleanup of expired TTL data. The key question remains: When will “Uncollected Garbage” be deleted?

For example, when I look at my graphs for the SL satellite only this month, my stored data dropped from 8.3 TB to 2.4 TB, a decrease of 5.9 TB. However, this data hasn’t been removed from my hard disks — I verified this by checking the actual disk usage rather than just relying on the node’s calculations. Over the last couple of weeks, garbage collection has only cleared about 1.5 TB, leaving 4.4 TB still unaccounted for.

Looking at the Bloom Filters (BFs) for SL, the last significant cleanup event was three days ago, and only 100 GB of garbage is currently in the trash folders. I’m wondering when I can expect the remaining 4.4 TB to be cleaned up by the BFs?

Let’s continue to share observations and work together to resolve this issue. Any insights or updates from the team on this matter would be greatly appreciated!

Saltlake seems to have last reported stats on 28/7. According to your graph, there is 1.5 TB more data on your node.