When will "Uncollected Garbage" be deleted?

The probability of false positives will change.

Try out this calculator:

Use 10M / 4 MB / 1 hash, then change the size to 2 MB. For me it shows that the false-positive rate increases from 26% to 46%.

It will still match all of the 10M (good fish), but you will catch more bad fish.
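Those figures line up with the usual Bloom filter approximation p = (1 - e^(-k*n/m))^k. A minimal sketch of that check (assuming the calculator treats the size as megabytes):

    package main

    import (
        "fmt"
        "math"
    )

    // falsePositiveRate is the standard Bloom filter approximation
    // p = (1 - e^(-k*n/m))^k for n items, m bits and k hash functions.
    func falsePositiveRate(n, mBits, k float64) float64 {
        return math.Pow(1-math.Exp(-k*n/mBits), k)
    }

    func main() {
        const n = 10e6 // 10M pieces
        for _, sizeMB := range []float64{4, 2} {
            p := falsePositiveRate(n, sizeMB*8e6, 1) // assuming MB = 10^6 bytes
            fmt.Printf("%.0f MB, 1 hash: ~%.1f%% false positives\n", sizeMB, p*100)
        }
    }

That prints roughly 26.8% for 4 MB and 46.5% for 2 MB, close to what the calculator shows.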

Let me try another inaccurate metaphor: :wink:

You would like to send a vase to me. You need a box, the vase, and some packing peanuts to fill the gaps…

You need a box which is big enough, but not too big (because that would be too expensive); it should have enough space for the packing material around the vase.

A Bloom filter should be big enough to include the (hashed bits of the) items, plus some additional space for zeros to decrease false-positive matches.

With a smaller box but the same vase, you will have less space for packing peanuts… which increases the false-positive rate (or the chance of a broken vase ;-) )

2 Likes

In my slc blobs folder I still have over 50% of files older than 30 days. Shouldn’t TTL or BF have deleted those files?

Depends on the files. There can be old, immortal data, but new test data is uploaded with a TTL.

TL;DR: wait for the next BF run(s), which will be way better. Check it on Monday.

1 Like

Yeah, I had a brain fart with my previous post. Somehow I thought there would be a minimum size required to match all pieces. But it would eventually just degrade into 100% false positives.

The calculator is useful, thanks for that link. But the vase analogy didn’t really work for me. Regardless, I get it now. Though you didn’t mention whether my assumption here was correct.

Nor this question

I’m thinking yes to both, based on what you said and the calculator. But just wanted to check.

1 Like

I would add: please expect a large amount of deletions, I would say about 50% or more.

4 Likes

I got a BF today, but the result is disappointing. Or will there be another BF on Monday?

2024-08-24T13:03:54Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-08-17T17:59:59Z", "Filter Size": 35000003, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-24T16:01:30Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 81848, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 37422647, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "2h57m36.019750481s", "Retain Status": "enabled"}

If the reports from SL are correct, then I have 0.73TB paid data from this satellite. The used-space-filewalker from yesterday reported 11TB data in that folder.

2024-08-23T12:18:36Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "Total Pieces Size": 11064824212224, "Total Pieces Content Size": 11027026300672}

I wish you were right.

2 Likes

Does that mean the bloom filter was 35 MB? Is that the max size currently allowed?

A 35MB BF for 0.73TB of paid data? Is that reasonably possible? Seems strange.

That’s a lot of pieces. Hopefully the 0.73TB satellite report is just temporarily broken. Are you looking at the report for a single day or the “Average” displayed at the top of the graph?

So it sent 0.2% of your pieces to the trash? That doesn’t seem like very much, but I guess if the TTL collector is deleting files correctly there shouldn’t be much remaining for the BF to delete, right?

I don’t currently run any SLC nodes. Is all SLC test data set to a TTL of 30 days? Just out of curiosity, have you looked inside one of your blobs/xx folders for SLC to get an idea of how old your pieces are? Perhaps you could check whether most of the files in one of those folders are less than 40 days old?
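If you want to check, a quick count by modification time over one of those folders would do it. A minimal sketch (the path is a placeholder for one two-letter subfolder under the SLC satellite’s blobs directory; the 40-day cutoff matches the question above):

    package main

    import (
        "fmt"
        "io/fs"
        "log"
        "path/filepath"
        "time"
    )

    func main() {
        // Placeholder: point this at one two-letter subfolder of the SLC
        // satellite's blobs directory on your node.
        dir := "config/storage/blobs/<slc-satellite-folder>/aa"
        cutoff := time.Now().Add(-40 * 24 * time.Hour)

        var older, newer int
        err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
            if err != nil || d.IsDir() {
                return err
            }
            info, err := d.Info()
            if err != nil {
                return err
            }
            if info.ModTime().Before(cutoff) {
                older++
            } else {
                newer++
            }
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("pieces older than 40 days: %d, newer: %d\n", older, newer)
    }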

Because it includes the TTL pieces which have not been cleared on the satellite’s side yet.
It should be improved with the latest change, but I would expect the effect only next week.
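For scale: the retain log above shows a “Pieces count” of about 37.4 million for this satellite, and the filter has to cover every piece the satellite still tracks, including the TTL pieces that were not cleared yet. The exact parameters used for generation aren’t shown here, but even the textbook sizing formula m = -n*ln(p)/(ln 2)^2 puts a filter for that many pieces in the tens of megabytes. A rough sketch with example false-positive targets:

    package main

    import (
        "fmt"
        "math"
    )

    // optimalBits is the textbook Bloom filter size in bits for n items at a
    // target false-positive rate p: m = -n*ln(p)/(ln 2)^2.
    func optimalBits(n, p float64) float64 {
        return -n * math.Log(p) / (math.Ln2 * math.Ln2)
    }

    func main() {
        const n = 37422647 // "Pieces count" from the retain log above
        for _, p := range []float64{0.10, 0.05, 0.02} {
            fmt.Printf("p=%.2f -> ~%.0f MB\n", p, optimalBits(n, p)/8/1e6)
        }
    }

That prints about 22 MB, 29 MB and 38 MB, so a ~35 MB filter is in line with the ~37 million tracked pieces rather than with the 0.73 TB of paid data.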

Likely yes, if it differs significantly from a previous report (you should pay attention to the daily reports, not the average used space).

Yes, it’s expected behavior. However, since not all TTL pieces were registered in the TTL database, these orphaned pieces should still be considered garbage and excluded from the BF regardless of whether they were deleted from the satellite’s databases or not, because they are already expired and should be removed anyway.
After the latest change

the GC should remove these orphaned TTL pieces too.
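Conceptually the change amounts to something like the sketch below: skip already-expired segments when building the filter, so retain on the node treats those pieces as garbage even if its local TTL database never recorded them. (Illustration only; the names and types are made up and this is not the actual satellite code.)

    package sketch

    import "time"

    // Illustrative types only; not the real satellite code.
    type segment struct {
        pieceID   string
        expiresAt *time.Time // nil for non-TTL ("immortal") data
    }

    type bloomFilter interface{ Add(pieceID string) }

    // buildRetainFilter adds only non-expired segments to the filter, so the
    // node's retain run moves already-expired pieces to the trash even when its
    // local TTL database never registered them.
    func buildRetainFilter(f bloomFilter, segments []segment, now time.Time) {
        for _, seg := range segments {
            if seg.expiresAt != nil && seg.expiresAt.Before(now) {
                continue // expired TTL data is deliberately left out of the filter
            }
            f.Add(seg.pieceID)
        }
    }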

Interesting. I wonder how long the delay has been lately for the satellite to clear TTL pieces on its side.

I do not have access there, but I believe it struggles with deleting millions of segments; otherwise the GC would have collected much more of the expired data that was not collected by the TTL collector.

I got a TB of uncollected garbage “collected” last night

1 Like

Same here. A massive amount vanished. It looks like it has been moved to the trash.

1 Like

I can confirm! I received a BF from SLC on one of my nodes, which added more than 400 GB to the trash.

But here’s the thing: this node originally had 2 TB of uncollected garbage, so a bit less than 25% has been trashed so far. :sweat_smile:

@Alexey and/or @elek, do we expect the next SLC BF to take out more of the garbage, or are the BF/GC issues still playing hard to get?

Fingers crossed we get this cleaned up soon!

I think it should capture much more garbage than before.

It did make a big jump, going from 60 GB to 400 GB—definitely progress! But when you stack that against the 2 TB of uncollected garbage, it still feels like there’s a bit of catching up to do. Do we expect the upcoming BFs to be even more efficient at taking out the trash, or was this the expected big cleanup BF already?

It’s both. We implemented a change which allows us to enable a feature flag to not add expired TTL data to the BF, so it would be moved to the trash on the nodes, and the BF should become more efficient (we are always working on improvements in this area).
The latter means that these sporadically performed audits from the trash don’t happen often enough to assume there is a bug somewhere. However, I would prefer to be more careful there, as some SNOs empty the trash from time to time.

1 Like

Just did a deep dive into the logs for this node, and a few things jumped out at me:

  1. There are lots of retain errors popping up over the last few days. Plenty of warnings about files not being found; sounds like a classic case of ‘ghost pieces’ hanging around.
2024-08-22T08:54:29Z    WARN    retain  failed to trash piece   {"Process": "storagenode", "cachePath": "config/retain", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Piece ID": "6YPK6D4E4ZBQIJXNP6CZBFISX7IRHBVJTGP43NBCBZXXRNHHMQPA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:124\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:340\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).TrashWithStorageFormat:406\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:422\n\tstorj.io/storj/storagenode/retain.(*Service).trash:428\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces.func1:387\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*TrashHandler).processTrashPiece:112\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*TrashHandler).writeLine:99\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*TrashHandler).Write:78\n\tio.copyBuffer:431\n\tio.Copy:388\n\tos.genericWriteTo:269\n\tos.(*File).WriteTo:247\n\tio.copyBuffer:411\n\tio.Copy:388\n\tos/exec.(*Cmd).writerDescriptor.func1:578\n\tos/exec.(*Cmd).Start.func2:728"}
  2. I also spotted something quirky: the ‘Failed to delete’ messages are showing a negative number. I’m guessing that means it tried to delete a piece that didn’t even exist? How could this be the case?
2024-08-25T00:52:44Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 23678, "Failed to delete": -23678, "Pieces failed to read": 0, "Pieces count": 5633273, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "17m24.819728138s", "Retain Status": "enabled"}

  3. I could not find a “Moved pieces to trash during retain” log event for SLC yet, while the process started a while ago. Does that mean it is still running? Fingers crossed more data will be trashed!
2024-08-25T03:38:27Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-08-19T17:23:28Z", "Filter Size": 9708635, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}

This means that the TTL collector was faster :slight_smile:

Perhaps a bug. It also usually comes with this WARN message:

It should, unless it’s failed somewhere.

1 Like

Thanks Alexey for the information!

I did a quick search for ERROR in my logs and didn’t find anything GC-related, so I’m cautiously optimistic that it’s still chugging along. :crossed_fingers:

1 Like

You can be sure if you request /mon/ps on the debug port.
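For example, something along these lines (illustration only; the debug address depends on your configuration, and the node’s debug server has to be enabled and reachable):

    package main

    import (
        "fmt"
        "io"
        "log"
        "net/http"
        "strings"
    )

    func main() {
        // Adjust to wherever your storagenode's debug server listens.
        const debugAddr = "127.0.0.1:5999"

        resp, err := http.Get("http://" + debugAddr + "/mon/ps")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            log.Fatal(err)
        }

        // /mon/ps lists the currently running spans; an in-progress GC run
        // should show up as lines mentioning "retain".
        for _, line := range strings.Split(string(body), "\n") {
            if strings.Contains(strings.ToLower(line), "retain") {
                fmt.Println(line)
            }
        }
    }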

1 Like