When will "Uncollected Garbage" be deleted?

It’s allowed to delete faster… it’s just that the default values are conservative to avoid overloading the storage nodes…

1 Like

Thank you for the research and updates, elek…

Is that controlled by these settings in config.yaml or something else?


# storage2.delete-queue-size: 10000
# storage2.delete-workers: 1

Hey there!

I’ve noticed the ongoing debate about whether this issue is real or just overblown. Some might still be brushing it off, thinking, ‘Hey, it’s still money in the bank!’ But let’s dive into what really happens when your node gets filled up with test data that doesn’t clear out:

Take a look at these two charts I whipped up—one with TTL enabled and one without. They tell a pretty interesting story. With TTL enabled, it feels like you’ve hit the jackpot early on, but as time goes by, that initial rush starts to slow down. Why? Because half of your space gets taken over by uncollected garbage, which—surprise, surprise—doesn’t pay the bills! In the long run, your earnings start to dwindle. Meanwhile, when TTL isn’t enabled, your node fills up more steadily. While the earnings might start slower, they’re more consistent over time.

So, unless you’re sitting on petabytes of storage, this test data could actually be holding you back from maximizing your earnings. For most SNOs, this means we’re missing out on storing fresh, paid data, and our payouts are shrinking as a result.

6 Likes

It’s more like:

--collector.expiration-batch-size int                      how many expired pieces to delete in one batch. If <= 0, all expired pieces will be deleted in one batch. (default 1000)

and

--collector.interval duration                              how frequently expired pieces are collected (default 1h0m0s)
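
If you prefer to pin these in config.yaml instead of passing flags, the keys should use the same dotted form as the storage2.* options above (values shown are just the defaults):

collector.interval: 1h0m0s
collector.expiration-batch-size: 1000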
2 Likes

Updates on Uncollected Garbage related to Saltlake TTL data.

  1. As I wrote earlier, it’s SLC / TTL data related, not related to the earlier problems, which seem to be fixed, or at least we have very good progress on them.
  2. It’s a combination of multiple conditions: IF the satellite doesn’t delete expired objects fast enough, AND IF the SN couldn’t delete expired objects, it can happen.

Fixes coming:

We’ll try to deploy 1, 2, 4 ASAP (no promises, as we don’t do deployments on Friday, but we’ll do what we can…)

12 Likes

Thank you @elek for the updates. I really appreciate it!

Does this mean that the current GC isn’t considering expired pieces? :thinking:

If so, it could definitely explain why some expired pieces are still lingering on disk.

Thanks for looking into this!

Is it possible to fix this in the meantime with a local script, checking for expired objects and deleting them manually?
Thanks for the effort. On my full nodes I’m going to see the paid data graph slowly decreasing…

I’m guessing it might be a bit tricky to fix this with a local script right now. Since the node no longer has information about expired pieces (they’re gone from the pieces expiration database), a script might not have the full picture to handle it.

1 Like

If I understood right, one of the problems @littleskunk’s study found is that some pieces, for some reason, don’t even get into the TTL table, so deleting them like that is a bit risky.
As I remember it was patched, but not rolled out yet; maybe in 1.112.

1 Like

As I understand the code, by default expired pieces (so pieces with TTL) are currently also collected by the GC/bloom filter.
Once the TTL delete/expiry is fixed and reliable, this change will allow the TTL pieces to be excluded from the bloom filter, since it doesn’t make sense to have two processes trying to clean up the same stuff. That gives the bloom filter a better chance to leave less garbage behind without blowing the bloom filter size up to the sky.
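
Roughly, what that exclusion means when the bloom filter is built on the satellite side, as just a sketch with made-up names (not the actual Storj code):

package sketch

import "time"

// Piece is a made-up stand-in for whatever the BF generator iterates over.
type Piece struct {
	ID        []byte
	ExpiresAt *time.Time // nil = no TTL
}

// BloomFilter is a minimal stand-in for the real filter type.
type BloomFilter interface{ Add(pieceID []byte) }

// addRetainedPieces adds pieces to the node's bloom filter, skipping pieces
// whose TTL has already passed: those are left to the TTL collector, and the
// filter has to cover fewer items.
func addRetainedPieces(bf BloomFilter, pieces []Piece, now time.Time) {
	for _, p := range pieces {
		if p.ExpiresAt != nil && p.ExpiresAt.Before(now) {
			continue // expired: the collector chore will delete it anyway
		}
		bf.Add(p.ID)
	}
}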

2 Likes

The current GC doesn’t differentiate between expired and active pieces, because another chore is supposed to delete the objects/segments earlier. Normally that’s fine, but if that other chore is not fast enough, it’s a problem. That’s why we need fixes on both sides…

5 Likes

Ah, thanks. My config.yaml is older and this line didn’t even exist.

1 Like

This is normal, because the setting has a default value (1000 in this case).
You can see all the other options with the command

storagenode setup --help
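
For example, to see just the collector options mentioned above:

storagenode setup --help | grep collector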

Update: The patch release has been deployed. The SLC GC BF generator runs in ~1.5 days, and the previous version is still running.

It will start a new cycle today; I expect the next SLC BFs (which exclude expired pieces) to be sent out by the end of this week / early next week… (barring errors, the generation is supposed to be finished by Sunday…)

In the meantime, the cleanup of the expired pieces has been scaled up…

11 Likes

It was becoming a big problem. Thanks for the effort.

4 Likes

Does this mean the BF creation process now always includes a certain amount of time (like maybe a week?) of data that should-have-been-deleted-by-TTL-but-maybe-wasn’t?

By including more data to consider… does that mean the BF has to be a bit more precise in defining what to keep… meaning the BF files may be a bit larger than they were before?

If so, it sounds like a good tradeoff. You’re trading a bit of BF file space… for more certainty that lost TTL data gets identified and deleted faster. Clever improvement!

No. It will be smaller.

A BF is a net to catch fish. It should catch all the fish we like, but we are fine with also catching ~10% of all the other fish we don’t like, no matter how many fish there are.

The size of the net is estimated from the number of good fish plus the accepted probability of catching bad fish. Neither of those changed.
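
For anyone curious about the math behind that estimate, the textbook Bloom filter sizing looks like this (not necessarily the exact code the satellite uses):

package sketch

import "math"

// optimalBits returns the classic Bloom filter size in bits for n elements
// ("good fish") at a target false-positive rate p ("bad fish" caught anyway):
//
//	m = -n * ln(p) / (ln 2)^2
//
// At p = 0.1 that is about 4.8 bits per element, so excluding expired pieces
// from n directly shrinks the filter.
func optimalBits(n int, p float64) int {
	return int(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
}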

  1. We have a patch which excludes the expired pieces. It won’t change the BF size, it just puts fewer items into the BF.
  2. We have 2 patches to scale up the deletion of expired pieces. When expired pieces are deleted, the overall piece count of the node decreases, and the optimal bloom filter gets smaller.
  3. It might be reasonable to exclude expired pieces from the calculated piece count too, which would make the BFs even smaller (a discussion just started here: https://review.dev.storj.io/c/storj/storj/+/14297, thanks to your question :wink: )
5 Likes

I’m still a little confused about what determines the final bloom filter size. If the BF needs to include fewer files, why isn’t it smaller regardless of whether those other files no longer exist on the sat or are expired?

Maybe I could make an educated guess based on the rest of your post, but please correct me if I’m wrong.
The BF creation process requires you to specify the BF size beforehand. So you calculate which size would lead to a 10% false positive rate based on the number of files. So when those expired files are part of the count, you generate a larger BF than necessary.
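To put rough numbers on that (using the textbook formula, so not necessarily Storj’s exact implementation): at a 10% false-positive target a Bloom filter needs about 4.8 bits per counted piece, so 10 million live pieces give a filter of roughly 6 MB, while counting an extra 3 million already-expired pieces would inflate it to about 7.8 MB.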

Assuming the above is correct, does that mean those BFs actually get a lower-than-10% false positive rate?

And just to satisfy my own curiosity, what happens if there are 1000 files that need to be included, but you calculate the BF size assuming 500 files and thus try to generate a BF with too low a size? Would it fail to generate? Would it suddenly have a false negative rate? (Which would be really bad)

Ps. I appreciate the progress on this and you keeping us up to date!

1 Like