When will "Uncollected Garbage" be deleted?

Here are my numbers again

df --si /mnt/storagenode4
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/storagenode4-node   16T   16T  139G 100% /mnt/storagenode4

Last night I got another bloom filter from Saltlake and it removed a tiny amount again, even fewer pieces than last time:

2024-07-28T09:30:36Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-07-22T17:59:59Z", "Filter Size": 25000003, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-28T16:43:51Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 1236917, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 79587449, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "7h13m15.262952328s", "Retain Status": "enabled"}
2024-07-31T22:09:20Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-07-24T17:59:59Z", "Filter Size": 25000003, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-01T03:50:33Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 789332, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 78350518, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "5h41m13.2117861s", "Retain Status": "enabled"}

At this rate it will take months to clean it up.
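
Here is the rough back-of-the-envelope calculation behind that estimate (the run cadence and the total amount of uncollected garbage are assumptions on my part; the piece counts come from the log lines above):

# Rough estimate of the GC pace, assuming ~2 retain runs per week and
# roughly 10 TB of uncollected garbage; piece numbers are from the logs above.
used_bytes = 16e12
pieces_on_disk = 79_587_449
avg_piece_size = used_bytes / pieces_on_disk            # ~200 KB per piece

deleted_per_run = (1_236_917 + 789_332) / 2             # ~1M pieces per bloom filter
runs_per_week = 2                                       # assumed cadence
cleaned_tb_per_week = deleted_per_run * runs_per_week * avg_piece_size / 1e12

garbage_tb = 10                                         # assumed uncollected garbage
print(f"~{cleaned_tb_per_week:.1f} TB/week -> ~{garbage_tb / cleaned_tb_per_week:.0f} weeks to clean up")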

2 Likes

Thanks for sharing! I noticed the same with a recent BF from SL, which only marked 25.4 GB as trash. It seems we’re both seeing lower-than-expected results. I’ve added a graph showing that the total amount of collected trash for SL is relatively small compared to the TBs of data it should be catching.

Hopefully, this helps the team pinpoint the problem.

1 Like

If you use Prometheus, you can do something like this:

rate(download_success_count{field="value",action="GET"}[30m]) / rate(download_started_count{field="value", action="GET"}[30m])
1 Like

We are constantly increasing the BF size to make it more effective, and we have increased the frequency of BF generation.

With a high number of pieces, the BF may be less effective (more than a 10% false positive rate), but I would still expect a high number of pieces to be deleted… (and more frequent BFs should help a lot).
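
As a rough illustration of why (this is the generic bloom-filter formula with an assumed number of hash functions, not the exact parameters of our implementation):

import math

# Generic bloom-filter false-positive estimate: p ≈ (1 - e^(-k*n/m))^k.
# Assumes the logged "Filter Size" of 25,000,003 is in bytes and that k = 2
# hash functions are used; the real implementation may differ.
m = 25_000_003 * 8        # filter size in bits
n = 79_587_449            # pieces held for this satellite (from the log above)
k = 2                     # assumed number of hash functions

fpr = (1 - math.exp(-k * n / m)) ** k
print(f"estimated false positive rate: {fpr:.0%}")     # roughly 30%

Every false positive is a garbage piece the filter wrongly tells the node to keep, which is why a single pass leaves data behind and why bigger and more frequent BFs help.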

If you think the BF doesn’t delete enough data, please fill out this form:

(It’s really hard to debug anything without the NodeID + number of piece files in the blobs directory)
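
If it helps anyone to gather that second number, something like this should work (the storage path is only an example, and walking a large blobs directory can take a long time):

import os
from collections import Counter

# Count piece files per satellite folder under blobs/.
# The path below is an example; point it at your node's storage directory.
blobs = "/mnt/storagenode4/storage/blobs"

counts = Counter()
for sat_folder in os.listdir(blobs):
    for _, _, files in os.walk(os.path.join(blobs, sat_folder)):
        counts[sat_folder] += len(files)

for sat_folder, count in sorted(counts.items()):
    print(sat_folder, count)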

7 Likes

Moreover, it could soon endanger the production data and its capacity/performance. I’ve called it the mosh pit within the mosh pit in another thread, following the various bugs, lol. However, on the majority of my nodes I see some nice daily cycles of deletions and replacement egress, so that’s a good thing.

2 cents

I don’t use Prometheus, and as I wrote in another post, the data doesn’t come from cancelled uploads that were stored regardless. The satellite itself reported until mid-July that I had 11TB of data, i.e. successful uploads.

1 Like

I just saw something that may be relevant to this topic. I’ve been seeing tons of “file does not exist” errors during TTL cleanup. I always figured those might be on slow nodes where GC got to the pieces before TTL could. However, just now I saw it run on one of my spare nodes, which is tiny and has been full for ages. TTL collection was done within seconds, but still more than half of the TTL deletes gave a “file does not exist” error. The node is also on an SSD (I didn’t have another purpose for it, I know it’s overkill), and all scores are at 100%.

So in short, it can’t be because TTL cleanup is taking a long time. Bloom filters are usually quite a few days old when they arrive, so there is no way they cleaned up the data before TTL had a chance. Is it possible there are scenarios where the TTL records don’t get deleted from the db? Edit: nope, no old records in the db. Perhaps the data was deleted or overwritten before its TTL expired, and those TTL records never get removed?
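
For reference, this is roughly how I checked for old records (assuming the piece_expirations table layout of piece_expiration.db, which may differ by version; the path is only an example):

import sqlite3

# Count TTL records whose expiration time is already in the past.
# Assumes the piece_expirations table of piece_expiration.db;
# copy the DB or stop the node first to avoid lock issues.
db_path = "/mnt/storagenode/config/storage/piece_expiration.db"   # example path

con = sqlite3.connect(db_path)
(expired,) = con.execute(
    "SELECT COUNT(*) FROM piece_expirations "
    "WHERE piece_expiration < datetime('now')"
).fetchone()
con.close()

print(f"expired records still in the DB: {expired}")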

It seems so:

But this would be the opposite of the problem here: TTL data deleted but still in the DB. Our problem is that the pieces are still physically on the hard disk.

2 Likes

How many pieces were deleted during the last GC run? (You can copy-paste your log line that contains the satellite ID and “Moved pieces to trash during retain”.)

I don’t have that log entry for the Saltlake satellite:

root@debian:/storagelogs/node1# grep "Moved pieces to trash during retain" node.log
2024-08-01T09:24:14Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 1284068, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 13735223, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "9h3m32.291182469s", "Retain Status": "enabled"}
2024-08-01T10:30:59Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 70930, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 1792715, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "1h6m44.235295158s", "Retain Status": "enabled"}
PS C:\Users\marvi\Desktop\pve_backup\node.log.2024-07-30> findstr "retain" .\node.log.2024-07-30
2024-07-28T22:11:40Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 415318, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 2078408, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "2h59m4.607256603s", "Retain Status": "enabled"}

These are the last entries, and the only ones.

This is the same node; I just downloaded the backed-up storage logs up to 2024-07-28T22:00:32Z.

Should I fill out the form anyway, or should I open a thread for that? I will check my other nodes as well.

The other nodes are missing satellites too:

root@storj:/storagelogs/node1# grep "Moved pieces to trash during retain" node.log
2024-07-31T23:37:26Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 32254, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 6683048, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "1h5m7.24031962s", "Retain Status": "enabled"}
2024-08-01T06:15:22Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 4216, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 3934617, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "33m19.046525471s", "Retain Status": "enabled"}
2024-08-01T06:26:58Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 10461, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 678067, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "11m35.988002329s", "Retain Status": "enabled"}


root@storj:/storagelogs/node2# grep "Moved pieces to trash during retain" node.log
2024-08-01T06:31:57Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 9175, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 946157, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "2h15m12.37053044s", "Retain Status": "enabled"}
2024-08-01T13:19:53Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 26189, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 6725174, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "6h47m55.833647456s", "Retain Status": "enabled"}


root@storj:/storagelogs/node3# grep "Moved pieces to trash during retain" node.log
2024-08-01T09:20:17Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 55017, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 10246479, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "11h47m40.70544791s", "Retain Status": "enabled"}
2024-08-01T13:56:57Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 25247, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 6750853, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "4h36m39.525395992s", "Retain Status": "enabled"}

root@storj:/storagelogs/node4# grep "Moved pieces to trash during retain" node.log
2024-08-01T03:51:10Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 28608, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 6369103, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "6h20m49.033505612s", "Retain Status": "enabled"}
2024-08-01T06:12:35Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 21279, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 1011287, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "2h11m9.80534235s", "Retain Status": "enabled"}

I bought a 16TB disk just before the tests started and was thinking about buying another one after it filled up within 2 weeks. But then I thought, “hold back and wait until one of the big customers signs.” And oh boy, was I right to hesitate. I now have a total capacity of 34TB, of which 8.4TB is real data, 2.4TB is trash and 23.44TB is uncollected garbage :dizzy_face:

7 Likes

I would double-check that number; it seems that currently only AP1 and EU1 are reporting the real data usage to the nodes.

Plus, if you are on v1.109.2, any node-reported disk usage number will be inaccurate.

1 Like

Negative, have a read here: When will "Uncollected Garbage" be deleted?

It’s all expired TTL data which didn’t get deleted, and GC only picks up a few hundred GB on each run.

I got the numbers from @BrightSilence’s earnings script.

That earnings script relies on the “disk average usage” values to compute what is “real data”, which is not being reported correctly by the satellites today.

1 Like

The earnings script is very useful, but the “uncollected garbage” label it uses confuses a lot of people. The script has no idea what that data is. Perhaps some of it is garbage, but it has no idea how much.

That’s because the script is distributed knowing that the daily node summaries from the satellites have been unreliable. Yet it continues to present the label and quantities with certainty. Add a disclaimer or something :wink:

2 Likes

I wonder how long customers could need to make a decision? I mean, it’s now more than 3 months since they announced those big deals ahead, and nothing has materialized. :roll_eyes:

I think the script ignores the dates when satellites don’t report and uses the last successful report. My dashboard says 1.14TB average this month for my 14TB node.

Yet the earnings script says 2.2TB disk average this month, which matches the graph on the 1st of August.

Take a guess what percentage of my data is not from Saltlake. It’s negligible.

My biggest node of 14TB has 11 to 12TB in Saltlake’s blobs folder. Saltlake reported 1.64TB on the 1st of August. The missing ~10TB is exactly what the earnings script reports as “uncollected garbage”.
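
The arithmetic behind that figure (the 11.5 TB is my approximate reading of the Saltlake blobs folder size, and trash for that satellite is ignored here):

# Back-of-the-envelope for what the earnings script labels "uncollected garbage".
on_disk_tb = 11.5             # approximate size of the Saltlake blobs folder
satellite_reported_tb = 1.64  # "disk average usage" reported on the 1st of August

print(f"~{on_disk_tb - satellite_reported_tb:.1f} TB unaccounted for")   # ~9.9 TB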

What is your explanation for the missing 10TB?

I expected 6 months minimum from the announcements; 3 have passed. And don’t expect miracles in the middle of vacation season.
I’m really impressed with how fast Storj is progressing with the preparations. The crucial parts work; once everything is sorted out, even the sky isn’t the limit.

Check the reports by the satellites: US1 and SL are not reporting space usage, and that’s the whole problem.

1 Like

No, it doesn’t; it uses the last report for each satellite, not the monthly average.

It automatically checks whether the last report deviates a lot from the monthly average and adds a warning below the overview when it does. This should catch low last reports. The disclaimers are there, and I have always been very clear that the script relies on the best data the node can provide. If the source data is crap, I try to warn people about that. There is not much I can do about the satellite or node issues causing the source data to be faulty, or about people ignoring the obvious warnings I’ve added.
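
As a sketch of what that check amounts to (this is just the idea, not the script’s actual code; the 50% threshold and the example numbers are made up):

# Idea behind the warning: flag the overview when the last satellite report
# deviates a lot from the monthly average. Threshold and numbers are examples.
def deviation_warning(last_report_tb: float, month_avg_tb: float,
                      threshold: float = 0.5) -> bool:
    if month_avg_tb == 0:
        return True
    return abs(last_report_tb - month_avg_tb) / month_avg_tb > threshold

print(deviation_warning(0.9, 2.2))   # True -> warning shown below the overview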

It certainly never is and never was presented with certainty.

It does, but unfortunately there are sometimes also incomplete reports from satellites. Not much I can do about that except adding the warning.

3 Likes