When will "Uncollected Garbage" be deleted?

TTL data shouldn’t need garbage collection in the first place. You are hunting the wrong witch.

On my nodes the payout was OK. Not great, but good enough. I had some extensive downtime over my vacation that I couldn't fix remotely. Now the nodes are back up and can recover from the downtime. It should get better next month. With the downtime in mind I am happy with my payout. It could have been worse for sure, so I'll take it.

On my nodes the TTL cleanup seems to work. Also no issues with garbage collection. Everything is within my expectations. The storagenode dashboard shows incorrect numbers, but those are all known issues.

If the tally runs longer than 24h, it may send a partial report (it will be finished later and will be correct, but it will likely not be sent to the nodes because it's out of the time range). So this partial report is correct only for that partial time range, not for the whole day.

This is a good question. It's because they are calculated based on a different metric - usage from the orders, not the average usage. If the local stat on your pie chart is correct, then you can estimate based on the local stat. I do not see any other way, except implementing a new average which would skip missing and partial reports (they would be noticeably smaller than the previous and the next ones).

It's used to prepare the average usage by the tally, but it cannot be calculated in time on some satellites. We need to implement a new representation of the average used space on the nodes which would skip missing or incomplete reports. Then it would be more accurate, but still only an estimation.
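Just to illustrate what such a skipping average could look like, here is a rough sketch - not how the satellite actually computes it; the 50% "partial" threshold and the input format are made up for the example:

```python
# Sketch only: averages daily used-space reports while skipping days that
# are missing or look incomplete. The 50% threshold and the input format
# are assumptions for the example, not how the satellite does it.
from statistics import mean


def average_used_space(daily_reports):
    """daily_reports: day -> reported bytes, with None for missing days."""
    values = [v for v in daily_reports.values() if v is not None]
    if not values:
        return 0
    typical = mean(values)
    # Treat a report as partial if it is far below the typical level,
    # e.g. a tally that only covered part of the day.
    complete = [v for v in values if v >= 0.5 * typical]
    return mean(complete) if complete else typical


reports = {"day1": 8.1e12, "day2": None, "day3": 3.0e12, "day4": 8.3e12}
print(average_used_space(reports))  # skips the missing day and the ~3 TB outlier
```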

There is a way: if the local stat is correct, it could be used for the estimation. However, if there is uncollected garbage or the stat is incorrect, the estimation would be incorrect too.

Yes, it calculates the usage; it has no knowledge of whether the data is paid or not.


about that form to report BF issues…

How do I find which satellite is causing the problem ?

SLC? This whole thread is about the SLC satellite, isn't it?

Here are some numbers

I have 8.8TB of files older than 30 days from Saltlake. This node had ~2TB in total before the tests started. If all data from the test has a TTL of 30 days or less, it's impossible to have 8.8TB out of 11TB older than 30 days if expired pieces get deleted upon expiration.
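For anyone who wants to reproduce that kind of number, here is a minimal sketch that sums piece files older than 30 days under one satellite's blobs folder. It assumes the file modification time roughly matches the upload time, and the Saltlake directory name is a placeholder you have to look up under blobs on your own node:

```python
# Sketch: sum the piece files older than 30 days under one satellite's blobs
# folder. Assumes the file modification time roughly equals the upload time.
import os
import time

BLOBS = "/mnt/storagenode/storage/blobs"   # adjust to your storage path
SATELLITE_DIR = "<saltlake-blob-folder>"   # placeholder: the SLC directory under blobs
CUTOFF = time.time() - 30 * 24 * 3600

total_bytes = count = 0
for root, _, files in os.walk(os.path.join(BLOBS, SATELLITE_DIR)):
    for name in files:
        st = os.stat(os.path.join(root, name))
        if st.st_mtime < CUTOFF:
            total_bytes += st.st_size
            count += 1

print(f"{count} pieces, {total_bytes / 1e12:.2f} TB older than 30 days")
```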


Right, "shouldn't". What's your explanation of the figures above?


Still the same. For some unknown reason it didn’t insert everything into the TTL DB.

We're coming closer. What could "some unknown reason" be? I have only two "database is locked" errors, within one second of each other, in July in my logs, and those were with the bandwidthdb.

Can you give me a definite answer: between which dates was 100% of Saltlake's data uploaded with a TTL of 30 days or less? Then I could "fix" this myself.

If the TTL DB doesn't have entries to tell the node which pieces to delete… then I guess we just wait for the regular garbage collection to run a couple of times a week… and that will slowly chip away at the forgotten TTL data?

Maybe not ideal… but it sounds like just waiting will fix it. As long as it’s not still happening…


I would bet it is still happening and we still need to find the reason for it.

One easy way to verify it would be to count the number of successful upload log entries and compare that with the latest TTL entries. The numbers should match.

Edit: I read somewhere that the upload pattern on SLC has changed. So if you have the logs from the last 30 days, that might work better than just the current hour.
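If somebody wants to run that comparison, here is a rough sketch, assuming the default log format with "uploaded" lines, the usual piece_expiration.db location, and a piece_expirations table. The paths and the satellite ID are placeholders to adjust:

```python
# Sketch: compare "uploaded" log lines with rows in piece_expiration.db.
# Paths, the satellite ID, and the table/column names are assumptions;
# verify them against your own node before trusting the numbers.
import sqlite3

LOG = "/mnt/storagenode/node.log"                    # adjust
DB = "/mnt/storagenode/storage/piece_expiration.db"  # adjust
SLC_ID = "<saltlake satellite ID>"                   # copy it from your dashboard

# Uploads for SLC that made it into the log (log should cover ~30 days).
uploads = 0
with open(LOG, errors="replace") as f:
    for line in f:
        if "uploaded" in line and SLC_ID in line:
            uploads += 1

# TTL entries that have not expired yet; with a <=30 day TTL these should
# roughly correspond to the uploads of the last 30 days (and ~99% is SLC).
with sqlite3.connect(DB) as db:
    (ttl_rows,) = db.execute(
        "SELECT count(*) FROM piece_expirations "
        "WHERE piece_expiration >= datetime('now')"   # timestamp format may differ
    ).fetchone()

print(f"uploads in log: {uploads}, pending TTL entries: {ttl_rows}")
```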


So much for the theory. The last GC moved <150GB to trash, and the GC before that even less.

I have set the loglevel for piecestore to FATAL, so I don’t have logs. I could change that, but this node is full (with garbage) and won’t accept new uploads.

If you could answer that question about the dates, I could make some space.

But you said something changed with the upload pattern on SLC. What else can I do?

At the same time the lost data was piling up, people were reporting floods of "unable to delete piece" log entries. (And from the stack trace in that top example I see "*(pieceExpirationDB).GetExpired:71\n\tstorj.io/storj/storagenode/pieces", which sounds like it could be the TTL cleanup logic.)

If anything was wrong with the TTL DB entries (like a path/piece being off by one character or something)… could the cleanup logic fail to delete the piece… but delete the DB entry anyway… and thus forget about the TTL piece in the blobs folder?
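To make that concern concrete, here is a toy illustration (not Storj's actual code) of the two possible orderings; only the second one keeps the DB row around when the file delete fails, so the piece stays visible to the TTL cleanup:

```python
# Toy illustration of the ordering concern only, not Storj's implementation.
import os
import sqlite3


def cleanup_expired_risky(db: sqlite3.Connection, piece_path: str, piece_id: bytes):
    # Risky ordering: the DB row is gone even if the file delete fails,
    # so the piece is "forgotten" and only a later GC run can pick it up.
    db.execute("DELETE FROM piece_expirations WHERE piece_id = ?", (piece_id,))
    db.commit()
    os.remove(piece_path)  # may raise, but the DB entry is already gone


def cleanup_expired_safe(db: sqlite3.Connection, piece_path: str, piece_id: bytes):
    # Safer ordering: only drop the DB row once the file is actually gone.
    try:
        os.remove(piece_path)
    except FileNotFoundError:
        pass  # already deleted; still fine to drop the row
    db.execute("DELETE FROM piece_expirations WHERE piece_id = ?", (piece_id,))
    db.commit()
```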

Even if it’s fixed now… if it was b0rked in 105 or 108 or something then a lot of lost TTL data would still be hanging around.


You can debug it yourself or wait for someone else to do it.


I'm never sure with your comments whether you're serious or sarcastic. Which one is it this time?

Possible, but unlikely. I don't think my nodes are affected, but I haven't run the numbers yet. So we need to find a bug that is a bit more complicated to trigger, not just a general "TTL doesn't work on any node".

But to your previous question: if we don't care about previous versions, we could just sit back and wait to see if this problem goes away. After all, that is what I am doing on my own nodes at the moment. I ignore that the numbers on my dashboard are wrong and continue improving my success rate first. That is my highest priority right now. Second is to make sure the success rate stays high even when the node gets full. So I will have to run more tests with the badger cache and inode cache to see which of these solutions works best for me. After that part is done, I might have some CPU cycles to spend on running numbers for TTL cleanups or other possible root causes.

I am serious. We need somebody to dig into this. The data you have provided indicates that some entries haven't been written into the TTL DB. We don't know why that is. I suspect it will happen again. We can talk all day here in the forum, but that is clearly not going to fix it.

I am thinking about ways I can assist with the debugging part. I am happy to help with that. At the moment my nodes are in the middle of a migration, so they aren't even a good reference :confused:


That’s one possibility. Another one is that the deletion of expired pieces doesn’t work reliably.

I find it hard to debug for these reasons:

  • not all data from SLC has a TTL, and it's not necessarily all 30 days
  • the satellite_id and piece_id columns in the piece_expiration.db are blobs, which I don't know how to decipher (see the sketch below)

So I can't just count the pieces uploaded on 2024-07-07 and count the pieces expiring on 2024-08-06 in the piece_expiration.db. For the same reason I can't just count how many of the pieces uploaded on 2024-07-05 are left on 2024-08-05.
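The blob columns are just raw bytes, so something like this sketch can at least list the distinct satellite IDs in hex and count entries per expiration day. The table and column names (piece_expirations, satellite_id, piece_expiration) are assumptions based on a current node; check with .schema if yours differs:

```python
# Sketch: inspect piece_expiration.db without having to decode the blobs.
# Assumed table/column names: piece_expirations, satellite_id, piece_expiration.
import sqlite3

DB = "/mnt/storagenode/storage/piece_expiration.db"  # adjust

with sqlite3.connect(DB) as db:
    # Distinct satellites, printed as hex so they can at least be told apart.
    for (sat,) in db.execute("SELECT DISTINCT satellite_id FROM piece_expirations"):
        print("satellite (hex):", sat.hex())

    # How many pieces are due to expire on each day.
    for day, n in db.execute(
        "SELECT date(piece_expiration), count(*) FROM piece_expirations "
        "GROUP BY date(piece_expiration) ORDER BY 1"
    ):
        print(day, n)
```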

I don't think so. You already posted the contents of your TTL DB. If it showed similar numbers to my TTL DB, then sure, we could investigate why it doesn't delete them on your node. Since my TTL DB has way more entries than yours, it looks more like a problem with inserting into the TTL DB.

The way this works is probing the most likely theory and then going down the list. Sure, there could be a problem with deleting the expired pieces, but let's start at the top of the list and work through it one by one.

I would bet you can. You are right that it will not be a perfect match, for the reasons you have pointed out. On some days I have millions of TTL entries in the DB, and I would bet 99% of that is SLC, so we can just as well ignore the other 1%. If there is a problem with inserting into the TTL DB, we should see way more upload success messages, like a factor of 10 more. At that point we can skip all the points you mentioned and continue debugging where the inserts are getting lost.

One more thing: my nodes are also running with the wrong log level, and I don't want to restart them because I am a bit behind with TTL deletes on one node and don't want to restart that process. Did you know there is a way to change the log level without having to restart the node? Should we try that? I am happy to try it on my node first, but since my node doesn't seem to be affected, I do need some volunteers to do this.

I’m also willing to help out debugging…
