PSA: something isn't updating trash again

Mitsos · September 12, 2024, 4:50pm

Nodes are on v1.112.2, bare-metal Linux. The used-space filewalker completed without any errors when the nodes were previously updated to 1.111.4. Databases are on SSD, and there are no trash-related errors in the logs other than the expected:

Sep 11 23:08:20 server1 node6[2607534]: 2024-09-11T23:08:20+03:00        ERROR        blobscache        trashTotal < 0        {"Process": "storagenode", "trashTotal": -44146534884}
Sep 12 17:37:28 server1 node6[2607534]: 2024-09-12T17:37:28+03:00        ERROR        blobscache        trashTotal < 0        {"Process": "storagenode", "trashTotal": -1397551360}

One node just hit 0 bytes for trash (hence the error lines above when it tried to subtract from 0), while there are directories (+files) inside trash. This means that files being moved into trash don’t update the databases correctly, which results in the number not growing but instead only decreasing.
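To make the failure mode above concrete, here is a toy model of a cached trash total. It is only a sketch: `TrashCounter` and its methods are hypothetical names, and clamping to zero after logging the error is an assumption, not a copy of the storagenode code.

```python
class TrashCounter:
    """Toy model of a cached trash total (hypothetical, not storagenode code)."""

    def __init__(self, total=0):
        self.total = total
        self.errors = []  # negative totals that would have been logged

    def moved_to_trash(self, size, recorded=True):
        # recorded=False models the suspected bug: the file lands in the
        # trash directory, but the cached total is never increased.
        if recorded:
            self.total += size

    def deleted_from_trash(self, size):
        new_total = self.total - size
        if new_total < 0:
            # Same shape as the "trashTotal < 0" log line: report, then clamp
            # (the clamp is an assumption).
            self.errors.append(new_total)
            new_total = 0
        self.total = new_total


counter = TrashCounter(total=1_000)
counter.moved_to_trash(5_000, recorded=False)  # on disk, but not counted
counter.deleted_from_trash(5_000)              # cleanup subtracts it anyway
print(counter.total, counter.errors)
```

If additions are missed but deletions are always subtracted, the total can only shrink until it hits zero, which matches what the node is showing.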

If anyone else notices this, it would be good to know, just as a sanity check.

Ambifacient · September 12, 2024, 6:19pm

This can happen when the used-space filewalker is computing the size of the trash when there is also a retain process running.
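One interleaving that would produce this, written as a deterministic simulation. This is an assumption about the mechanism, not something confirmed in the thread: the walker measures the trash at the start of its scan and overwrites the cached value with that stale figure when it finishes. `simulate` and its parameters are hypothetical.

```python
def simulate(trash_on_disk, retained_during_scan):
    """Return (cached_total, real_total) after a scan that overlaps retain."""
    cached = trash_on_disk                   # cache starts in sync with disk
    snapshot = trash_on_disk                 # walker measures trash at scan start

    trash_on_disk += retained_during_scan    # retain moves pieces into trash
    cached += retained_during_scan           # and bumps the cached total

    cached = snapshot                        # walker finishes, writes stale total
    return cached, trash_on_disk


cached, real = simulate(trash_on_disk=10, retained_during_scan=4)
print(cached, real)
```

In this model the bytes retained during the scan exist on disk but are missing from the cached total, so later deletions subtract more than was ever counted.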

Mitsos · September 12, 2024, 6:31pm

And the solution to that would be what I suggested months ago: limit the filewalkers to one at a time (trash cleanup runs first, then gc once that has completed, and used-space last).
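A minimal sketch of that idea, assuming a single shared lock that every walker must hold while it runs. `SerialWalkers` and the walker names are made up for illustration and do not reflect how storagenode schedules its filewalkers.

```python
import threading


class SerialWalkers:
    """Let only one walker run at a time (hypothetical illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.events = []

    def run(self, name, work):
        # Any other caller blocks here until the current walker is done.
        with self._lock:
            self.events.append(f"{name} started")
            work()
            self.events.append(f"{name} completed")


walkers = SerialWalkers()
# Fixed order from the post: trash cleanup, then gc, then used-space.
for name in ("trash-cleanup", "gc", "used-space"):
    walkers.run(name, lambda: None)  # real work would go here
print(walkers.events)
```

With this shape, a used-space scan can never overlap a retain run, at the cost of each walker waiting for the one before it.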

Alexey · September 13, 2024, 6:31am

This only means that the databases are not updated (yet). Could you please check that all used-space filewalkers have finished without issues (both the filewalker and the database) and that no retain is currently running?

Mitsos · September 13, 2024, 2:22pm

Not running, /retain directory empty.

Alexey · September 14, 2024, 4:52am

This is interesting. Perhaps the used trash space was not updated in a previous version.
If so, the new version should correct the discrepancy, if any.
Does the trash usage match the usage reported by the OS?

Mitsos · September 14, 2024, 6:16am

Node reported 0B trash, files in the trash folders reported by the OS (did not check size, but I doubt all the files are 0B), so no, they don’t match.
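For anyone who wants to check the size rather than just the presence of files, a small sketch that sums the file sizes under a directory. It counts apparent file sizes (`st_size`), not allocated blocks, so it can differ slightly from `du`. The directory layout and file name in the demo are invented; point `dir_size_bytes` at the real trash folder instead.

```python
import os
import stat
import tempfile


def dir_size_bytes(root):
    """Sum the apparent size of all regular files below root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                info = os.lstat(os.path.join(dirpath, name))
            except FileNotFoundError:
                continue  # file was deleted while we were walking
            if stat.S_ISREG(info.st_mode):
                total += info.st_size
    return total


# Demo on a throwaway directory (hypothetical layout and file name).
with tempfile.TemporaryDirectory() as trash:
    day_dir = os.path.join(trash, "2024-09-12")
    os.makedirs(day_dir)
    with open(os.path.join(day_dir, "piece.sj1"), "wb") as fh:
        fh.write(b"x" * 2048)
    on_disk = dir_size_bytes(trash)

node_reported = 0  # the "0B trash" from the post above
print(on_disk, on_disk == node_reported)
```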

Alexey · September 14, 2024, 6:20am

And did the used-space-filewalker finish its scan on the current version without an error in either the filewalker or the databases?
If you disabled it, then I would recommend enabling it and restarting the node to allow the used-space-filewalker to correct the databases.

Mitsos · September 14, 2024, 6:23am

used-space was run on all nodes when I updated them to 1.111.4, as per OP. I don’t run storagenode-updater, I manually update them so I can control when they update (=no used-space was interrupted at any point).

None of the nodes reported any database errors (e.g. locked), nor any filewalker errors.

Update: 2nd node hit 0B trash.

Alexey · September 14, 2024, 6:25am

I mean after you updated the node, did you allow it to scan?

Mitsos · September 14, 2024, 6:26am

Used space was run (and completed) on every single update for the past 4 months, with the exception of the 1.112.2 update, since that version is the one that is supposed to have everything related to space tracking fixed (edit for clarity: all space-related fixes were included prior to that version, so the last version that MUST complete used-space is 1.111.4). Can’t be any more clear than this.

Alexey · September 14, 2024, 6:27am

Yes, it supposedly has all the fixes for previous bugs, so it requires one scan after the update.
With the badger cache you may not need to disable it at all; it’s pretty fast.

Mitsos · September 14, 2024, 6:31am

@Alexey version 1.112.2 did not contain anything related to space tracking, hence why no used-space was run in that version.

Alexey · September 14, 2024, 6:40am

I understand. But you have a discrepancy, so you need to run the scan.
I cannot check this myself yet, and am unlikely to be able to, because I have the badger cache enabled and I didn’t disable the scan, so my trash usage is correct on all nodes, even if they are not updated to 1.112.x yet.
But I enabled the badger cache only on the biggest one; the two others (1.4TB and 0.8TB) are working fine without any cache in lazy mode.

Oops, the biggest one just updated while we were chatting…

$ grep "\sused-space" /mnt/x/storagenode2/storagenode.log | grep -E "started|completed"
2024-09-14T05:00:23Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"}
2024-09-14T05:00:27Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Lazy File Walker": false, "Total Pieces Size": 110854656, "Total Pieces Content Size": 110829568}
2024-09-14T05:00:27Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-09-14T05:16:05Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "Total Pieces Size": 43545583104, "Total Pieces Content Size": 43518749696}
2024-09-14T05:16:05Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-09-14T05:34:18Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Lazy File Walker": false, "Total Pieces Size": 137194912000, "Total Pieces Content Size": 136975600384}
2024-09-14T05:34:18Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
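The same log lines can be paired up per satellite to see how long each scan took and which one is still running. This is a rough sketch: it assumes the plain `...Z` timestamp format shown above (not the journald-prefixed lines from the first post), and `parse_line` and `summarize` are made-up helper names.

```python
import json
from datetime import datetime


def parse_line(line):
    """Return (timestamp, status, satellite) or None for unrelated lines."""
    head, brace, tail = line.partition("{")
    if "used-space-filewalker" not in head or not brace:
        return None
    fields = head.split()
    timestamp = datetime.strptime(fields[0], "%Y-%m-%dT%H:%M:%SZ")
    status = fields[-1]  # "started" or "completed"
    satellite = json.loads(brace + tail)["Satellite ID"]
    return timestamp, status, satellite


def summarize(lines):
    """Return ({satellite: seconds}, {satellites still running})."""
    started, durations = {}, {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        timestamp, status, satellite = parsed
        if status == "started":
            started[satellite] = timestamp
        elif status == "completed" and satellite in started:
            elapsed = timestamp - started.pop(satellite)
            durations[satellite] = elapsed.total_seconds()
    return durations, set(started)


# Three of the lines from the output above.
sample = [
    '2024-09-14T05:00:27Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}',
    '2024-09-14T05:16:05Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "Total Pieces Size": 43545583104, "Total Pieces Content Size": 43518749696}',
    '2024-09-14T05:34:18Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}',
]
durations, running = summarize(sample)
print(durations, running)
```

A satellite that shows up in `running` has a "started" line with no matching "completed" line, which is the thing to look for before trusting the reported numbers.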

Mitsos · September 14, 2024, 6:44am

Alternatively the space tracking bug that I have identified in this thread needs to be fixed.

If 1.111.4 is supposed to fix everything related to space tracking, and used space is completed successfully, then there shouldn’t be any discrepancy after that version.

Personally I’m done with the used space filewalkers, will not bother any more with that. There are only two things that matter to me: free space as reported by the OS and that old trash folders are successfully deleted. Will set every node to 1PB and just restart them when they start reporting 5GB free.
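The free-space check described above could be scripted roughly like this. It is a sketch only: `low_on_space` is a hypothetical helper, the 5GB threshold is taken as decimal gigabytes, and what to do when it returns true (for example restarting the node) is left to the operator.

```python
import shutil

FIVE_GB = 5 * 10**9  # decimal gigabytes; an assumption about the unit


def low_on_space(path, threshold_bytes=FIVE_GB):
    """True if the filesystem holding path has less free space than the threshold."""
    return shutil.disk_usage(path).free < threshold_bytes


# "." stands in for the node's storage directory.
print(low_on_space("."))
```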

Alexey · September 14, 2024, 6:47am

Then it still needs to be identified.
By the way, I once saw the trash usage get updated to the correct value after a while. But I suppose your node has had enough time to propagate the trash usage to the databases.

donald.m.motsinger · September 14, 2024, 11:44am

Alexey:

2024-09-14T05:00:23Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"}
2024-09-14T05:00:27Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Lazy File Walker": false, "Total Pieces Size": 110854656, "Total Pieces Content Size": 110829568}

You still have pieces from stefan-benten?

Mitsos · September 14, 2024, 6:41pm

No, all the gone satellites have been cleaned up (only 4 satellite folders left).

donald.m.motsinger · September 14, 2024, 6:43pm

Look again what I quoted. Wasn’t talking about you.

Alexey · September 15, 2024, 2:47am

Yes. I keep it so that I can help others who also still have it.