Nodes are on v1.112.2, bare-metal Linux; used-space completed without any error when they were previously updated to 1.111.4. Databases are on SSD, and there are no errors in the logs related to trash, other than the expected:
One node just hit 0 bytes for trash (hence the error lines above when it tried to subtract from 0), while there are directories (+files) inside trash. This means that files being moved into trash don’t update the databases correctly, which results in the number not growing but instead only decreasing.
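To make the suspected failure mode concrete, here is a toy model of it. This is not Storj code: `TrashCounter`, `record_moves` and the byte values are all made up for illustration. It only shows how a cached counter that is decremented on deletes but never incremented on moves would drift down to 0 while files still sit in the trash.

```python
class TrashCounter:
    """Toy model of a cached trash-size counter (hypothetical, not Storj code)."""

    def __init__(self, initial_bytes, record_moves):
        self.cached_bytes = initial_bytes
        self.record_moves = record_moves  # False models the suspected bug
        self.errors = []

    def move_to_trash(self, size):
        # With the suspected bug, the move is never added to the cached value.
        if self.record_moves:
            self.cached_bytes += size

    def delete_from_trash(self, size):
        # Deletes are always subtracted; going below zero is reported and clamped.
        if size > self.cached_bytes:
            self.errors.append(f"cannot subtract {size} from {self.cached_bytes}")
            self.cached_bytes = 0
        else:
            self.cached_bytes -= size


def simulate(record_moves):
    """Return (cached bytes, bytes really on disk, errors) after a few rounds."""
    counter = TrashCounter(initial_bytes=100, record_moves=record_moves)
    on_disk = 100
    for moved, deleted in [(50, 80), (50, 60), (50, 40)]:
        counter.move_to_trash(moved)
        on_disk += moved
        counter.delete_from_trash(deleted)
        on_disk -= deleted
    return counter.cached_bytes, on_disk, counter.errors


if __name__ == "__main__":
    print("moves not recorded:", simulate(record_moves=False))
    print("moves recorded:    ", simulate(record_moves=True))
```

In the first case the cached value ends at 0 with subtraction errors while data remains on disk; in the second case the cached value and the on-disk value agree.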
If anyone else notices this, it would be good to know, just as a sanity check.
And the solution to that would be what I suggested months ago: limit the filewalkers to one at a time (trash cleanup runs first; when that is completed, gc runs next, and used space runs at the end).
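A minimal sketch of that "one at a time" ordering, purely as an illustration. The walker names and the `run_sequentially` helper are hypothetical and do not correspond to any storagenode function or config option.

```python
def run_sequentially(walkers):
    """Run each (name, func) pair in order.

    The next walker starts only after the previous one has returned,
    so two walkers never scan the disk at the same time.
    """
    finished = []
    for name, walk in walkers:
        walk()
        finished.append(name)
    return finished


if __name__ == "__main__":
    log = []
    # Placeholder walkers; a real one would scan the disk here.
    walkers = [
        ("trash-cleanup", lambda: log.append("trash cleanup done")),
        ("gc", lambda: log.append("gc done")),
        ("used-space", lambda: log.append("used space done")),
    ]
    print(run_sequentially(walkers))
    print(log)
```

The point of the ordering is that used space is counted last, after trash cleanup and gc have stopped moving and deleting files.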
This only means that the databases are not updated (yet). Could you please check that all used-space-filewalkers have finished without issues (both the filewalker and the database) and that no retain is currently running?
This is interesting. Perhaps the used trash space has not been updated in a previous version.
So the new one should correct the discrepancy, if any.
Does the trash usage match the usage reported by the OS?
Node reported 0B trash, files in the trash folders reported by the OS (did not check size, but I doubt all the files are 0B), so no, they don’t match.
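For anyone who wants to put a number on the OS side of that comparison, a small sketch like the following sums the file sizes under a directory. The function name is made up, and the path to the trash folder is something you have to supply yourself, since it depends on your setup. Note that it adds up apparent file sizes (`st_size`), which can differ from the on-disk usage that `du` reports.

```python
import os
import sys


def directory_size_bytes(root):
    """Return (total apparent size in bytes, number of files) under root."""
    total = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                # The file may have been deleted while the scan was running.
                continue
            total += st.st_size
            file_count += 1
    return total, file_count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python trash_size.py /path/to/your/trash/folder
    size, count = directory_size_bytes(sys.argv[1])
    print(f"{count} files, {size} bytes")
```

If this reports a non-zero size while the node shows 0B trash, the two clearly do not match.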
And did the used-space-filewalker finish its scan without an error in either the filewalker or the databases for the current new version?
If you disabled it, then I would recommend enabling it and restarting the node to allow the used-space-filewalker to correct the databases.
used-space was run on all nodes when I updated them to 1.111.4, as per OP. I don’t run storagenode-updater, I manually update them so I can control when they update (=no used-space was interrupted at any point).
None of the nodes reported any database error (i.e. locked), nor any filewalker errors.
Used space was run (and completed) on every single update for the past 4 months, with the exception of the 1.112.2 update, since that version is the one that is supposed to have everything related to space tracking fixed (edit for clarity: all space related fixes were included prior to that version, so the last version that MUST complete used-space is 1.111.4). Can’t be any more clear than this.
Yes, it supposedly should have all the fixes for previous bugs, thus it requires one scan after the update.
With a badger cache you don’t need to disable it at all; it’s pretty fast.
I understand. But you have a discrepancy, so you need to run the scan.
I cannot check this myself yet, and am unlikely to be able to, because I have the badger cache enabled and I didn’t disable the scan, so my trash usage is correct on all nodes, even if they are not updated to 1.112.x yet.
But I enabled the badger cache only on the biggest one; the two others (1.4TB and 0.8TB) are working fine without any cache in lazy mode.
Oops, the biggest one just updated while we were chatting…
Alternatively the space tracking bug that I have identified in this thread needs to be fixed.
If 1.111.4 is supposed to fix everything related to space tracking, and used space is completed successfully, then there shouldn’t be any discrepancy after that version.
Personally I’m done with the used space filewalkers, will not bother any more with that. There are only two things that matter to me: free space as reported by the OS and that old trash folders are successfully deleted. Will set every node to 1PB and just restart them when they start reporting 5GB free.
Then it still needs to be identified.
By the way, I once saw the trash usage get updated to a correct value after a while. But I suppose your node has had enough time to propagate the trash usage to the databases.