Several nodes are showing the same behavior; here is one sample:
Trash on disk (du -sh): 60G
Trash according to node dashboard: 0.6TB
The discrepancies are significant. One node has 22G in the trash on disk but over 1 TB according to the node dashboard.
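For reference, this is roughly how I compare the two numbers. The mount path is just a placeholder for your own storage directory, and the trash folder contains one subfolder per satellite:

du -sh /mnt/storagenode/storage/trash      # total trash on disk
du -sh /mnt/storagenode/storage/trash/*    # per-satellite breakdown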
I don’t see any errors from the filewalker process; it is reported as finished successfully. And when I look at the trash history graph, it shows that the reported trash never really drops below a certain threshold.
This issue matters because nodes end up being reported to the satellite as full while in reality they have plenty of free space on disk.
I have the same issue. I found that after restarting the node, the reported amount of trash is significantly lower. One node I had to restart multiple times before it showed (so I presume) the correct amount of data.
It depends on the filewalker that runs on start. If it did not finish its work, it will not update this info, so you need to wait until it has finished before restarting.
Please note: it updates the stat only after a successful traversal, not while it is in progress.
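If you want to see the stored value without waiting for the dashboard, you can look into the node's databases. The file and table names below are from memory and may differ between versions, so inspect the schema first, and run this against a copy (or with the node stopped) to avoid locking the live database:

sqlite3 /mnt/storagenode/storage/piece_spaces_used.db '.tables'                          # confirm the table name first
sqlite3 /mnt/storagenode/storage/piece_spaces_used.db 'SELECT * FROM piece_space_used;'  # values the filewalker wrote on its last successful run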
Also regarding the filewalker: it completes successfully for one satellite but not for the others:
lazyfilewalker.used-space-filewalker subprocess finished successfully {"process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
lazyfilewalker.used-space-filewalker failed to start subprocess {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
failed to lazywalk space used by satellite {"process": "storagenode", "error": "lazyfilewalker: context canceled",
And I also see the filewalker getting killed:
lazyfilewalker.used-space-filewalker subprocess exited with status {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "status": -1, "error": "signal: killed"}
pieces failed to lazywalk space used by satellite {"process": "storagenode", "error": "lazyfilewalker: signal: killed"
And it seems to keep retrying the process in a loop.
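A quick way to see which satellites actually finished is to filter the logs. The container name "storagenode" is an assumption; adjust it to your setup:

docker logs storagenode 2>&1 | grep used-space-filewalker | grep -E 'finished successfully|failed|killed'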
Likely a slow disk, canceled by a timeout, or the node was shutting down.
This one is interesting. Do you have a shutdown somewhere before that?
Since the context canceled event happens only for that one satellite, I would suggest checking the filesystem for errors; perhaps this satellite's pieces are affected.
It looks like there are two shutdown reasons: one comes from a Docker OOM kill, the other from a storagenode update.
Either way, the filewalker does not resume where it left off; it restarts from the beginning, which is terrible design.
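You can at least confirm the OOM part from Docker itself; the container name is again an assumption:

docker inspect --format '{{.State.OOMKilled}} {{.HostConfig.Memory}}' storagenode   # prints true/false plus the memory limit in bytes (0 = unlimited)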
Half of my nodes display 0 trash on the dashboard, the others under 200 MB. I thought there was something wrong with the databases, so I checked the folders. There are only empty folders in the trash folder, no files, so the dashboard is right. If the trash is emptied once a week, does that mean there were no deletes to trash in a whole week? That never happened before; I always had files in trash. Does anybody else see empty trash folders? Can someone explain it?
My nodes are not full and have between 8 and 13 TB of data.
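In case someone wants to double-check the same thing, counting actual files (not the empty per-satellite folders) looks roughly like this, with the path adjusted to your node:

find /mnt/storagenode/storage/trash -type f | wc -l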