Trash on disk does not match reported trash

I have been having this issue on multiple nodes for months.
I was waiting for version 1.92.1, hoping it would rule this issue out: Us1.storj.io disk average 30% lower than actual disk usage - #18 by thepaul

But my issue is still present.

Several nodes show the same behavior; here is a typical sample:

Trash on disk (du -sh): 60G
Trash according to node dashboard: 0.6TB

The discrepancies are significant. One node has 22G in the trash on disk but over 1 TB according to the node dashboard.
I don’t see errors from the filewalker process; it is reported as finished successfully. And when I look at the trash history graph, it shows that the reported trash never really drops below a certain threshold.

This issue is significant because nodes end up being reported to the satellite as full while in reality they have plenty of free space on the disk.
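
For reference, this is roughly how I compare the two numbers; the mount point is from my setup and the layout assumes one subfolder per satellite under trash, so adjust the path to your node:

# total trash on disk vs. the per-satellite breakdown
STORAGE=/mnt/storagenode/storage    # my storage location, an example
du -sh "$STORAGE/trash"
du -sh "$STORAGE"/trash/*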

I think I saw the same behavior on my nodes…

Do you have errors related to retain?
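
You can check with something like this (the docker container name is an assumption, adjust to your setup):

# look for retain / garbage collection errors in the node log
docker logs storagenode 2>&1 | grep -i retain | grep -iE "error|failed" | tail -n 20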

I have the same issue. I found that after restarting the node, the reported amount of trash is significantly lower. One node I had to restart multiple times before it showed (so I presume) the correct amount of data.

It depends on the filewalker that runs on start. If it did not finish its work, it will not update this information, so you need to wait until it has finished before restarting.
Please note: it updates the stats only after a successful traversal, not while it is running.
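
You can confirm in the log that it finished for every satellite before you restart; for a docker node, something along these lines (the container name is an example):

# one "finished successfully" line per satellite means the traversal completed
docker logs storagenode 2>&1 | grep "used-space-filewalker" | grep "subprocess finished successfully"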

I am now seeing two different types of errors on some nodes.
Regarding retain there are some, but none during the last couple of hours:

retain  failed to delete piece  "error": "pieces error: pieceexpirationdb: context canceled",

But also errors regarding the filewalker, where it completes successfully for one satellite but not for the others:

lazyfilewalker.used-space-filewalker    subprocess finished successfully        {"process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
lazyfilewalker.used-space-filewalker    failed to start subprocess      {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
failed to lazywalk space used by satellite      {"process": "storagenode", "error": "lazyfilewalker: context canceled",

And I also see the filewalker getting killed:

lazyfilewalker.used-space-filewalker    subprocess exited with status   {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "status": -1, "error": "signal: killed"}
pieces  failed to lazywalk space used by satellite      {"process": "storagenode", "error": "lazyfilewalker: signal: killed"

And it seems like it keeps trying to redo the process in a loop.
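
A rough way to see how often it is being retried is to count the failure lines in the log (the docker container name is again an assumption):

# counts "failed to start" and "exited with status" lines for the used-space filewalker
docker logs storagenode 2>&1 | grep "used-space-filewalker" | grep -cE "failed to start subprocess|subprocess exited with status"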

That is usually a slow disk (the operation gets canceled by a timeout), or it happens when the node is shutting down.

This one is interesting. Do you have a shutdown somewhere before that?

Since the context canceled event happens only for that one satellite, I would suggest checking the filesystem for errors; perhaps the pieces of this satellite are affected.
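
For ext4 that would look roughly like this; the device and mount point are only examples, and the node must be stopped while the check runs:

docker stop -t 300 storagenode
umount /mnt/storagenode
fsck.ext4 -f /dev/sdb1       # replace with your data device
mount /mnt/storagenode       # assumes an fstab entry for the mount point
docker start storagenode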

It looks like there are two shutdown causes: one is a Docker OOM kill, the other a storagenode update.
In both cases the filewalker does not resume where it left off; it restarts from the beginning, which is terrible design.
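
A minimal sketch for confirming the OOM kill and giving the container more headroom, assuming a plain docker setup (container name and sizes are examples):

# was the last container exit an OOM kill?
docker inspect storagenode --format '{{.State.OOMKilled}}'

# raise the memory limit on the running container
docker update --memory 2g --memory-swap 2g storagenode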

Half of my nodes display Trash 0 on the dashboard, the others show under 200 MB. I thought something was wrong with the databases, so I checked the folders. There are only empty folders in the trash folder, no files, so the dashboard is right. If the trash is emptied once a week, does that mean there were no deletes to trash in a whole week? That has never happened before; I have always had files in trash. Does anybody else see empty trash folders? Can someone explain it?
My nodes are not full and hold between 8 and 13 TB of data.
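
To double-check that there really are no files, counting them directly is more reliable than eyeballing the folders (the path is from my setup):

# actual files under the trash folder, then the empty directories for comparison
find /mnt/storagenode/storage/trash -type f | wc -l
find /mnt/storagenode/storage/trash -type d -empty | wc -l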

Ouhh… so the next run will hammer our nodes big time.