It is absolutely not good that garbage collection runs only once a week.
It hits a lot of I/O at one time.
Data that a client deleted a week ago, and that is no longer paid for, stays a whole week on the nodes and then another week in the trash, and then we see graphs where we hold 10 TB of data but are paid for only 9 TB.
It would be much better if GC ran every day.
Because it's initiated by the satellites sending the bloom filter. I guess it could be an expensive operation for them, because it is a chore across all pieces in all segments.
If garbage collection (bloom filter processing) runs only once a week, and pieces must stay in the trash for 7 days, then I don't see why the process that empties the trash must run this often.
2023-07-21T02:13:17.462Z INFO pieces:trash emptying trash started {"process": "storagenode", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2023-07-21T20:43:37.267Z INFO pieces:trash emptying trash started {"process": "storagenode", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
The majority of runs will not find any files to delete and only cause unneeded I/O, am I right?
Is there a config option to control this frequency?
Who can guarantee that you did not restart your node or have downtime? The start time for counting the 7 days will vary. The only way is to ask the filesystem when the piece was last updated.
Yes, that is why it should run at node startup, as it currently does. But after that, if you don't restart the node, there is no need to run it so often.
There are several similar but different processes:
filewalker to calculate used and free space in the allocation and to update the cache (databases). It is needed to prevent overuse of the allocation and to report the actual used and available space to the satellites;
pieces expiration chore. It goes through the trash and removes expired pieces;
garbage collector chore. It applies a bloom filter to all pieces and moves unmatched ones to the trash.
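To make the garbage collector step concrete, here is a minimal sketch in Go of applying a bloom filter to stored piece IDs. Everything in it is an assumption for illustration: the `bloom` type, `newBloom`, and `retain` are hypothetical names, the hashing is a toy, and it does not reproduce the real filter format or the storagenode code. It only shows the principle: a piece the filter may contain is kept, a piece the filter definitely does not contain goes to the trash.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a toy Bloom filter (hypothetical; not the satellite's real format).
type bloom struct {
	bits []bool
	k    int // number of bit positions derived per item
}

func newBloom(m, k int) *bloom { return &bloom{bits: make([]bool, m), k: k} }

// positions derives k bit positions from one 64-bit FNV hash (double hashing).
func (b *bloom) positions(id string) []int {
	h := fnv.New64a()
	h.Write([]byte(id))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)|1
	out := make([]int, b.k)
	for i := range out {
		out[i] = int((h1 + uint32(i)*h2) % uint32(len(b.bits)))
	}
	return out
}

// Add marks a piece ID as "still needed".
func (b *bloom) Add(id string) {
	for _, p := range b.positions(id) {
		b.bits[p] = true
	}
}

// MayContain never returns false for an added ID, but may return true
// for an ID that was never added (a false positive).
func (b *bloom) MayContain(id string) bool {
	for _, p := range b.positions(id) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

// retain splits stored piece IDs into those kept and those moved to trash.
func retain(filter *bloom, stored []string) (kept, trashed []string) {
	for _, id := range stored {
		if filter.MayContain(id) {
			kept = append(kept, id)
		} else {
			trashed = append(trashed, id)
		}
	}
	return kept, trashed
}

func main() {
	filter := newBloom(1024, 3)
	filter.Add("piece-a")
	filter.Add("piece-b")
	kept, trashed := retain(filter, []string{"piece-a", "piece-b", "piece-c"})
	fmt.Println("kept:", kept, "trashed:", trashed)
}
```

Because a bloom filter can give false positives but never false negatives, a sketch like this can leave some garbage behind for a later pass, but it never trashes a piece that was added to the filter.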
I'm aware of that. I'm talking about the process in your second bullet point, the process that "goes through the trash and removes expired pieces".
And to illustrate my point, a few days ago there was a massive delete from the europe-north satellite, and some of us ended up with TBs of data in the trash. We needed to wait 7 days for it to be cleared, but in the meantime this process ran many times, processing those files.
We need a parameter to set the frequency of this process, rather than having it hardcoded.
I do not see any better solution so far. You need to check the update timestamp of the piece in the trash and remove it if it is older than 7 days.
The frequency is once a day. This is enough, because your node moves pieces to the trash on deletion requests too, not only on garbage collection.
To illustrate:
Then you need to modify this piece of code and build your own binary to support this feature.
You may also submit a PR; our team would be glad to accept it. Community contributions are very welcome!
Then I cannot explain why the update timestamps of pieces in the trash are so different. Maybe only if the garbage collector is too lazy now and works for days instead of a few hours…
Frankly, after the change where the collection pass is separate from actually moving the files, I wouldn't be surprised. I've noticed it in this thread and asked, but got no explicit answer.
Indeed, and because of that, the config option pieces.delete-to-trash: false now doesn't make any difference. Every unneeded file is handled by the retain process and sent to the trash.