Why does Garbage Collection run only once a week?

It is absolutely not good that Garbage Collection runs only once a week.

  1. It causes a lot of I/O all at once.
  2. Data that a client deleted a week ago, and that is no longer paid for, stays a whole week on the nodes and then another week in the trash. That is why we see graphs where we hold 10 TB of data but are paid for only 9 TB.
    It would be much better if GC ran every day.
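
The timing argument in point 2 can be put into numbers. This is only a rough sketch: it assumes a deleted piece waits at most one full GC interval before being moved to the trash, then stays in the trash for the 7 days mentioned in this thread. It ignores how long a GC run itself takes, and the function name is made up for illustration.

```python
from datetime import timedelta

# From the thread: pieces stay in the trash for 7 days before permanent deletion.
TRASH_RETENTION = timedelta(days=7)


def worst_case_unpaid(gc_interval: timedelta,
                      trash_retention: timedelta = TRASH_RETENTION) -> timedelta:
    """Longest time a deleted piece can keep occupying disk space.

    Assumption: the piece is deleted right after a GC run, so it waits
    one full GC interval, and only then starts its trash retention period.
    """
    return gc_interval + trash_retention


weekly = worst_case_unpaid(timedelta(days=7))
daily = worst_case_unpaid(timedelta(days=1))

print(f"weekly GC: up to {weekly.days} days unpaid on disk")
print(f"daily GC:  up to {daily.days} days unpaid on disk")
```

Under these assumptions a daily GC would shorten the worst case from 14 days to 8 days; the 7 days in the trash remain either way.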

Sometimes we need to get the data back from the trash?

The trash bin is a backup function in case there are bugs in the code, so there is time to recover before permanent deletion.

Because it is initiated by the satellites sending the bloom filter. I guess it could be an expensive operation for them, because it is a chore across all pieces in all segments.

If the garbage collection (bloom filter processing) runs only once a week, and the pieces must stay in the trash for 7 days, then I don’t see why the process that empties the trash must run this often.

2023-07-21T02:13:17.462Z	INFO	pieces:trash	emptying trash started	{"process": "storagenode", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2023-07-21T20:43:37.267Z	INFO	pieces:trash	emptying trash started	{"process": "storagenode", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}

The majority of runs will not find any files to delete and only cause unneeded IO, am I right?
Is there a config option to control this frequency?

Who can guarantee that you did not restart your node or have downtime? The start time for counting the 7 days would vary. The only reliable way is to ask the filesystem when each piece was last updated.

No, it’s hardcoded.

Yes, that is why it should run at node startup, as it currently does. But after that, if you don’t restart the node, there is no need to run it so often.

Looks like you confused the garbage collector with the filewalker.

There are several similar but different processes:

  • filewalker to calculate used and free space in the allocation and to update the cache (databases). It is needed to prevent overusage of the allocation and to report the actual used and available space to the satellites;
  • pieces expiration chore. It goes through the trash and removes expired pieces;
  • garbage collector chore. It applies a bloom filter to all pieces, moves unmatched to the trash.

This is not a complete list though.
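
The third bullet, applying a bloom filter and moving unmatched pieces to the trash, can be sketched like this. It is a toy illustration, not the storagenode implementation: the filter size, the hash scheme, and the names `BloomFilter` and `collect_garbage` are all assumptions. The property that matters is that a bloom filter has no false negatives, so a piece the satellite still needs is never trashed, while a false positive can let some garbage survive until a later run.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter; size and hash scheme are illustrative assumptions."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def collect_garbage(stored_pieces, bloom):
    """Split stored pieces into those to keep and those to move to the trash."""
    kept = [p for p in stored_pieces if p in bloom]
    trashed = [p for p in stored_pieces if p not in bloom]
    return kept, trashed


# "Satellite" side: build a filter from the pieces that should still exist.
live = ["piece-a", "piece-b"]
bloom = BloomFilter()
for piece in live:
    bloom.add(piece)

# "Node" side: piece-c is no longer referenced by the satellite.
kept, trashed = collect_garbage(["piece-a", "piece-b", "piece-c"], bloom)
print("kept:", kept)
print("moved to trash:", trashed)
```

Pieces are moved to the trash rather than deleted outright, which matches the backup role of the trash described earlier in the thread.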

I’m aware of that. I’m talking about the process in your second bullet point, the process that “goes through the trash and removes expired pieces”

And to illustrate my point, a few days ago there was a massive delete from the europe-north satellite, and some of us ended up with TBs of data in the trash. We needed to wait 7 days for it to be cleared, but in the meantime this process ran many times over those files.

We need a parameter to set the frequency of this process, rather than having it hardcoded.

I do not see any better solution so far. You need to check the update timestamp of each piece in the trash and remove it if it is older than 7 days.
The frequency is once a day, and this is enough, because your node moves pieces to the trash on deletion requests too, not only on garbage collection.
To illustrate:

# 4 days old
Get-ChildItem w:\storagenode5\storage\trash\ -File -Recurse | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-4)} | Measure-Object -Property Length -Sum | %{"Total {0:f2} MB`ncount {1}" -f ($_.Sum/1e6), $_.Count}
Total 475.45 MB
count 1254

# 3 days old
Get-ChildItem w:\storagenode5\storage\trash\ -File -Recurse | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-3)} | Measure-Object -Property Length -Sum | %{"Total {0:f2} MB`ncount {1}" -f ($_.Sum/1e6), $_.Count}
Total 4239.33 MB
count 56206

So, if I checked only once a week from the node’s start, that data would stay there longer than a week, depending on when I restarted this node.
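
For anyone not on Windows, roughly the same check as the PowerShell one-liners can be sketched in Python. It only lists and measures files by modification time and does not delete anything. The function names, the 7-day default, and the demo file names are assumptions for illustration, and this is not how the storagenode itself implements the chore.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def expired_files(trash_dir, max_age_days=7.0, now=None):
    """Return files under trash_dir whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return [p for p in Path(trash_dir).rglob("*")
            if p.is_file() and p.stat().st_mtime < cutoff]


def summarize(files):
    """Return (count, total size in MB), like the PowerShell output above."""
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes / 1e6


# Demo on a throwaway directory instead of a real trash folder.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old.piece"
    new = Path(tmp) / "new.piece"
    old.write_bytes(b"x" * 2000)
    new.write_bytes(b"y" * 1000)

    # Backdate one file so it looks like it was trashed 8 days ago.
    eight_days_ago = time.time() - 8 * SECONDS_PER_DAY
    os.utime(old, (eight_days_ago, eight_days_ago))

    found = expired_files(tmp)
    demo_expired = [p.name for p in found]
    demo_count, demo_mb = summarize(found)
    print(f"Total {demo_mb:.2f} MB, count {demo_count}: {demo_expired}")
```

Because the decision depends only on each file's timestamp, it gives the same answer no matter when the node was last restarted, which is the point made above about asking the filesystem.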

That would be fine for me if I could reduce the IO pressure on my drives by running these processes less often.

Then you need to modify this piece of code and build your own binary to support this feature.
You may also submit a PR; our team would be glad to accept it. Community contributions are very welcome!

Not anymore. Since a recent change, normal deletions are handled by GC as well.

Then I cannot explain why the update timestamps of pieces in the trash are so different. Unless the Garbage Collector is now so slow that it works for days instead of a few hours… :thinking:

I will check next week.

Frankly, after the change where the collection pass is separate from actual moving the files, I wouldn’t be surprised. I’ve noticed it in this thread and asked, but got no explicit answer.

Indeed, and because of that, the config option pieces.delete-to-trash: false no longer makes any difference. Every unneeded file is handled by the retain process and sent to the trash.