Why does Garbage Collection run only once a week?

Vadim · July 21, 2023, 8:22pm

It is absolutely not good that Garbage Collection runs only once a week.

It generates a lot of I/O all at once.
Data that a client deleted a week ago, and that is no longer paid for, stays on the nodes for a whole week and then spends another week in the trash. That is why we see graphs showing 10 TB stored while only 9 TB is paid.
It would be much better if GC ran every day.

nyancodex · July 21, 2023, 8:36pm

Sometimes we need to get the data back from the trash?

Vadim · July 21, 2023, 8:40pm

The trash bin is a backup function in case there are bugs in the code, so that there is time before permanent deletion.

Alexey · July 22, 2023, 3:21am

Because it’s initiated by the satellites sending the bloom filter. I guess it could be an expensive operation for them, because it is a chore across all pieces in all segments.

mgonzalezm · July 22, 2023, 7:06pm

If garbage collection (bloom filter processing) runs only once a week, and the pieces must stay in the trash for 7 days, then I don’t see why the process that empties the trash must run this often.

2023-07-21T02:13:17.462Z	INFO	pieces:trash	emptying trash started	{"process": "storagenode", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2023-07-21T20:43:37.267Z	INFO	pieces:trash	emptying trash started	{"process": "storagenode", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}

The majority of runs will not find any files to delete and will only cause unneeded IO, am I right?
Is there a config option to control this frequency?

Alexey · July 23, 2023, 5:22am

Who can guarantee that you did not restart your node or did not have downtime? The start time for counting the 7 days would vary. The only way is to ask the filesystem when the piece was last updated.

No, it’s hardcoded:

https://github.com/storj/storj/blob/b16c8ba2e4d5b6b5aaf32867b952b9b1f5419fe5/storagenode/peer.go#L497

          		peer.Storage2.PieceDeleter = pieces.NewDeleter(log.Named("piecedeleter"), peer.Storage2.Store, config.Storage2.DeleteWorkers, config.Storage2.DeleteQueueSize)
          		peer.Services.Add(lifecycle.Item{
          			Name:  "PieceDeleter",
          			Run:   peer.Storage2.PieceDeleter.Run,
          			Close: peer.Storage2.PieceDeleter.Close,
          		})
          
          		peer.Storage2.TrashChore = pieces.NewTrashChore(
          			log.Named("pieces:trash"),
          			24*time.Hour,   // choreInterval: how often to run the chore
          			7*24*time.Hour, // trashExpiryInterval: when items in the trash should be deleted
          			peer.Storage2.Trust,
          			peer.Storage2.Store,
          		)
          		peer.Services.Add(lifecycle.Item{
          			Name:  "pieces:trash",
          			Run:   peer.Storage2.TrashChore.Run,
          			Close: peer.Storage2.TrashChore.Close,
          		})
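
The age check implied by trashExpiryInterval can be sketched roughly as follows. This is not the storagenode implementation: expiredFiles and demo are hypothetical helpers, and the sketch assumes that the trash is a plain directory tree and that a file's modification time records when the piece was trashed.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// expiredFiles (hypothetical helper) returns the regular files under root
// whose modification time is at least maxAge before now.
func expiredFiles(root string, maxAge time.Duration, now time.Time) ([]string, error) {
	var expired []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil // skip directories and special files
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if now.Sub(info.ModTime()) >= maxAge {
			expired = append(expired, path)
		}
		return nil
	})
	return expired, err
}

// demo builds a throwaway directory with one fresh and one back-dated file
// and reports how many of them count as expired after 7 days.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "trash-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	now := time.Now()
	fresh := filepath.Join(dir, "fresh")
	old := filepath.Join(dir, "old")
	for _, name := range []string{fresh, old} {
		if err := os.WriteFile(name, []byte("x"), 0o600); err != nil {
			return 0, err
		}
	}
	// Pretend the second file was trashed 8 days ago.
	eightDaysAgo := now.Add(-8 * 24 * time.Hour)
	if err := os.Chtimes(old, eightDaysAgo, eightDaysAgo); err != nil {
		return 0, err
	}
	expired, err := expiredFiles(dir, 7*24*time.Hour, now)
	return len(expired), err
}

func main() {
	n, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("expired files:", n)
}
```

Because the decision depends only on each file's timestamp, the result does not change if the node was restarted or offline in between runs.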

mgonzalezm · July 23, 2023, 3:25pm

Yes, that is why it should run at node startup, as it currently does. But after that, if you don’t restart the node, there is no need to run it so often.

Vadim · July 23, 2023, 4:16pm

It looks like you mixed up the garbage collector with the filewalker.

Alexey · July 24, 2023, 4:07am

There are several similar but different processes:

filewalker to calculate used and free space in the allocation and to update the cache (databases). It is needed to prevent overuse of the allocation and to report the actual used and available space to the satellites;
pieces expiration chore. It goes through the trash and removes expired pieces;
garbage collector chore. It applies a bloom filter to all pieces, moves unmatched to the trash.

This is not a complete list though.
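
To make the garbage collector step more concrete, here is a minimal sketch of applying a bloom filter to a list of stored pieces. It is a generic textbook bloom filter, not the filter format the satellites actually send: the bloom type, its size and hash count, and the collectGarbage helper are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a toy bloom filter with m bits and k hash positions per item.
type bloom struct {
	bits []bool
	k    int
}

func newBloom(m, k int) *bloom {
	return &bloom{bits: make([]bool, m), k: k}
}

// positions derives k bit positions from one 64-bit FNV hash (double hashing).
func (b *bloom) positions(id string) []int {
	h := fnv.New64a()
	h.Write([]byte(id))
	sum := h.Sum64()
	h1, h2 := uint64(uint32(sum)), uint64(uint32(sum>>32))
	pos := make([]int, b.k)
	for i := range pos {
		pos[i] = int((h1 + uint64(i)*h2) % uint64(len(b.bits)))
	}
	return pos
}

func (b *bloom) add(id string) {
	for _, p := range b.positions(id) {
		b.bits[p] = true
	}
}

// contains never returns false for an added id, but it may return true
// for an id that was never added (a false positive).
func (b *bloom) contains(id string) bool {
	for _, p := range b.positions(id) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

// collectGarbage returns the stored pieces that do not match the filter,
// i.e. the ones that would be moved to the trash.
func collectGarbage(stored []string, filter *bloom) []string {
	var trash []string
	for _, id := range stored {
		if !filter.contains(id) {
			trash = append(trash, id)
		}
	}
	return trash
}

func main() {
	// The "satellite" side adds every piece it still tracks.
	filter := newBloom(1024, 3)
	for _, id := range []string{"piece-a", "piece-b"} {
		filter.add(id)
	}
	// The "node" side holds two extra pieces the satellite no longer tracks.
	stored := []string{"piece-a", "piece-b", "piece-c", "piece-d"}
	fmt.Println("to trash:", collectGarbage(stored, filter))
}
```

A bloom filter has no false negatives, so a piece that was added to the filter is never trashed by this check. False positives mean that some garbage can survive a pass and is only caught by a later one.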

mgonzalezm · July 24, 2023, 5:17am

I’m aware of that. I’m talking about the process in your second bullet point, the process that “goes through the trash and removes expired pieces”

And to illustrate my point: a few days ago there was a massive delete from the europe-north satellite, and some of us ended up with TBs of data in the trash. We needed to wait 7 days for it to be cleared, but in the meantime this process ran many times, processing those files.

We need a parameter to set the frequency of this process, rather than having it hardcoded.

Alexey · July 24, 2023, 5:33am

I do not see any better solution so far. You need to check the update timestamp of the piece in the trash and remove it if it is older than 7 days.
The frequency is once a day, which is enough, because your node moves pieces to the trash on deletion requests too, not only on garbage collection.
To illustrate:

# 4 days old
Get-ChildItem w:\storagenode5\storage\trash\ -File -Recurse | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-4)} | Measure-Object -Property Length -Sum | %{"Total {0:f2} MB`ncount {1}" -f ($_.Sum/1e6), $_.Count}
Total 475.45 MB
count 1254

# 3 days old
Get-ChildItem w:\storagenode5\storage\trash\ -File -Recurse | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-3)} | Measure-Object -Property Length -Sum | %{"Total {0:f2} MB`ncount {1}" -f ($_.Sum/1e6), $_.Count}
Total 4239.33 MB
count 56206

So, if I checked only once a week from the node’s start, that data would stay there longer than a week, depending on when I restarted the node.
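
The trade-off between a daily and a weekly run can be worked through with a small simulation. It is only a sketch under simplified assumptions: the simulate function is hypothetical, time is counted in whole hours, runs happen at hours 0, interval, 2*interval and so on, and the piece is trashed 12 hours after the first run.

```go
package main

import "fmt"

// simulate returns the hour of the run that finally deletes a piece and the
// number of earlier runs that visited the piece without deleting it.
// All arguments are in hours; interval must be greater than zero.
func simulate(trashedAt, interval, expiry int) (deletedAt, wastedScans int) {
	for run := 0; ; run += interval {
		if run < trashedAt {
			continue // the piece is not in the trash yet
		}
		if run-trashedAt >= expiry {
			return run, wastedScans // old enough: this run deletes it
		}
		wastedScans++ // visited, but still younger than the expiry
	}
}

func main() {
	const trashedAt = 12 // hours after the first run (assumed)
	const expiry = 7 * 24

	for _, interval := range []int{24, 7 * 24} {
		deletedAt, wasted := simulate(trashedAt, interval, expiry)
		days := float64(deletedAt-trashedAt) / 24
		fmt.Printf("interval %3d h: %.1f days in trash, %d scans before deletion\n",
			interval, days, wasted)
	}
}
```

Under these assumptions the daily run keeps the piece for 7.5 days and visits it 7 times before deleting it, while the weekly run visits it only once but keeps it for 13.5 days.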

striker43 · July 24, 2023, 10:59am

That would be fine for me if I could reduce the IO pressure on my drives by running these processes less often.

Alexey · July 25, 2023, 3:34am

Then you need to modify this piece of code and build your own binary to support this feature.
You may also submit a PR; our team would be glad to accept it. Community contributions are very welcome!

BrightSilence · July 27, 2023, 11:04am

Not anymore. Since a recent change, normal deletions are handled by GC as well.

Alexey · July 28, 2023, 4:14am

Then I cannot explain why the update timestamps of the pieces in the trash are so different. Maybe only if the Garbage Collector is too lazy now and works for days instead of a few hours…

I will check next week.

Toyoo · July 28, 2023, 8:59pm

Frankly, after the change where the collection pass was separated from actually moving the files, I wouldn’t be surprised. I noticed it in this thread and asked, but got no explicit answer.

mgonzalezm · July 29, 2023, 6:41pm

Indeed, and because of that, the config option pieces.delete-to-trash: false no longer makes any difference. Every unneeded file is handled by the retain process and sent to the trash.