Bloom filter (strange behavior)

jammerdan · May 18, 2024, 4:41am

I don’t understand that. Wasn’t the whole idea of storing the bloomfilter on disk to resume on a node restart? And the feature that introduced it was completed:

github.com/storj/storj

storagenode/{pieces,blobstore}: save-state-resume feature for GC filewalker

committed 09:56PM - 02 Apr 24 UTC

profclems

+276 -77

The bloomfilter is now stored on the disk from this change https://review.dev.st…orj.io/c/storj/storj/+/12651, so we don't lose the bloomfilter on restart. But that means that when the bloomfilter on the disk is loaded and added to the retain queue, it will start scanning the directories all over again. This will continuously be the behavior till the node is able to complete a full scan which might never happen on a large node. In this patch, we make sure the node can resume from where it left off. We read the two-letter directory names (or prefixes) from the satellite blobs folder and sort them alphabetically before scanning. During the scan, the last scanned prefix is stored in the db. Updates https://github.com/storj/storj/issues/6708 Change-Id: Icb32cdc7dd49ef8ce44f6d771e4e33045078ed55

It also clearly states:

The bloomfilter is now stored on the disk from this change
https://review.dev.storj.io/c/storj/storj/+/12651, so we don’t lose
the bloomfilter on restart. But that means that when the bloomfilter
on the disk is loaded and added to the retain queue, it will start
scanning the directories all over again. This will continuously be the
behavior till the node is able to complete a full scan which might never
happen on a large node.
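
To make the resume idea from the commit message concrete (two-letter prefixes sorted alphabetically, last scanned prefix stored in the db), here is a minimal sketch in Go. It is only an illustration of the idea, not the storagenode code: `remainingPrefixes` and its parameters are hypothetical names, and the persisted prefix is passed in as a plain string instead of being read from a database.

```go
package main

import (
	"fmt"
	"sort"
)

// remainingPrefixes returns the prefixes that still need scanning, in
// alphabetical order. lastScanned is the prefix that was persisted before
// the restart; an empty string means no progress was saved.
// Hypothetical helper, not the actual storagenode implementation.
func remainingPrefixes(prefixes []string, lastScanned string) []string {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]string(nil), prefixes...)
	sort.Strings(sorted)
	if lastScanned == "" {
		return sorted // nothing saved: full scan
	}
	// Find the first prefix that sorts strictly after lastScanned.
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i] > lastScanned })
	return sorted[i:]
}

func main() {
	prefixes := []string{"zz", "a2", "7q", "b4"}
	fmt.Println(remainingPrefixes(prefixes, ""))   // no saved state: scan everything
	fmt.Println(remainingPrefixes(prefixes, "a2")) // resume after "a2"
}
```

With something like this, a restart only costs the prefix that was in progress, not the whole scan.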

And why assume that a GC run has failed when a node gets restarted? That is not necessarily the case; a node can be restarted for hundreds of reasons. The bloomfilter should be removed only when the GC has actually failed and keeping it would just mean that it continuously keeps scanning and failing.

If, as you say, https://review.dev.storj.io/c/storj/storj/+/12806 is the resume part, does that mean it is not just for the used-space filewalker, as its title says,

storagenode/pieces: save-state-resume feature for used space filewalker

but also for the GC-filewalker? So at the moment no filewalker can really resume after a node restart, is that correct?

If this is the case, then the other posts about errors with gc-filewalkers or trash problems are explained: the feature is simply not finished.

It was my impression that nodes do in fact get restarted regularly, because of the updates roughly every 14 days. In general it would be better to assume that nodes get restarted often. There are hundreds of reasons for a node operator to restart a node, from a Windows update restart to simply changing the allocated space. If you assume that nodes do not get restarted, this will lead to the same situation as before, when state such as the bloomfilter or the progress of the filewalkers was not saved. It would be better to assume that nodes can get restarted at any time and should then resume whatever they were doing before.

I hope you do.

Because it has now turned out that it does not “just” prevent the used-space filewalker from finishing but also the gc-filewalker, which makes it harder to get rid of the garbage and to get correct numbers for occupied space and trash.

And please consider adding separate log entries for the start and the resume of a filewalker. This would help to verify from the logs that it is in fact resuming and not starting from scratch.
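
As a rough illustration of what separate log entries could look like, here is a small Go sketch. The function name `walkerLogLine` and the message wording are made up for this example and are not existing storagenode log lines; the only point is that the message differs depending on whether a saved prefix exists.

```go
package main

import "fmt"

// walkerLogLine builds the log line for a filewalker run.
// lastScanned is the saved prefix; empty means there is no saved state.
// Name and message text are hypothetical, chosen only for illustration.
func walkerLogLine(walker, lastScanned string) string {
	if lastScanned == "" {
		return fmt.Sprintf("%s started from the beginning", walker)
	}
	return fmt.Sprintf("%s resumed after prefix %q", walker, lastScanned)
}

func main() {
	fmt.Println(walkerLogLine("gc-filewalker", ""))   // fresh start
	fmt.Println(walkerLogLine("gc-filewalker", "b4")) // resume from saved state
}
```

Two distinct messages like these would let an operator grep the logs and see immediately whether a run resumed or started over.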