I am seeing more Bloom filters now.
On slower nodes this can lead to a situation where I see multiple Bloom filters for the same satellite:
ls /config/retain
ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1722016799998993000.pb
ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1721584799999920000
They were received only days apart. I’m struggling to understand how the node handles them. The main question is: are two (or more) Bloom filters per satellite really necessary? From my understanding of Bloom filters, the most recent one should be sufficient, containing all the files the node needs to retain. So why process the older BFs first, followed by the newer ones? Doesn’t this approach lead to unnecessary and avoidable I/O operations?
Each of them will have different false positives, so it’s not like one contains the other. But yeah, from my point of view this is not worth the additional I/O effort. At the cost of increased code complexity, both could be traversed together though, and that would be nifty!
Yes it would be nice if the node would just store the bloom filters locally and then process all of them together at a specified time. For example if I could configure the node that it always starts working on bloom filters every Saturday at 12.
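A minimal sketch of that idea in pseudo-Python, assuming a hypothetical process_stored_filters() helper that runs retain for whatever filters have been saved to disk (nothing like this exists in the node today, it just illustrates the schedule):

import datetime
import time

def seconds_until_saturday_noon(now):
    # weekday(): Monday == 0, Saturday == 5
    days_ahead = (5 - now.weekday()) % 7
    target = (now + datetime.timedelta(days=days_ahead)).replace(
        hour=12, minute=0, second=0, microsecond=0)
    if target <= now:  # already past Saturday noon, wait for next week
        target += datetime.timedelta(days=7)
    return (target - now).total_seconds()

while True:
    time.sleep(seconds_until_saturday_noon(datetime.datetime.now()))
    # hypothetical helper: process every bloom filter currently
    # stored under /config/retain, back to back
    process_stored_filters("/config/retain")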
The time delta for the US1 filters in the last wave seems to be about 2 days, which may be significant. For me, one of my nodes deleted 2.2M pieces on 07-26, 1.4M pieces on 07-27, and 0.3M pieces on 07-29, just on US1.
The current architecture is push from the satellite, so if a node is offline there is another opportunity later for it to get the bloom filter.
I do wonder, from a bloom generation perspective, whether it is cheaper to generate one huge bloom filter every week than multiple smaller filters each week.
With attempts like the recent badger cache the IO impact shouldn’t be too large, with move operations taking up the bulk of the time.
That would be fine for me. Keeping some trash a few days longer doesn’t matter much to me. But with this approach my HDDs would not be at 100% usage all week and could be more responsive to requests, which would lead to less energy consumption and better success rates.
why process the older BFs first, followed by the newer ones?
@toyoo responded to this and @Ambifacient showed how each filter deleted pieces.
Doesn’t this approach lead to unnecessary and avoidable I/O operations
We recently changed from processing 3 bloom filters concurrently to processing 1 at a time, to reduce the IO performance impact.
This is available in the latest published version.
The limit seems to be 25 MB atm. However, that is still too low for the largest available HDDs.
Computing larger bloom filters requires more infrastructure resources and may cause struggles for some storage nodes.
We are trying to find the sweet spot between deleting as much garbage as possible and avoiding issues with their computation and processing.
We will constantly increase the size to find the best value.
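For a rough sense of scale (my own back-of-the-envelope numbers, assuming the ~10% target false positive rate mentioned elsewhere in this thread, not official figures): the standard Bloom filter sizing formula needs about 4.8 bits per piece at a 10% rate, so a 25 MB filter only covers on the order of 40 million pieces before the real false positive rate starts climbing.

import math

def bloom_bits_per_element(p):
    # optimal bits per element for a target false positive rate p:
    # m/n = -ln(p) / (ln 2)^2
    return -math.log(p) / (math.log(2) ** 2)

filter_bytes = 25 * 1000 * 1000                  # current ~25 MB cap
bits_per_piece = bloom_bits_per_element(0.10)    # ~4.79 bits at 10% FPR
max_pieces = filter_bytes * 8 / bits_per_piece
print(f"{bits_per_piece:.2f} bits/piece, ~{max_pieces / 1e6:.0f} M pieces")
# prints roughly: 4.79 bits/piece, ~42 M pieces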
Not what I meant. Process any bloom filter immediately, but if another one comes, it should first join the efforts of the previous one and operate on the two-letter directories that were not yet scanned by the first one, and only then do its job on the two-letter directories that were already finished by the first one. Same with more bloom filters.
For example, in pseudo-Python code:
import collections
import os

to_scan = collections.defaultdict(list)

# each time a bloom filter comes:
bloom_filter = BloomFilterObject()
for dir in ('22', …, 'zz'):          # every two-letter piece directory
    to_scan[dir].append(bloom_filter)

# a separate goroutine running the actual bloom filter scanning:
while to_scan:
    dir, bloom_filters = to_scan.popitem()
    for file in os.listdir(dir):
        for filter in bloom_filters:
            # delete only pieces the filter does not claim and that are
            # older than the filter's creation time
            if not filter.contains(file) and file_older_than(file, filter):
                os.remove(os.path.join(dir, file))
                break
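With this queue, each two-letter directory is listed only once per batch of pending filters, no matter how many filters have piled up for it, so the extra filters cost a few more in-memory lookups instead of extra directory scans.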
This seems like an interesting problem with which the community might be able to help. Last time this topic was discussed, it was mentioned by Storjlings that it’s a matter of memory usage. I’ve proposed an approach to solve this before, maybe it could be of help to you?
You probably haven’t seen this post, but this statement has been challenged before.
Maybe I have a misconception about how Bloom filters are processed, or you are misunderstanding me, because this impacts what I am talking about.
My basic understanding is that if my node misses a Bloom filter it does not really matter, because the next time it receives a Bloom filter it processes basically the same pieces. Maybe in a different order or something, but in the end the result is the same: pieces that should no longer be on the node will get deleted by one Bloom filter or the other. If that is not correct, then of course my idea cannot work. But it was my impression that the node can miss Bloom filters.
So when I have two or more different Bloom filters for one satellite it processes all of them one after another, scanning all files for each Bloom filter. And for each Bloom filter of the same satellite, it goes through all files again.
Instead my idea is to discard the earlier Bloom filter and to traverse the files only once using only the latest Bloom filter.
Edit: AFAIK we store the Bloom filters on the node disk for 2 reasons: to be able to resume after a node restart and to better accommodate slower nodes. Maybe we also need to keep in mind that Bloom filter sizes are constantly increasing.
And now you have the same slower nodes, larger Bloom filters, and those nodes having to run multiple times through the same pieces. This sounds like a lot of IO that could be avoided.
They could work if the total number of pieces was reduced by a previous filter. But if the previous filter hasn’t been processed yet, the next one would be either the same size or vary a little bit (because the TTL collector may reduce the total number of pieces and new uploads may increase it).
They all have a false positive rate though, which should be around 10%. But in practice it is now often much higher due to reaching the limit on BF size. They are generated with different seeds, so they leave different pieces of garbage behind. Processing both of them would clean up more as a result of that. Much more if that false positive rate is high.
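A rough illustration of why processing both still helps, assuming the two filters’ false positives are independent because of the different seeds (the 30% figure is just an arbitrary example of a rate inflated by the size cap):

fpr_nominal = 0.10        # target false positive rate per filter
fpr_capped = 0.30         # example of a rate inflated by the size cap
# with independent false positives, only garbage that slips past both filters survives:
print(fpr_nominal ** 2)   # ~0.01 -> about 1% of garbage left after both passes
print(fpr_capped ** 2)    # ~0.09 -> still much better than ~30% after one pass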
let’s say we’re processing retain request 325, and while we’re processing it, 326, 327, and 328 come in. is there any value to keeping 326 and 327? i think we can just throw them away and use 328
I don’t think we keep older ones, unless we were already processing the older request. New requests replace older ones in the queue.
So in this case, we will continue processing 325, and the newer request 328 will be in the queue; i.e. 327 will replace 326, and 328 will replace 327.
If the comments are about that, then I am wondering why I am seeing more than one Bloom filter for a satellite on some of my nodes.
If a node is processing a BF, it will continue processing it as it has already started, but it will replace the older BFs in the queue with newer BFs if those older ones are not being processed.
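A minimal pseudo-Python sketch of how I understand that queueing rule (the names are mine, not the actual storagenode code): one slot per satellite for the waiting filter, plus whatever filter is already mid-processing.

queued = {}        # satellite_id -> newest bloom filter waiting to run
processing = {}    # satellite_id -> filter currently being worked on

def on_new_filter(satellite_id, bloom_filter):
    # a filter that is already processing is never interrupted;
    # a filter that is merely queued is simply replaced by the newer one
    queued[satellite_id] = bloom_filter

def next_filter(satellite_id):
    # called when the node is ready to start the next retain run
    bf = queued.pop(satellite_id, None)
    if bf is not None:
        processing[satellite_id] = bf
    return bf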
So you are saying that when I see 2 Bloom filters of a single satellite, one of them is the one that is currently being processed?
I’ll have to check if I can see that on my nodes.
Yes, exactly. If the BF is already being processed, it will not be replaced in the queue with a newer BF. Only the not-yet-processed old ones would be replaced by the newest.
This would solve some of the concerns from my original post, which is great.
At least for the BFs that are currently not being processed there would be no piling up, as only the last Bloom filter would be saved, as suggested. That’s really good.
So the only questions left would be:
Could the newest BF somehow be merged with the one that is currently processing?
Should the current process be stopped, the current Bloom filter deleted, and the process restarted only with the new one?
Should, at least when the long-running retain process gets interrupted and restarted (maybe after an update or so), the process continue only with the newest Bloom filter?