People are saying Bloom filters cause high IO:
And this is what I am seeing on my nodes.
Even Storj admits this: https://review.dev.storj.io/c/storj/storj/+/13081 :
A single retain operation is already heavy for a storage node. At the moment it makes no sense to try to do this concurrently.
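To see why a single retain pass is heavy, consider what the node has to do: walk every piece it stores and test each one against the filter, no matter how few pieces were actually deleted. A minimal sketch in Go (names like PieceID, BloomFilter, and moveToTrash are hypothetical placeholders, not Storj’s actual API):

```go
package gc

import "fmt"

// PieceID identifies a stored piece (hypothetical stand-in).
type PieceID [32]byte

// BloomFilter is a stand-in for the filter the satellite sends.
type BloomFilter interface {
	Contains(id PieceID) bool
}

// Retain walks EVERY stored piece and tests it against the filter,
// so its IO cost scales with the total number of pieces on disk,
// not with the number of pieces the customer actually deleted.
func Retain(allPieces []PieceID, filter BloomFilter, moveToTrash func(PieceID) error) error {
	for _, id := range allPieces {
		if filter.Contains(id) {
			continue // the satellite still knows this piece: keep it
		}
		// Not in the filter: presumed deleted, move to trash (more IO).
		if err := moveToTrash(id); err != nil {
			return fmt.Errorf("trash %x: %w", id[:4], err)
		}
	}
	return nil
}
```

The loop alone touches the metadata of every stored piece, which is why the cost grows with node size rather than with deletion volume.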
So the question really boils down to this: should satellites tell nodes right away when a customer deletes a file?
I think they should not. This is extra IO that would compete with the actually important traffic.
I think they should:
I don’t see extra IO caused by direct deletes; it is the other way around: using the Bloom filter as the primary mechanism for deleting files causes extra IO and resources for processing the Bloom filter, retaining, and deleting. And at least the current version of the node software does not care whether the node is currently loaded or not; it processes the Bloom filter regardless of “important traffic” anyway.
But direct deletes have more than one advantage. (By direct delete I mean telling the node what to delete.)
You don’t need a safety net if you have a signed signal from the customer to delete a file.
As with pieces that expire, you would not need a trash folder. So even if “important traffic” were impacted a little, you would save a lot of the unpaid time the garbage remains on the node during Bloom filter creation, transmission, retaining, sitting in the trash, and finally trash cleanup. (A sketch of such a signed delete follows after the third plus below.)
The second big plus is that we had direct deletes before, and we did not have any of the frequent issues we now read about almost every day, like mismatches between satellite and node data or payout questions. Many of the issues we are seeing today are a result of moving from direct deletes to indirect deletion through Bloom filters.
The third big plus is that there is less garbage on the nodes.
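To illustrate the first plus: with a verified signature there is nothing to restore, so no trash step is needed. A minimal sketch, assuming a hypothetical DeleteOrder message and Ed25519 keys (illustrative only, not the actual Storj protocol):

```go
package deletes

import (
	"crypto/ed25519"
	"errors"
	"os"
)

// DeleteOrder is an illustrative message layout, not the Storj protocol.
type DeleteOrder struct {
	PieceID   []byte // which piece to delete
	Signature []byte // signature over PieceID by a trusted key
}

// HandleDelete deletes the piece immediately if the order is authentic;
// a verified, signed order is its own safety net, so no trash is needed.
func HandleDelete(order DeleteOrder, trustedKey ed25519.PublicKey, piecePath string) error {
	if !ed25519.Verify(trustedKey, order.PieceID, order.Signature) {
		return errors.New("invalid signature: ignoring delete order")
	}
	return os.Remove(piecePath)
}
```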
This is because you have a setup not suitable for storage nodes. Metadata should be cached in RAM or on an SSD.
I just don’t think throwing more resources at the nodes is the right way. I have been reading a lot lately that you should add more RAM, add an SSD, cache this, cache that, change the filesystem. That is the Filecoin style, not the Storj idea. But it clearly shows that the system is not working efficiently.
And the thing is, you cannot always upgrade the hardware even if you want to, as with the Odroid HC2.
The fact that your node takes ages means it’s already IOPS constrained, and therefore cannot handle any additional IO caused by the proposed real-time deletions.
I am pretty sure the system would handle deletes that come in randomly and spread out over time better than the constant 24/7 processing of Bloom filters.
The benefit of garbage collection is precisely that it is asynchronous and can be run whenever the load on the system is low. That’s the actual benefit.
But this is not how it is done today: the current implementation does not take the load into account. And that is not a case against direct deletes. By direct deletes I mean telling the node what to delete instead of telling it what to keep; we can talk about how the node could handle that.
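For example, the node could queue incoming delete orders and drain the queue only while it is otherwise idle. A minimal sketch, where isIdle is a hypothetical hook backed by whatever load metric the node already tracks:

```go
package deletes

import (
	"os"
	"time"
)

// deleteWorker drains queued delete orders one at a time, backing off
// whenever the node is busy serving real traffic. isIdle is a
// hypothetical hook (e.g. backed by disk-IO or request metrics).
func deleteWorker(queue <-chan string, isIdle func() bool) {
	for path := range queue {
		for !isIdle() {
			time.Sleep(time.Second) // wait out the busy period
		}
		// One small, spread-out removal instead of a bulk GC pass.
		_ = os.Remove(path)
	}
}
```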
Whether satellites send that signal on to nodes or not, I also wouldn’t mind them logging deletions and sending less frequent delete logs to nodes, like once an hour or even once a day. GC is just a really slow way to do it for nodes.
While I believe the majority of pieces should be deleted right away, you could also pack the information about what to delete into the expiry database, or send logs hourly to the nodes and let them work through those. I agree that with such an async operation, an option to do it in a “lazy” fashion should be offered.
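A minimal sketch of that hourly-log variant, assuming a plain text file with one piece ID per line and a flattened storage layout (the real on-disk layout is nested, so this is only illustrative):

```go
package deletes

import (
	"bufio"
	"os"
	"path/filepath"
	"time"
)

// applyDeleteLog reads one piece ID per line from an hourly delete log
// and removes each piece, pausing between removals so the pass never
// saturates the disk ("lazy" mode).
func applyDeleteLog(logPath, storageDir string, pause time.Duration) error {
	f, err := os.Open(logPath)
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		pieceID := scanner.Text()
		if pieceID == "" {
			continue
		}
		_ = os.Remove(filepath.Join(storageDir, pieceID+".sj1"))
		time.Sleep(pause) // trade speed for low IO pressure
	}
	return scanner.Err()
}
```

Unlike a retain pass, this only touches the pieces that were actually deleted, so the work is proportional to deletion volume rather than to node size.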