I don’t see what exactly causes the high load. Maybe it is a misunderstanding; let’s try to clear that up.
First of all, I am under the impression that the satellite infrastructure has improved compared to the times you are mentioning, so maybe it will handle this better today.
Second, it is hard to imagine that the satellites can handle millions of upload and download requests but cannot send out deletion requests.
But even if that is the case, direct deletes do not necessarily mean real-time deletes. I would see real-time as the optimal solution, but what I want to emphasize with direct deletes is telling the node exactly which piece to delete, instead of having it scan all pieces to find the ones that should not be there.
It would be astounding to me if the satellites cannot provide that information.
Third, my understanding is that the creation of the Bloom filter is itself a heavy process. If I remember correctly, in the past we were always told that sending Bloom filters more frequently is not possible because it is too heavy to run the required processes more often.
I can remember that there were issues with the client when direct deletes were in place. However, if you read my suggestion, it made clear that I am totally fine with an async operation, so that the client receives the delete confirmation as quickly as possible and the process continues in the background. So there is no change in that.
Yes, you are right. I forgot that the process on the node can make use of the lazy filewalker if it is not disabled. I have said that this should be available as an option for async operation. But that’s not an argument against direct deletes. Direct deletes should still result in less Bloom filter creation, less retaining, less garbage, less trash, and faster deletion.
But why? You are basically saying the system can handle uploads and downloads in real time but not a deletion request. This does not make sense to me, because I don’t see how a deletion request differs from an upload or download request, and the satellites receive millions of those and can obviously handle them.
I also don’t see why the satellites or the client would have to wait for anything. As said, direct deletes do not necessarily mean real-time:
The customer sends a deletion request to the satellite. For the customer side we could stop here: as soon as the deletion request is received, we can signal the customer that the file is deleted. The process can continue in the background, carried out by the satellite.
However, as an idea to reduce load on the satellite, the satellite could, similar to upload or download requests, provide a list of node IPs and pieces to the client. With that list, the client can contact the nodes directly with the deletion request. Again, it is my understanding that this is exactly what we already do when a client wants to upload or download.
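Just to illustrate what I mean, here is a purely hypothetical sketch, not Storj’s actual API (the names `DeleteTarget`, `sendDelete` and `fanOutDeletes` are made up): the client receives the list from the satellite and fires the delete requests at the nodes concurrently, without waiting on any individual node.

```go
// Hypothetical sketch: client-side fan-out of delete requests, analogous to
// how uploads/downloads contact nodes from a satellite-provided list.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// DeleteTarget is one (node, piece) pair the satellite returned for the segment.
type DeleteTarget struct {
	NodeAddr string // node address as provided by the satellite
	PieceID  string // piece the node should drop
}

// sendDelete stands in for the actual RPC; here it only simulates a quick ack.
func sendDelete(ctx context.Context, t DeleteTarget) error {
	select {
	case <-time.After(10 * time.Millisecond): // pretend network round trip
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// fanOutDeletes contacts all nodes concurrently and tolerates individual
// failures: nodes that miss the request would still be cleaned up later by
// the existing garbage collection.
func fanOutDeletes(ctx context.Context, targets []DeleteTarget) {
	var wg sync.WaitGroup
	for _, t := range targets {
		wg.Add(1)
		go func(t DeleteTarget) {
			defer wg.Done()
			if err := sendDelete(ctx, t); err != nil {
				fmt.Printf("delete to %s not acked, GC will catch piece %s\n", t.NodeAddr, t.PieceID)
			}
		}(t)
	}
	wg.Wait()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	fanOutDeletes(ctx, []DeleteTarget{
		{NodeAddr: "node-a:28967", PieceID: "piece-1"},
		{NodeAddr: "node-b:28967", PieceID: "piece-2"},
	})
}
```

The point of the sketch is only that the deletion request can be treated like an upload or download request: the satellite hands out the targets, the client contacts them, and nobody has to wait for the actual removal.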
Even here I think we don’t have to wait for anything, as long as we can make sure that the process continues in the background. It would be enough for the client to send the deletion request and for the node to acknowledge that it has received it and put it in a queue (the expiry database). From there it can fulfill the request independently, exactly like it deletes pieces whose expiry is stored in the expiry database. For that we also do not need any client connection.
The same goes for deletion requests that come from the satellite: we do not have to wait until the piece is finally deleted, we only transmit the request to the node.
When the node finally deletes the piece, it should create an order file and send that to the satellite. This serves as a receipt that the piece has been deleted, and the satellite can then mark it as gone in its database.
At least this is the way I see it.
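To make the node-side idea more concrete, here is a rough, hypothetical sketch (the names `expiryQueue`, `handleDeleteRequest` and `backgroundDeleter` are made up, and an in-memory slice stands in for the real expiry database): the node acknowledges immediately, queues the piece, deletes it later in the background without any client connection, and keeps a receipt that could be reported back to the satellite.

```go
// Hypothetical sketch of the node side: acknowledge immediately, queue the
// piece in an expiry-style queue, delete in the background, and keep a
// receipt that could later be sent to the satellite as proof of deletion.
package main

import (
	"fmt"
	"sync"
	"time"
)

// expiryQueue stands in for the node's expiry database: pieces queued here
// are deleted later by a background worker, with no client connection held.
type expiryQueue struct {
	mu     sync.Mutex
	queued []string
}

func (q *expiryQueue) enqueue(pieceID string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.queued = append(q.queued, pieceID)
}

func (q *expiryQueue) drain() []string {
	q.mu.Lock()
	defer q.mu.Unlock()
	batch := q.queued
	q.queued = nil
	return batch
}

// handleDeleteRequest is the fast path: record the request and ack right away.
func handleDeleteRequest(q *expiryQueue, pieceID string) string {
	q.enqueue(pieceID)
	return "acknowledged: " + pieceID // client/satellite can move on immediately
}

// backgroundDeleter periodically drains the queue, removes the pieces
// (simulated here) and collects receipts for the satellite.
func backgroundDeleter(q *expiryQueue, receipts chan<- string, stop <-chan struct{}) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, pieceID := range q.drain() {
				// the actual file removal (e.g. os.Remove) would happen here on a real node
				receipts <- "deleted: " + pieceID
			}
		case <-stop:
			return
		}
	}
}

func main() {
	q := &expiryQueue{}
	receipts := make(chan string, 10)
	stop := make(chan struct{})
	go backgroundDeleter(q, receipts, stop)

	fmt.Println(handleDeleteRequest(q, "piece-1"))
	fmt.Println(handleDeleteRequest(q, "piece-2"))

	// Receipts arrive later and could be reported to the satellite,
	// which then marks the pieces as gone in its database.
	fmt.Println(<-receipts)
	fmt.Println(<-receipts)
	close(stop)
}
```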
That’s directly related to the unpaid trash discussion. Of course one benefit would be to reduce the large amounts of unpaid garbage. Imagine Storj had to pay $10/TB for it; I am sure you would find a different solution very quickly.
No, I don’t think so. The context was the question of whether direct deletes cause the same amount of work as the use of Bloom filters. And obviously, you have criticized in the past that Bloom filters consume a lot more IOPS. So Bloom filters seem to have that as a general disadvantage compared to direct deletes. That’s also why you are trying to mitigate it with the new cache. However, this does not change the inherent disadvantage of Bloom filters, which was the point I was trying to make.