My idea is to track that via the order files that the nodes create and send to the satellite for the deletion. Let’s say a customer wants to delete a file with 5000 pieces. The customer’s uplink sends out 500 deletion requests and then quits. The satellite receives 400 deletion confirmations from nodes, so it sends out 5000 - 400 = 4600 deletion requests for the missing pieces. 3000 deletion confirmations come back to the satellite.
Next, the satellite puts the remaining 1600 on logs for the nodes that hold these pieces. 1000 deletion confirmations come back. The remaining 600, for which no confirmation has been received because the node is offline the whole time or whatever, get cleared through the next BF.
My point is that through the deletion confirmations the satellite has a clear picture of which node has deleted which piece.
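To make the tail end of that flow concrete, here is a rough sketch (everything here is hypothetical, it only illustrates the per-node logs plus the BF fallback, not any real satellite code):

```go
package main

import "fmt"

// pieceRef identifies one piece of a deleted file on a specific node.
// Both fields are placeholders for whatever the satellite actually uses.
type pieceRef struct {
	nodeID  string
	pieceID string
}

// escalateDeletes illustrates the proposed flow: whatever the uplink and the
// satellite could not get confirmed directly is written to per-node logs,
// and anything still unconfirmed is simply left for the next bloom filter.
func escalateDeletes(all []pieceRef, confirmed map[pieceRef]bool, writeNodeLog func(node string, pieces []string)) (leftForBF int) {
	perNode := map[string][]string{}
	for _, p := range all {
		if confirmed[p] {
			continue // node already reported the deletion
		}
		perNode[p.nodeID] = append(perNode[p.nodeID], p.pieceID)
	}
	for node, pieces := range perNode {
		// publish a deletion log the node can fetch on its own schedule
		writeNodeLog(node, pieces)
		leftForBF += len(pieces) // until confirmed, the BF remains the safety net
	}
	return leftForBF
}

func main() {
	all := []pieceRef{{"node1", "a"}, {"node1", "b"}, {"node2", "c"}}
	confirmed := map[pieceRef]bool{{"node1", "a"}: true}
	n := escalateDeletes(all, confirmed, func(node string, pieces []string) {
		fmt.Println("log for", node, pieces)
	})
	fmt.Println("still unconfirmed, covered by next BF:", n)
}
```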
I generally don’t see a problem with that. If the customer’s uplink will not be involved anyway, then we could easily wait a little while before the satellite sends out deletion requests. It is what @BrightSilence suggested
So the customer would not be affected. But also the other suggestion:
or
Maybe that could help. I really don’t know how much this feature is used by customers, so it is hard to estimate the impact, i.e. whether those pieces should be left for GC or whether they’d better be deleted like normal pieces.
This looks like a full table scan of terabytes of rows per query. Unless you add an index, but then you’re increasing the database size and increasing the time to handle uploads… probably more than twice each, due to the fact that satellites do not even store piece IDs now.
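Just to make the concern concrete, a purely hypothetical sketch — neither this table nor these columns exist on the real satellite, which stores per-segment node aliases rather than piece IDs:

```go
package main

import "fmt"

// The kind of schema such per-piece tracking would need, and why it costs on upload.
const hypotheticalSchema = `
CREATE TABLE segment_pieces (
    node_id   BYTEA NOT NULL,
    piece_id  BYTEA NOT NULL,
    stream_id BYTEA NOT NULL
);
-- Without this index, "which pieces does node X hold?" is a full table scan
-- over billions of rows; with it, every uploaded segment has to write one
-- extra index entry per piece on top of the segment row itself.
CREATE INDEX segment_pieces_node_idx ON segment_pieces (node_id);
`

const lookupPiecesForNode = `SELECT piece_id FROM segment_pieces WHERE node_id = $1;`

func main() {
	fmt.Println(hypotheticalSchema)
	fmt.Println(lookupPiecesForNode)
}
```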
Guys, I’m pretty dumb and not following all the details, but WHAT IF…
When a file is deleted by a customer, the satellite shoots a delete request to the nodes. Totally asynchronously, it doesn’t even need a response; there is no response from the node.
Instead of the node actually deleting the file, it just inserts an entry into the piece expiration DB, the same one used for TTL test data. This way it can still have the 7 day cooling off period after a delete, if that’s still desired.
Bloom filters would still be unavoidable in this situation to handle offline nodes, etc.
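Roughly what I mean on the node side, as a sketch (I’m guessing at the table layout, this is not the node’s real schema):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3" // SQLite driver, just for the example
)

// markPieceDeleted records a satellite-initiated delete as an expiration
// entry instead of removing the piece right away. Table and column names
// are assumptions for illustration, not the node's real schema.
func markPieceDeleted(db *sql.DB, satelliteID, pieceID string, coolOff time.Duration) error {
	_, err := db.Exec(
		`INSERT INTO piece_expirations (satellite_id, piece_id, piece_expiration) VALUES (?, ?, ?)`,
		satelliteID, pieceID, time.Now().Add(coolOff).UTC(),
	)
	return err
}

func main() {
	db, err := sql.Open("sqlite3", "piece_expiration_example.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS piece_expirations (
		satellite_id TEXT, piece_id TEXT, piece_expiration TIMESTAMP)`)
	if err != nil {
		log.Fatal(err)
	}

	// 7 days matches the cooling-off period mentioned above.
	if err := markPieceDeleted(db, "satellite-1", "piece-abc", 7*24*time.Hour); err != nil {
		log.Fatal(err)
	}
}
```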
Then a file with multiple copies would be deleted completely if the customer deletes a single copy. Bad news.
Why? It could just move to trash like normal. I’d rather have it delete automatically, but exactly the issue mentioned above is probably an example of why these should be in trash anyway.
So bloom filter generation is expensive, but how expensive is it for the satellite to, say, keep track of deletes over a day per node and broadcast the results to nodes daily?
I can tell you that’s what they are made for. I have worked with a database from a hospital which received a value from many devices every second, so that database measured in the terabytes.
But queries on it still usually ran within seconds. So, no, not really an argument for me.
I do not like to have errors there. Too dangerous.
That’s the point, it’s asynchronous too, so the probability of a race condition would remain. The deletion request checked the metadata and did not find any indication of a server-side copied segment, so it issued the deletion request to the nodes. Then the server-side copy request finally updated the metadata (with the correct timestamp for the operation itself, it just arrived a little later). And boom, the data has disappeared. If the SNO has also disabled the trash, or it has been emptied, the data is gone.
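To illustrate the ordering, a toy sketch of the interleaving (not real satellite code, just the check-then-act pattern described above):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// metadata stands in for the satellite's segment metadata.
type metadata struct {
	mu     sync.Mutex
	copies map[string]bool // segment -> has a server-side copy
}

func main() {
	md := &metadata{copies: map[string]bool{}}
	var wg sync.WaitGroup
	wg.Add(2)

	// Deletion request: checks metadata first, then tells nodes to drop pieces.
	go func() {
		defer wg.Done()
		md.mu.Lock()
		hasCopy := md.copies["segment-1"]
		md.mu.Unlock()
		if !hasCopy {
			time.Sleep(10 * time.Millisecond) // nodes receive the delete meanwhile
			fmt.Println("pieces of segment-1 deleted on nodes")
		}
	}()

	// Server-side copy: started earlier, but its metadata update lands later.
	go func() {
		defer wg.Done()
		time.Sleep(5 * time.Millisecond)
		md.mu.Lock()
		md.copies["segment-1"] = true // copy now points at pieces that are already gone
		md.mu.Unlock()
		fmt.Println("server-side copy of segment-1 committed")
	}()

	wg.Wait()
}
```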
The satellite does not track single pieces, it’s too expensive and would affect performance; it works with segments. The piece ID can be calculated, however the calculation itself is too heavy as well. This is one of the reasons why we changed Graceful Exit (earlier the satellite had to calculate a list of nodes to which the exiting node should send pieces, and it was a relatively heavy, resource consuming operation). Also this method would require another calculation process similar to the BF one.
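For context, this is the per-piece calculation in question: the piece ID a node holds is derived from the segment’s root piece ID and the piece number. A sketch using the helpers in storj.io/common/storj (treat the exact signatures as an assumption):

```go
package main

import (
	"fmt"

	"storj.io/common/storj"
)

func main() {
	// Normally the root piece ID is read from the segment metadata;
	// here we just generate one for the example.
	rootPieceID := storj.NewPieceID()

	nodeID, err := storj.NodeIDFromString("12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S")
	if err != nil {
		panic(err)
	}

	// One derivation per piece, per node — this is the work that adds up
	// when doing it for every deletion.
	for pieceNum := int32(0); pieceNum < 3; pieceNum++ {
		fmt.Println(pieceNum, rootPieceID.Derive(nodeID, pieceNum))
	}
}
```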
Unfortunately I do not see how it can improve the process. It sounds like a more frequent BF, and we would continue to increase the size and frequency of their calculations.
However, I will notify the team about your proposal.
This is still the same problem which I have described several times. It’s resource consuming; the satellite doesn’t work with pieces, it works with segments, but piece IDs can be calculated. However, doing so for deletions is too heavy and needs to be queued. We did it, it was option 2.
And as I said multiple times, direct deletions could have a race condition with the server-side copy. The only feasible way is to queue these deletion requests and execute them only with a huge delay, like several hours, processing only the records older than that cutoff, not immediately.
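A sketch of what such a delayed queue could look like (names are made up, it only shows the cutoff logic):

```go
package main

import (
	"fmt"
	"time"
)

// queuedDelete is a hypothetical record in a satellite-side deletion queue.
type queuedDelete struct {
	pieceID  string
	nodeID   string
	queuedAt time.Time
}

// dueDeletes returns only the entries that have sat in the queue longer than
// the configured delay, so that any server-side copy that was in flight when
// the delete was requested has had time to land in the metadata.
func dueDeletes(queue []queuedDelete, delay time.Duration, now time.Time) []queuedDelete {
	cutoff := now.Add(-delay)
	var due []queuedDelete
	for _, q := range queue {
		if q.queuedAt.Before(cutoff) {
			due = append(due, q) // old enough: re-check metadata and send
		}
	}
	return due
}

func main() {
	now := time.Now()
	queue := []queuedDelete{
		{"piece-a", "node-1", now.Add(-30 * time.Minute)},
		{"piece-b", "node-2", now.Add(-7 * time.Hour)},
	}
	for _, q := range dueDeletes(queue, 6*time.Hour, now) {
		fmt.Println("ready to delete:", q.pieceID, "on", q.nodeID)
	}
}
```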
And we come back to the same process which currently generates the BF, which is likely more lightweight than what you have suggested: it doesn’t require a queue, coordination, retries, or dealing with whatever communication problems come up.
As it turned out, it’s popular, especially because of how rclone deals with moves: it uses a server-side feature of the S3-compatible backend (because S3 doesn’t support server-side move) to copy files to the destination, then issues deletions for the copied files. This generates a sequence of a server-side copy and a deletion of the source, but since it’s asynchronous, the deletion request (with a correct timestamp after the copy timestamp) can arrive before the server-side copy has updated the metadata.
Many people use the rclone mount command and use the bucket almost as a filesystem, and rclone does a great job of emulating it and it’s pretty fast.
Of course, if they used rclone with a native Storj backend, then rclone would use server-side move (because it’s supported by the Storj backend), which is free of this issue.
It’s a little bit better, but it still doesn’t solve the problem with the race condition, it still requires a separate queue and piece ID calculations, and it still requires communicating with nodes on behalf of the customer, which is resource consuming and error prone; it also uses bandwidth and increases the load on the satellite at runtime.
However, I do not like the idea of using a TTL database for that, because the node will not notify the satellite that it has processed the requested piece from the trash, since a TTL piece never goes to the trash.
So it should be similar to the GC process and should move the piece to the trash instead.
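Something along these lines on the node side, i.e. move to a dated trash folder instead of unlinking (paths and names are made up for the example):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// moveToTrash is a sketch of the GC-like behaviour suggested above: instead
// of unlinking the piece (as TTL expiration does), move it into a dated
// trash folder so it can still be restored.
func moveToTrash(blobsDir, trashDir, pieceFile string) error {
	day := time.Now().UTC().Format("2006-01-02")
	target := filepath.Join(trashDir, day, pieceFile)
	if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
		return err
	}
	return os.Rename(filepath.Join(blobsDir, pieceFile), target)
}

func main() {
	if err := moveToTrash("blobs", "trash", "piece-abc.sj1"); err != nil {
		fmt.Println("move failed:", err)
	}
}
```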
The number of 500 is an example. It can be 1, it can be 1000, it can be n. It can be whatever number is feasible for the customer. It does not matter. What matters is what my example should have shown: that the satellite deals with the rest.
Also, as I have said multiple times, my suggestion is that there is no need to retry if the nodes can download the log from somewhere. And, as said multiple times as well, there is a final Bloom filter in place which will catch anything that should not be there.
I have said multiple times that direct deletes don’t have to be real-time. As a SNO I would be happy to have deletions after 24 hours, for example, compared to weeks like it is today.
So I would be absolutely fine with a delay of hours
The suggestion with logs doesn’t differ from a BF in my opinion. It’s the same “log”, just mathematically formed; however, I would agree that it covers the pieces which should remain, not the ones which should be deleted. But it requires the same resources to keep it updated (maybe more, and likely way more) plus an additional overhead of communication with the nodes, plus managing an additional queue and a separate chore (re$ource$).
I shared the ideas from this thread with the team. If they make sense, I would expect someone to answer here.
I wouldn’t get too excited. My posts aren’t magic and the challenges are real. Case in point.
I’d first like to say that I don’t advocate for retrying at all. If the node is offline when it’s sent the first time, bad luck, GC will take care of it. A delete chore on the satellite with an appropriate delay to ensure it’s aware of server-side copies would still have my preference, especially because of your other comment about the server-side copy being finished asynchronously.
It should be 0 though. The problem is that even if it were only 1 node per segment, that’s a lot of extra communication for large delete operations. It would significantly slow them down. There is also no p2p network available for the nodes to then propagate that info to other nodes. Kademlia was removed ages ago.
It differs quite a lot if you ask me. Delayed explicit deletions would:
be much faster to create and send to nodes, even if you take a slight delay for server side copy into account
not require the node to walk all pieces
not leave 10% behind
not suffer from bloom filters that are too small, leaving even larger percentages behind
Perhaps you are right. If bloom filter generation could be sped up to be created in about a day and run once every two days with sufficient size to leave no more than 10% behind, maybe we wouldn’t be having this discussion. But in reality, this has been a problem for many months now. And there is an additional downside of lots of IO for processing bloom filters on the node side, while explicit delete instructions don’t require walking the files. In my opinion, this alone is reason enough to use explicit deletes, because then you could actually lower the frequency of heavy bloom filter processing.
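For a sense of scale, the standard bloom filter sizing formula applied to that 10% figure (the numbers here are only illustrative, not actual production values):

```go
package main

import (
	"fmt"
	"math"
)

// bloomSizeBytes uses the standard sizing formula m = -n*ln(p)/(ln 2)^2
// to estimate the filter size for n pieces at false positive rate p,
// i.e. the fraction of deleted pieces a BF can fail to flag.
func bloomSizeBytes(pieces, falsePositiveRate float64) float64 {
	bits := -pieces * math.Log(falsePositiveRate) / math.Pow(math.Ln2, 2)
	return bits / 8
}

func main() {
	// A node holding 20 million pieces, with up to 10% of garbage left behind:
	fmt.Printf("%.1f MB\n", bloomSizeBytes(20e6, 0.10)/1e6) // ≈ 12 MB
}
```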
I understand that it comes with challenges, but I also don’t want the advantages to be downplayed. There is clearly a lot of upside to having some form of direct delete processing.
It could be done so that these logs are not sent directly to the node, but uploaded to the Storj network as a file with a TTL of 7 days, such that only that node can download it. The node would then grab it from the network whenever it wants. Why can’t we use the greatest network made for storing data to store a small piece of data it needs itself? The same way statistics and all the other stuff that has to be delivered to the node could be handled; no need to make it p2p all the time.
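Roughly like this with the uplink library: publish the per-node log as an object with a 7-day TTL (the bucket, key layout and access handling are assumptions, not an existing mechanism; the uplink calls are as I understand the current Go library):

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"storj.io/uplink"
)

func main() {
	ctx := context.Background()

	access, err := uplink.ParseAccess(os.Getenv("ACCESS_GRANT"))
	if err != nil {
		log.Fatal(err)
	}
	project, err := uplink.OpenProject(ctx, access)
	if err != nil {
		log.Fatal(err)
	}
	defer project.Close()

	// One object per node per day, e.g. deletion-logs/<nodeID>/2024-06-01
	key := "deletion-logs/" + os.Getenv("NODE_ID") + "/" + time.Now().UTC().Format("2006-01-02")

	upload, err := project.UploadObject(ctx, "node-logs", key, &uplink.UploadOptions{
		Expires: time.Now().Add(7 * 24 * time.Hour), // the 7-day TTL from the post
	})
	if err != nil {
		log.Fatal(err)
	}
	if _, err := upload.Write([]byte("pieceid-1\npieceid-2\n")); err != nil {
		log.Fatal(err)
	}
	if err := upload.Commit(); err != nil {
		log.Fatal(err)
	}
}
```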