Drop Bloom filters, bring back direct deletes

This is not related to the satellite infrastructure; it's related to the fact that all deletion requests would go directly from the customer and require synchronous coordination with all nodes. The satellite here is a coordination point, nothing else.
So the customer's uplink would need to submit synchronous requests to all nodes that keep their data (pieces of the segments of their objects), which adds a significant delay and makes for a bad customer experience.
If we offload this as a background job to the satellite, it increases the load on the satellite. It's too expensive an operation at the moment, and again requires coordination.
We tried every possible combination. Direct deletions are not the way to go at the moment. And there is no good solution so far.

No, with uploads and downloads the customers' expectations are aligned with the fact that these operations cannot be performed immediately (within a second). But they expect that a deletion should not take hours (as it did with direct deletions)! Just measure how long it takes your node to delete pieces from the trash, pieces it already knows in advance should be deleted. Each operation has to be confirmed, i.e. it is synchronous. Now multiply that by 80 for each segment, because the uplink has to contact every node again. And if a node is offline, there would be a timeout. That's not acceptable for the customer.
Running thousands of such synchronous background jobs is a very high load on the satellite (who can keep up with 24k nodes?).
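
To illustrate, a toy sketch (not our real code; the RPC, the timeout and the latencies are made up): with a synchronous, confirmed delete, a single unreachable node already pins one segment's delete to the full timeout.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// deletePiece stands in for a confirmed delete RPC to one storage node
// (hypothetical, not the real uplink API).
func deletePiece(ctx context.Context, online bool) error {
	if !online {
		<-ctx.Done() // an offline node only "answers" when the timeout fires
		return ctx.Err()
	}
	time.Sleep(50 * time.Millisecond) // a responsive node
	return nil
}

func main() {
	const piecesPerSegment = 80
	start := time.Now()

	var wg sync.WaitGroup
	for i := 0; i < piecesPerSegment; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			defer cancel()
			online := i != 0 // pretend one node of the 80 is unreachable
			_ = deletePiece(ctx, online)
		}(i)
	}
	wg.Wait()

	// A single unreachable node stretches this one segment's confirmed delete
	// to the full timeout; multiply by the number of segments in the object.
	fmt.Println("confirmed delete of one segment took", time.Since(start))
}
```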

1 Like

Almost correct. The point you are mixing up here is time. Yes, in the past Bloom filters caused more IOPS. That time is over now: thanks to the badger cache it is now twice as fast on my nodes.

1 Like

The way I saw it, maybe we need both systems.

In case there is a new/unknown bug introduced into the system, use GC.

For daily operation, use direct deletion, since it can be done in batches and asynchronously; there's no way it's inferior to GC, right?

As far as I know it has never been put that way, that you will always be able to participate with low-end hardware, although they advise against buying hardware solely for STORJ. They also never told you it would be a set-and-forget project and that you would never need to adapt a thing.

This doesn't prove it's working inefficiently; it shows that some hardware can't keep up with the load. Some changes might increase efficiency and alleviate the load a bit, but since the intent is to let the network grow, there will be an end to that option. Again, efficiency improvements might contribute a bit to the solution, like the badger cache and so on.

And yeah, it once worked another way; but those were other times, with a smaller network.

It does, because it's being run with lower priority than the main storagenode process. We could debate how low the priority should be (but then the run might take too long), and how often these Bloom filters should be issued and in what way. But essentially they're here to stay. Even a deletion log is essentially the most exhaustive Bloom filter you could imagine. And even then you might miss out on a log, and then what?

Actually, telling the node which data is expected to be there is quite efficient.
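
Roughly how a retention Bloom filter works, as a minimal sketch (my own toy code, not the storagenode implementation):

```go
// The satellite enumerates the pieces a node *should* have, so the node only
// needs to walk its own files and trash anything the filter does not
// recognise. False positives just mean a little garbage survives until the
// next filter; live pieces are never trashed.
package main

import (
	"fmt"
	"hash/fnv"
)

type bloom struct{ bits []bool }

func newBloom(size int) *bloom { return &bloom{bits: make([]bool, size)} }

func (b *bloom) positions(id string) (int, int) {
	h := fnv.New64a()
	h.Write([]byte(id))
	sum := h.Sum64()
	n := uint64(len(b.bits))
	return int(sum % n), int((sum >> 32) % n) // two cheap "hash functions"
}

func (b *bloom) Add(id string) {
	i, j := b.positions(id)
	b.bits[i], b.bits[j] = true, true
}

func (b *bloom) MayContain(id string) bool {
	i, j := b.positions(id)
	return b.bits[i] && b.bits[j]
}

func main() {
	// Satellite side: add every piece that is still referenced.
	filter := newBloom(1 << 16)
	for _, live := range []string{"piece-A", "piece-B"} {
		filter.Add(live)
	}

	// Node side: walk local pieces, keep matches, trash the rest.
	for _, onDisk := range []string{"piece-A", "piece-B", "piece-deleted"} {
		if filter.MayContain(onDisk) {
			fmt.Println("keep ", onDisk)
		} else {
			fmt.Println("trash", onDisk)
		}
	}
}
```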

To put it mildly, I only use EOL devices. Up till now, even my SMR 2.5" 5400rpm drives with a load of sector errors and zone errors are able to keep up for over a month after switching to ZFS with meta on SSD. Probably ZFS with L2ARC, bcachefs with meta-SSD or any filesystem on LVM + hotspot read cache might work too.

But in the end, you should consider making some changes. That's not disregarding opportunities to increase efficiency, but going back to delete notifications really is inefficient in many ways.

Server-side copies. For each delete you need to check whether the customer has another reference to the removed segment somewhere, and maybe not send the delete request to the nodes after all.

This was a conscious decision by Storj some time ago to make deletions faster.

Seems so. Having a separate index just to track whether a given piece is part of multiple files might effectively double the database in size. That would be a lot of resources.

Contacting one entity and then some tens of entities is slower than contacting just one entity, due to plain old network latency. Besides, you need to contact all nodes, not just the fast ones as with uploads and downloads.

Consider that for uploads and downloads data transfer has to happen. This makes the design of uploads and downloads a trade-off between latency and bandwidth. Latency for contacting the nodes is higher. But total bandwidth is also higher. You don’t care about bandwidth for deletions. You do for data transfers.

1 Like

I mean, this info doesn't have to be stored in a database; it could be stored directly in memory (/dev/shm) and flushed to disk whenever convenient. No distributed system is needed for this; the data in /dev/shm or on disk could be lost and everything would still be fine (we still have GC).

This could also be optimized: we could keep connections alive so no connection setup is needed (right now we have 25k nodes, which can be handled on a single machine). Even if it scales to 1M machines, we could still optimize it. I'm thinking of a hash function to determine which machine should connect to which Storj server, where any server can join and leave at will (Elasticsearch calls this consistent hashing).
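
Something like this minimal consistent-hashing sketch is what I have in mind (all names are made up, nothing Storj ships today): node IDs are hashed onto a ring of servers, so servers can join or leave while only a fraction of the nodes get remapped.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	hashes  []uint32
	servers map[uint32]string
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(servers []string, replicas int) *ring {
	r := &ring{servers: map[uint32]string{}}
	for _, srv := range servers {
		for i := 0; i < replicas; i++ { // virtual points smooth the load
			h := hash32(fmt.Sprintf("%s#%d", srv, i))
			r.hashes = append(r.hashes, h)
			r.servers[h] = srv
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// pick returns the server responsible for a given storage node ID.
func (r *ring) pick(nodeID string) string {
	h := hash32(nodeID)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.servers[r.hashes[i]]
}

func main() {
	r := newRing([]string{"api-1", "api-2", "api-3"}, 100)
	for _, node := range []string{"node-aaa", "node-bbb", "node-ccc"} {
		fmt.Println(node, "->", r.pick(node))
	}
}
```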

Right now that would mean ~14 TB of RAM for the biggest satellite. And failing to update it in permanent storage quickly enough after a server-side copy is made (which is a database update) would lead to data loss.

Why would a customer keep 25k connections open all the time?

1 Like

I misunderstood. If deletes go from the customer directly to SNOs, then that changes things (I really need to understand more about how it works).

This is not my suggestion. My suggestion is that the customer's uplink submits a deletion request to at least one node, receives an acknowledgement, and the deletion goes on in the background. No need to keep the connection to the customer open.

The same goes for offloading this to the satellite. No need to keep the connection open. The request just tells the node to delete a specific file. The node acknowledges that it has received the request, the connection is closed. Done.

Again, this is not my suggestion. My idea is that after n acknowledgements from nodes (could be a single node) the customer can close the connection while the process goes on in the background. This way the customer experience is that deletion is immediate. In my idea the node only has to acknowledge that it has received the deletion request and will perform it. The customer does not have to wait until it is finished.

And if it is too many requests for the satellite to handle, write them to a log and let the nodes download it.

This is the relevant quote for me. The use of Bloom filters is worse and requires additional caches and code improvements to make them faster.
Maybe with additional effort direct deletion would be even faster again. Who knows.
But as also said, direct deletions have more advantages for the nodes than just that.

This is what I said when I described my vision: first delete directly, then put the deletion in a log for the nodes that were missed, and finally clear all remaining pieces with a Bloom filter, because there could also be a bug again that puts files on the node that should not be there.

The claim is "use what you have", and there are some minimum recommendations/requirements. But there is no requirement for specific file systems or even an SSD, yet we hear these suggestions very frequently now. And it is not necessarily about low-end hardware, as datacenter-grade hardware is obviously suffering as well:

How does it not prove that the implementation is not efficient?

And they want to go to exabyte scale, which we aren't even close to yet, and we are already seeing problems left and right. That's why we need efficient implementations, not implementations where 10 TB nodes already start to suffer.

You did not completely read my suggestion. One suggestion was to make the logs available to the node for downloading instead of sending them. This way a node cannot miss them. But even then, in my suggestion there is a final Bloom filter to clear out what should not be there, as there can also be bugs that put files on the node that should not be there.

2 Likes

We had that. But we disabled direct deletions for the reasons already explained; they affect the customer in both of the previous options: direct deletions from the client's uplink affect customers directly, while offloaded background jobs affect the satellites, making them slower for customers, which affects customers indirectly.
So we are unlikely to bring them back; we need to speed up operations for the customers, not slow them down.
There is work in progress to optimize and speed up the garbage collector and Bloom filters instead, because this process will remain anyway. Even with direct deletions it was needed to clean up pieces on nodes which were offline when the direct request was sent.

1 Like

And how is it supposed to be deleted on the others? You need to delete 80 pieces of each segment of each file. This means you need to contact 80 nodes, multiplied by the number of segments and by the number of parallel deletions. And you could get a timeout from some of them.

And here is the timeout again: the node could hang, be offline, overloaded or restarting. This adds latency to the entire job, which usually increases resource usage.

It still adds latency when the node doesn't ack the request or acks it too late (because the node is overloaded, for example).

We cannot avoid BFs, because even in your case the node could be offline and miss the deletion request.

Also, direct deletions may cause data loss due to a race condition between the server-side copy feature and the deletion request: they could be requested in parallel, and even a small latency in one of the processes could cause a situation where the deletion request reaches the nodes before the copy request is saved to the database.

For example, rclone does exactly this when it's configured with the Storj S3 backend and you request to move some objects from one bucket/prefix to another. Since the S3 specification doesn't have a server-side move, rclone does its best: it issues a server-side copy, then a deletion of the source. Since it can run multiple threads in parallel, those requests may hit the satellite API at slightly different times (but with correct timestamps), so there is a non-zero probability that the deletion request arrives a little bit earlier than the server-copied object is converted to a normal copy.

This can happen because a server-side copy doesn't copy data; it creates a new metadata record with a pointer to the source data, otherwise it would not be so fast. When you then request a deletion, the piece deletion should be skipped: only the previous metadata record should be removed and no deletion request sent to the nodes. But how would the satellite know about this nuance, if the information that this segment has more than one metadata record is not yet written to the database, and the deletion process just read that there is only one metadata record?
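
A simplified sketch of that race (hypothetical names, not our satellite schema). Both handlers are correct in isolation; run in parallel, the delete can read "one record" before the copy's new metadata record lands, and pieces get removed that the copy still needs.

```go
package main

import "fmt"

// metadata stands in for the satellite's segment metadata store.
type metadata struct {
	refs map[string]int // segment -> number of metadata records pointing at it
}

// handleServerSideCopy only adds another pointer to the same segment data.
func (m *metadata) handleServerSideCopy(segment string) {
	m.refs[segment]++ // step C
	fmt.Println("copy recorded, references now:", m.refs[segment])
}

// handleDelete decides whether the nodes should drop the pieces.
func (m *metadata) handleDelete(segment string) {
	if m.refs[segment] <= 1 { // step A: reads the old reference count
		fmt.Println("only one record found, telling ~80 nodes to delete") // step B
	} else {
		fmt.Println("other records exist, removing only this metadata record")
	}
	m.refs[segment]--
}

func main() {
	m := &metadata{refs: map[string]int{"segment-1": 1}}
	// An rclone-style move issues copy and delete nearly simultaneously.
	// One possible interleaving, replayed sequentially here for clarity:
	m.handleDelete("segment-1")         // A+B: sees one record, pieces get deleted
	m.handleServerSideCopy("segment-1") // C: the copy record arrives just too late
	// The copy now points at pieces the nodes have already removed.
}
```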

Of course we did our best to avoid race conditions, but with delayed deletions, combined with the trash folder, it's pretty safe for the customers' data.

I read that, but you can have a node offline for over 30 days without being disqualified. So then there must be additional bookkeeping on when each node downloaded which log?

Although, I must say, this probably is a more viable option compared to delete notifications for every piece. Not least because the data density might be considerably higher than a notification for every delete. In exchange, the Bloom filters could be created less frequently, let's say twice a month. Perhaps even made downloadable, so a node can't miss out on one due to an update, maintenance or whatever other reason for being temporarily offline.

I'm not sure I fully understand that, so I'll share my idea in more detail. I'm not certain if it's feasible, but here's how I envision it:

Each node should have a bucket on Storj DCS that is created when the node is set up. The bucket name could be the NodeID, and the satellite would create and manage this bucket.

The node would have permission to download and delete files in its bucket. When the node comes online, it checks its bucket for files like log1, log2, and log3. It downloads these files and processes them, meaning it deletes the respective pieces.

For every file it deletes, the node creates an order file and sends it to the satellite. This way, the satellite can keep track of which pieces have been deleted from each node and use that for the less frequent Bloom filter creation (my idea was every 31 days) or for whatever purpose.
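
As a rough sketch of what the node could do with those log files (every helper here, listLogs, downloadLog, deletePieceLocally, sendOrderToSatellite, is a placeholder I made up, not an existing Storj API):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// processDeletionLogs is what a node could run on startup: fetch the log
// files from its own bucket (named after its NodeID), delete the listed
// pieces locally, and report each deletion back so the satellite can account
// for it in the next, less frequent Bloom filter.
func processDeletionLogs(nodeID string) error {
	for _, logName := range listLogs(nodeID) { // e.g. log1, log2, log3
		body, err := downloadLog(nodeID, logName)
		if err != nil {
			return err // try again later; GC still covers anything missed
		}
		scanner := bufio.NewScanner(strings.NewReader(body))
		for scanner.Scan() {
			pieceID := strings.TrimSpace(scanner.Text())
			if pieceID == "" {
				continue
			}
			if err := deletePieceLocally(pieceID); err != nil {
				continue // leave it for the Bloom filter
			}
			sendOrderToSatellite(nodeID, pieceID) // proof of deletion
		}
	}
	return nil
}

// Placeholders standing in for bucket access and local piece storage.
func listLogs(nodeID string) []string                 { return []string{"log1"} }
func downloadLog(nodeID, name string) (string, error) { return "piece-1\npiece-2\n", nil }
func deletePieceLocally(pieceID string) error         { return nil }
func sendOrderToSatellite(nodeID, pieceID string)     {}

func main() {
	if err := processDeletionLogs("NodeID-example"); err != nil {
		fmt.Println("will retry:", err)
	}
}
```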

Oh, so you are dropping the important information on purpose? Yeah, I think that conversation isn't going anywhere if you just ignore the recent code changes and keep talking about very old code that none of us is running in production anymore.

Yes, because it was a general statement. I cannot help you if you cannot understand that.

1 Like

That means the buckets are being stored on the Storj network. Since every deleted piece subsequently creates 51-80 new pieces, dispersed across the network and deleted afterwards… you've created a huge amplification problem now.

If you were to consider something like this, it should look something like:

  1. The node or the satellite keeps track of the timestamp of the last deletion log.
  2. Either the deletion log is created beforehand on the satellite (taking space and sometimes unnecessary computation time if the log isn't downloaded, but it can run as a background task) or it is created on request.
  3. The log is just a list of piece IDs plus a creation timestamp that can be used for the next creation (see point 1 and the sketch at the end of this post).

In exchange for the process above, the Bloom filter could be created less frequently to relieve satellites and nodes.
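
As a sketch, such a log record would only need to carry piece IDs and a creation timestamp (hypothetical types, not an existing schema):

```go
package main

import (
	"fmt"
	"time"
)

type deletionLog struct {
	NodeID    string
	CreatedAt time.Time // stored and reused as "since" for the next log
	PieceIDs  []string
}

// nextLog builds the follow-up log covering everything deleted since the
// previous one, so nothing is sent twice and nothing is skipped.
func nextLog(prev deletionLog, deletedSince func(time.Time) []string) deletionLog {
	return deletionLog{
		NodeID:    prev.NodeID,
		CreatedAt: time.Now(),
		PieceIDs:  deletedSince(prev.CreatedAt),
	}
}

func main() {
	prev := deletionLog{NodeID: "node-1", CreatedAt: time.Now().Add(-24 * time.Hour)}
	log := nextLog(prev, func(since time.Time) []string {
		return []string{"piece-1", "piece-2"} // whatever was trashed since `since`
	})
	fmt.Println(log.NodeID, log.CreatedAt.Format(time.RFC3339), log.PieceIDs)
}
```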

1 Like

Yes, that is exactly the question: how to keep the deletion going in the background.
I have made suggestions for that:

  1. The customer's uplink sends out as many deletion requests as it is comfortable with. Could be 0, 1 … n
  2. The customer's uplink passes the request to at least 1 node and the nodes pass it on p2p: 1, 2, 4, 8, 16, … This is where we would definitely need an acknowledgement by the node so the request does not get lost (see the sketch after this list).
  3. The satellite sends out the deletion requests on behalf of the customer's uplink
  4. The satellite puts the information about which pieces to delete into a log and the nodes download this log
  5. Always: Final clear-out with Bloom filter
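
To illustrate suggestion 2, a toy model of the p2p fan-out (purely illustrative, nothing like this exists in the storagenode today): each node that just learned about the delete forwards it to two fresh peers, so coverage grows roughly 1, 2, 4, 8, 16, … per round and ~80 nodes are reached in a handful of hops.

```go
package main

import "fmt"

func main() {
	const totalNodes = 80 // roughly the pieces of one segment
	const fanout = 2      // peers each node forwards to

	reached := 1  // the uplink contacted one node directly and got its ack
	frontier := 1 // nodes that learned about the delete in the last round

	for round := 1; reached < totalNodes; round++ {
		// every frontier node forwards to `fanout` nodes that haven't seen it
		newlyReached := frontier * fanout
		if reached+newlyReached > totalNodes {
			newlyReached = totalNodes - reached
		}
		reached += newlyReached
		frontier = newlyReached
		fmt.Printf("round %d: %d new nodes, %d/%d reached\n",
			round, newlyReached, reached, totalNodes)
	}
}
```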

Yes, but we may not even need the acknowledgement from the node. I think we should definitely have it if it is p2p, but maybe not even then, as long as the satellite knows about the file deletion request. The nodes that have successfully received the request will delete the piece and send orders to confirm the deletion, so the satellite knows what has been deleted and what is still there. Those that have not received the deletion request from the uplink will receive a deletion request or deletion log from the satellite, and if they miss even that, the final delete happens with a Bloom filter.

The most important ack is from the satellite. Then it may be enough that the customer's uplink sends out as many requests to nodes as it is comfortable with, without waiting for an acknowledgment.
But even if we wait for 1 (the fastest) acknowledgement, it would be as fast as downloading a file today. I don't know what the average time to first byte on downloads is today, but an acknowledgement from a single node should not take longer.

Yes I see that.
Let me first quote this:

I don't know how quickly the satellite can determine which pieces to delete or when it would be safe to do so. Ideally it would pass the correct list to the customer's uplink, so that it would be safe to delete anything that the uplink has received. If that is not possible and we rely on the satellite to coordinate the deletion anyway, then I would say the satellite should know which pieces can be deleted and which pieces need to stay, and either send out requests or prepare the logs accordingly.

This is the most complicated suggestion. How should the satellite keep track of that? Or do you suggest that the client marks the segment as deleted on the satellite AND also sends 80*100500 requests to the nodes? They would likely end up in a situation where these notifications to the nodes finish only after several minutes/hours… Not a good experience, I would say.

The typical TrueNAS user has at least 20,000 objects (up to hundreds of millions), because their Cloud Sync uploads every single file to the bucket. It's not actually a backup replacement, it's a sync.
So they would need to wait an incredible amount of time just to notify all 80 nodes per segment (I'm not even talking about getting feedback from each node)…

We would be forced to add some wait time to be sure that there are no straggling server-side copy requests followed by deletion requests. That would either affect the customer directly with option 1, or indirectly with option 2.
Also, just estimate the waiting time for the customer with TrueNAS.
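
A back-of-the-envelope estimate (the latency, parallelism and per-object segment count are assumptions for illustration only):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		objects          = 20_000
		segmentsPerObj   = 1   // small files: one segment each
		piecesPerSegment = 80  // nodes to notify per segment
		parallelism      = 100 // concurrent requests the uplink keeps open
	)
	perRequest := 50 * time.Millisecond // assumed round trip, no timeouts

	totalRequests := objects * segmentsPerObj * piecesPerSegment
	wallClock := time.Duration(totalRequests/parallelism) * perRequest

	fmt.Printf("requests to send: %d\n", totalRequests)      // 1,600,000
	fmt.Printf("optimistic wall-clock time: %s\n", wallClock) // ~13 minutes
	// Any offline node adds a full timeout on top of this, and a satellite
	// doing the same job carries this cost for every customer at once.
}
```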

I feel like I should respond to this since my posts have been quoted several times in this topic. And I think the snippets may lack some of the nuance of where I stand.

First off, yes, I would still really like for there to be a better way to deal with deletes directly on the nodes, instead of just relying on the slow and clunky GC for everything. But I also understand the complications.

Performance for the customer is paramount. This ultimately means that the only thing that should happen between uplink and satellite is the uplink saying: “I want to remove these files” and the satellite responding: “Consider them gone”. Anything else by definition has to be done asynchronously from the communication with uplink. This is inherently different from uploads and downloads, because by definition there is communication between uplink and nodes there. There shouldn’t be for deletes.

So that leaves it up to the satellite to tell the nodes about those deletes after it has concluded the exchange with the uplink. The satellite doesn’t normally communicate things like that to the nodes, so that’s why this would be a bigger task for the satellite than dealing with uploads and downloads.

Add to that the complication of server side copy causing multiple metadata records to reference the same pieces. This means that if the satellite gets a delete, it may not actually want to delete the underlying data. And a database lookup to see if there are other records pointing to the same segment is expensive to do.

There may be ways around this by having a small deletion cache on the satellite and running a separate chore to go through that cache, check for copies and propagate the information to the nodes. But there is clearly some dev effort required and it costs system resources.
Alternatively, whenever a server-side copy is made, the metadata of records pointing to that segment could be updated to indicate that, and I would be fine if, only for those records, direct deletes are ignored and GC takes care of them. This would trade off a small additional update on creation of a server-side copy, but would save the satellites from having to do such lookups for every delete.
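
Something like this is what I mean, as a sketch with made-up field and function names (not the actual satellite schema): mark the metadata when a copy is created, and let deletes skip the direct path for marked segments so GC handles them.

```go
package main

import "fmt"

type segmentRecord struct {
	StreamID  string
	SegmentID string
	HasCopies bool // set once, when the first server-side copy is created
}

// onServerSideCopy does the one extra write traded off above.
func onServerSideCopy(original *segmentRecord) segmentRecord {
	original.HasCopies = true
	return segmentRecord{StreamID: "copy", SegmentID: original.SegmentID, HasCopies: true}
}

// onDelete no longer needs a reference-count lookup for the common case.
func onDelete(rec segmentRecord) {
	if rec.HasCopies {
		fmt.Println("skip direct delete; garbage collection will reclaim it")
		return
	}
	fmt.Println("send direct delete for segment", rec.SegmentID)
}

func main() {
	plain := segmentRecord{StreamID: "original", SegmentID: "seg-1"}
	onDelete(plain) // direct delete: no copies ever existed

	copied := onServerSideCopy(&plain)
	onDelete(plain)  // now falls back to GC
	onDelete(copied) // the copy too
}
```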

I think something like that should be built to prevent wasting space waiting for the very slow and often insufficient garbage collection process, which should really only be a fallback in case the node was offline or for some other reason has pieces it shouldn’t have. But I’m not going to pretend this is an easy thing to fix or suggest complicated workarounds. I’m pretty sure Storj Labs would have much better insight in how to solve this in the most efficient way than anything I can come up with.

4 Likes

@Alexey:
Can you reply to my proposal? I think the computation time of deletion logs might be considerably less than that of Bloom filters, so one could be exchanged for the other.

It's still not a direct delete, which from many perspectives isn't desirable in my opinion anyway (the uplink or satellite must arrange too many requests, too many requests for the node, …).

But a query like "SELECT GROUP_CONCAT(piece_id ORDER BY piece_id ASC SEPARATOR '\n') FROM satellitepiecestables WHERE (trashed_timestamp BETWEEN '$sLastTimeStamp' AND '$sCurrentTimeStamp') AND (node_id='$sNodeID')" shouldn't be too problematic, I think; output it to the storagenode or to a file, with $sCurrentTimeStamp on the first line.
The storagenode could ask for it every 24-48h? Or it could be pushed to it every 24-48h, with a registry keeping track of when each node received its last deletion log?

If in exchange the Bloom filters have to be created only once or twice a month, this could be a win for the satellites as well, I assume. Even the gc-filewalker might be run with lower priority, which is in favour of slowly progressing nodes.

1 Like