Drop Bloom filters, bring back direct deletes

I have been looking at how Bloom filters are used and how they perform on my nodes, and I have to say I am starting to question whether this approach makes sense.

I may be mistaken, but from what I understand the entire process goes like this:

  1. A customer deletes a file, but the corresponding pieces don’t get deleted right away from the nodes. Instead, the deletion request gets held back on the satellites, which concentrates its eventual impact instead of spreading it out over time.
  2. Then, after a while, we’ve got a massive computation required to create Bloom filters, which don’t actually tell the nodes what to delete, but rather what to keep. That seems like a roundabout way of doing things.
  3. And then, days later, these Bloom filters get sent out to the nodes, which have to scan their entire collection of pieces, one by one, to figure out what to delete (a minimal sketch of this retain pass follows below). That’s a lot of unnecessary work, if you ask me.
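
To make the “what to keep, not what to delete” point concrete, here is a toy retain pass in Go. This is purely an illustration (the filter type, sizes, and piece IDs are made up, not Storj’s actual code), but it shows why the node has to walk every stored piece and test it against the filter:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomFilter is a toy fixed-size filter; real implementations size m and k
// from the expected element count and the target false-positive rate.
type bloomFilter struct {
	bits []bool
	k    int
}

func newBloomFilter(m, k int) *bloomFilter {
	return &bloomFilter{bits: make([]bool, m), k: k}
}

// positions derives k bit positions for an ID via double hashing of one FNV hash.
func (b *bloomFilter) positions(id string) []int {
	h := fnv.New64a()
	h.Write([]byte(id))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	out := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		out[i] = int((h1 + uint32(i)*h2) % uint32(len(b.bits)))
	}
	return out
}

func (b *bloomFilter) Add(id string) {
	for _, p := range b.positions(id) {
		b.bits[p] = true
	}
}

// MaybeContains returns false only when the ID is definitely not in the set;
// true means "probably keep" (false positives leave some garbage behind).
func (b *bloomFilter) MaybeContains(id string) bool {
	for _, p := range b.positions(id) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	// The satellite builds the filter from the pieces the node SHOULD keep...
	keep := newBloomFilter(1024, 3)
	keep.Add("piece-A")
	keep.Add("piece-B")

	// ...and the node must walk ALL of its stored pieces and test each one.
	for _, stored := range []string{"piece-A", "piece-B", "piece-C"} {
		if keep.MaybeContains(stored) {
			fmt.Println(stored, "kept")
		} else {
			fmt.Println(stored, "moved to trash")
		}
	}
}
```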

I see the effects of this process every day on my nodes. Some of them take a week or more to process a single Bloom filter from a single satellite, which means constant, uninterrupted disk load, 24/7. And with concurrency set to 1, we’re only processing one Bloom filter at a time, which means a growing queue of filters from the same or other satellites waiting to be processed. And only after the retain pass finishes do the pieces finally get deleted from the disk, in a seemingly arbitrary order.

Now, I’m not sure about you, but to me, this process seems… inefficient. We’re seeing problems everywhere: Bloom filters are too small, data isn’t getting deleted, IO on the nodes is high, satellites require too many resources, and space calculations are off and don’t match the satellites. Payout questions come up… It’s like we’re trying to solve a problem but actually creating more problems in the process.

That’s why I think it’s time for us to take a step back and reconsider the Bloom filter approach. For me, going back to direct deletes, where nodes receive a request or a list of pieces to delete and can act on it right away, would solve a lot of the issues we’re seeing.

I don’t think I am alone in this. At least these statements sound supportive:

I really think the Bloom filter approach needs to be reconsidered and a better solution needs to be found.

I could not resist asking an AI to rewrite my post Trump style, just for fun. Click below to find out and enjoy: :joy:

My posting Trump style. Proceed with a sense of humor..........

"Folks, let me tell you, I’ve been looking at these Bloom filters, and I gotta say, they’re a total disaster. A complete failure. Nobody knows more about filters than I do, believe me.

"So here’s what’s going on. A customer deletes a file, and what do we do? We don’t delete it right away. No, we hold it back, we queue it up, we make it wait. And then, after a while, we’ve got this massive computation, this huge process, to create these Bloom filters. And what do they do? They don’t even tell the nodes what to delete! They tell them what to keep! Can you believe it? It’s like a big league waste of time, folks.

"And then, days later, these filters get sent out to the nodes, and they’ve got to scan their entire collection of pieces, one by one, to figure out what to delete. It’s like a never-ending nightmare, folks. A total nightmare.

"I see it on my nodes every day. They’re taking a week, two weeks, sometimes even longer to process a single filter. And the disk load is constant, 24/7. It’s like a big league problem, folks. A huge problem.

"And let me tell you, folks, nobody, nobody, is better at solving problems than I am. I know more about problems than any expert, any expert, folks. And I’m telling you, this Bloom filter thing is a total loser.

"We need to go back to direct deletes, folks. We need to make it simple, make it easy, make it fast. We need to tell the nodes what to delete, and that they need to do it now. Not later, not after a week, not after a month. Now.

"And I’m not the only one who thinks this, folks. BrightSilence, a great guy, a terrific guy, he’s been saying the same thing. He’s saying we need to run some logic on the satellite, determine which segments are fully deleted, and send that list to the nodes. That’s a great idea, folks. A tremendous idea.

"So let’s make it happen, folks. Let’s make deletions great again. Let’s make them so efficient, so fast, so powerful, that they’ll be the envy of every other system out there. Believe me, it’s going to be huge.

"And let me tell you, folks, I’m not just talking about Bloom filters. I’m talking about the whole system. We need to make it better, we need to make it faster, we need to make it more efficient. We need to make it win, folks. We need to make it win bigly.

“So let’s get to it, folks. Let’s make it happen. Let’s make Storj great again!”

5 Likes

This is a negligible amount of work if metadata is cached. Deleting specific pieces would be a similar amount of work – but to do that, each node would need to receive a massive list of deleted objects (either every node receives the full list of deleted objects, or the satellite customizes the list for every node, which is a lot of wasted effort and bandwidth either way). Doing it in real time just spreads out the load and competes with useful traffic; see below.

A Bloom filter is a compact data structure that probabilistically accomplishes what’s needed. If a lot of stuff is left behind, it only means that a larger Bloom filter is needed. It will still be massively smaller than any direct alternative.
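
For a rough sense of scale, a back-of-the-envelope comparison using the standard Bloom filter sizing formula. The piece counts, the 10% false-positive rate, and the 32-byte piece ID size here are assumptions for illustration, not Storj’s actual parameters:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	nKeep := 15_000_000.0   // pieces the node should keep (assumed figure)
	nDeleted := 1_000_000.0 // pieces deleted since the last run (assumed figure)
	p := 0.10               // acceptable false-positive rate (assumed)

	// Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits.
	mBits := -nKeep * math.Log(p) / (math.Ln2 * math.Ln2)
	fmt.Printf("Bloom filter over the keep set: ~%.0f MB\n", mBits/8/1e6)

	// Direct alternative: a list of the deleted piece IDs (32 bytes each assumed).
	fmt.Printf("Direct list of deleted pieces:  ~%.0f MB\n", nDeleted*32/1e6)
}
```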

So the question really boils down to – shall satellites tell nodes that the customer deleted the file right away?

I think they should not. This is extra IO that would compete with the actually important traffic.

This is because you have a setup that is not suitable for storage nodes. Metadata should be cached in RAM or on an SSD. Then it all takes literally minutes on a node of tens of terabytes.

The fact that your node takes ages means it’s already IOPS-constrained, and therefore it cannot handle any additional IO caused by the proposed real-time deletions without further sacrificing quality of service and degrading the performance of uploads and fetches.

The benefit of garbage collection is precisely that it is asynchronous and can be run whenever the load on the system is low. That’s the actual benefit.

3 Likes

I agree with most of what Trump said, but why can’t the Bloom filters be per folder?

1 Like

You can increase it back to 5. Not sure that it would speed up the process though, if one BF takes a week to complete on your node…

It’s per satellite already. Or do you also mean per the first two bytes of the PieceID?

Just an idea; I have no idea if it would help.

1 Like

But I think there was that CPU issue back then, when it had to process more than one Bloom filter from one satellite:

So no, I cannot increase it back.

Even with direct deletes, Bloom filters are still required.

There is no guarantee that the storage nodes are available at the time of deletion… maintaining a queue for the delete operations, retrying again and again…

That’s more error-prone. Whether the deletion is sync or async, it doesn’t make the BF unnecessary…

I would rather optimize the piece metadata access (which would make BF processing and used-space processing more reliable and faster)

9 Likes

I do not think that it’s CPU-constrained; so far it has always been IOPS-constrained…
If you use top, it should show iowait.

1 Like

I second that. I was initially criticizing the removal of direct deletes because Bloom filters consume a lot more IOPS. I am willing to reconsider that statement given that we now have a badger cache that might be able to overcome the IOPS penalty and could even be faster than the old system. (I still need to run some tests to verify that.)

Also, there are a lot of improvements in the upload code path that make it possible for my nodes to maintain a high upload success rate even while running garbage collection.

4 Likes

7 posts were merged into an existing topic: Badger cache: are we ready?

Maybe the filewalker could be optimized instead. Instead of reading each file one by one, it could read metadata in batches.
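
As a rough illustration of what batching could mean (a sketch under assumptions of mine, not the actual filewalker code): read directory entries in chunks of a few hundred per call instead of doing one walk callback and metadata lookup per file.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// statOneByOne walks the tree and receives one FileInfo per file,
// effectively one metadata lookup per piece.
func statOneByOne(dir string) (count int64, err error) {
	err = filepath.Walk(dir, func(path string, info os.FileInfo, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if !info.IsDir() {
			count++
		}
		return nil
	})
	return count, err
}

// countBatched reads directory entries in chunks of 512 per call instead of
// one round trip per entry; per-file metadata would only be fetched when needed.
func countBatched(dir string) (int64, error) {
	f, err := os.Open(dir)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var count int64
	for {
		entries, err := f.ReadDir(512)
		for _, e := range entries {
			if !e.IsDir() {
				count++
			}
		}
		if err == io.EOF {
			return count, nil
		}
		if err != nil {
			return count, err
		}
	}
}

func main() {
	a, _ := statOneByOne(".")
	b, _ := countBatched(".")
	fmt.Println("walked recursively:", a, "– batched (top level only):", b)
}
```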

2 Likes

@jammerdan, can you first address the concerns related to direct deletes from these posts?

@Alexey
Can we move discussion about badger cache setup, configuration and use away from this thread into its own thread?

2 Likes

People are saying Bloom filters cause high IO:

And this is what I am seeing on my nodes.

Even Storj admits this: https://review.dev.storj.io/c/storj/storj/+/13081 :

Single retain operation is already heavy for storage node. At the moment it make no sense to try do this concurrently.

I think they should:
I don’t see extra IO caused by direct deletes; it is the other way around: using Bloom filters as the primary mechanism for deleting files causes the extra IO and the extra resources required for processing the Bloom filter, retaining, and deleting. And at least the current version of the node software does not care whether the node is currently loaded or not; it processes the Bloom filter regardless of “important traffic” anyway.

But direct deletes have more than one advantage. (By direct deletes I mean telling the node what to delete.)

As with pieces that expire, you would not need a trash folder. So even if “important traffic” were impacted a little, you would save a lot of the unpaid time that garbage remains on the node during Bloom filter creation, transmission, retaining, sitting in trash, and finally trash cleanup.

The second big plus is that we had direct deletes before, and we did not have any of the frequent issues we read about almost every day, like mismatches between satellite and node data or payout questions. Many of the issues we are seeing today are a result of moving from direct deletes to indirect deletion through Bloom filters.

The third big plus is that there is less garbage on the nodes.

I just don’t think throwing more resources at the nodes is the right way. I read a lot lately that you should add more RAM, add an SSD, cache this, cache that, change the filesystem. That is the Filecoin style, not the Storj idea. But it clearly shows that the system is not working efficiently.
And the thing is, you cannot always upgrade the hardware even if you want to, as with the Odroid HC2.

I am pretty sure the system would handle deletes better as they come in, randomly and spread out over time, than the constant 24/7 processing of Bloom filters.

But this is not how it is done today. The current implementation does not take the load into account. And that is not an argument against direct deletes. By direct deletes I mean telling the node what to delete instead of telling it what to keep. We can talk about how the node should handle that.

While I believe the majority of pieces should be deleted right away, you could also pack the information about what to delete into the expiry database, or send deletion logs to the nodes hourly and let them work through those. I agree that with such an async operation, an option to run it in a “lazy” fashion should be offered.
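
A minimal sketch of what I mean, with names and structure entirely made up for illustration: deletion requests land in a queue similar to the expiry database and get worked off in small batches, in a “lazy” fashion that yields to customer traffic:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// deleteQueue stands in for something like the expiry database: a list of
// piece IDs the satellite asked this node to delete.
type deleteQueue struct {
	mu    sync.Mutex
	queue []string
}

func (q *deleteQueue) Enqueue(pieceID string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.queue = append(q.queue, pieceID)
}

// DrainLazily deletes queued pieces in small batches with pauses in between,
// so the work yields to uploads and downloads instead of hammering the disk.
func (q *deleteQueue) DrainLazily(batch int, pause time.Duration, deleteFn func(string) error) {
	for {
		q.mu.Lock()
		n := batch
		if n > len(q.queue) {
			n = len(q.queue)
		}
		work := append([]string(nil), q.queue[:n]...)
		q.queue = q.queue[n:]
		q.mu.Unlock()

		if len(work) == 0 {
			return
		}
		for _, id := range work {
			if err := deleteFn(id); err != nil {
				// Anything that fails here is exactly what the occasional
				// Bloom filter would still clean up later.
				fmt.Println("delete failed, leaving for garbage collection:", id)
			}
		}
		time.Sleep(pause)
	}
}

func main() {
	q := &deleteQueue{}
	q.Enqueue("piece-A")
	q.Enqueue("piece-B")
	q.DrainLazily(1, 10*time.Millisecond, func(id string) error {
		fmt.Println("deleting", id)
		return nil
	})
}
```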

2 Likes

But they should not be the primary, or even the only, way of deleting.
Bloom filters should be used only for garbage collection, for issues like this one:

For such cases, Bloom filters are indeed a good thing. For general deletions, I believe they are not, and I believe they are inefficient. Storj even admitted that when the test data was uploaded with a TTL, because it would have taken too long to clear it from the network again with Bloom filters.

And I think it is obvious: imagine a node with 15 million pieces and you want to delete one piece. What is more efficient, to tell the node to delete that specific piece, or to scan through 14,999,999 pieces to find the one that should not be there?
Maybe if you want to delete 14 million pieces from that node, it is the other way around. But I believe nodes tend to get bigger, not delete 99 percent of their pieces in one go.

There is also no guarantee that the node’s expiry database is not corrupted or has not been deleted.
There is also no guarantee that a node is available when a Bloom filter gets sent out.
I think these are similar cases, and we obviously don’t care about them.

I think we don’t do that for TTL pieces or for Bloom filters. So I don’t think it is a big issue.

But how big is this issue? Storj is all about the availability of pieces, i.e. nodes. At any given time a customer must be able to download their files. This means that the majority of nodes are online and pieces are available at any given time; if that were not the case, the whole Storj idea would not work. If pieces are available and online, you can delete them. The minimum we see is that 29 pieces out of 80 are required, which means that at least a good third of pieces would be available for immediate deletion at any given time, probably many more. So I am not sure the question of availability for immediate deletion is a good reason not to do it. And I need to repeat that the idea of indirect deletes causes a lot of secondary issues that we see throughout the forum every day.

My idea of how it should work is as follows:

  1. The primary way of deletion should always be an attempt to delete the piece from the node immediately, directly as “tell the node which piece to delete”, and as close to real-time as possible → this would mean you could delete a minimum of roughly 30% right away.
  2. If a node is not available when the deletion request is sent, send it a deletion log for later processing on an hourly or daily basis. This would probably catch the majority of nodes that were offline when the deletion request was issued. One idea that came to my mind is that you could make the logs available for download instead of sending them, so that a node can download them when it is available again, which means you don’t have to retry the deletion requests.
  3. As a last step, a Bloom filter, maybe once every 31 days (because a node must not be offline for longer than 30 days), should be issued only to catch leftovers, like those from the bug mentioned above. That is the only remaining case for indirect deletion: clearing out whatever wasn’t caught before. (A sketch of this fallback order follows after the list.)
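
Here is a purely hypothetical sketch of that fallback order; the helper functions and the RPC are invented for illustration and do not exist in the real codebase. Try the direct delete first, queue a per-node deletion log entry if the node is offline, and leave everything else to the periodic Bloom filter:

```go
package main

import (
	"errors"
	"fmt"
)

var errNodeOffline = errors.New("node offline")

// dialAndDelete is a placeholder for whatever RPC the satellite would use to
// ask a node to delete one piece right away.
func dialAndDelete(nodeID, pieceID string) error {
	// ... contact the node, wait for an acknowledgement ...
	return errNodeOffline // pretend the node is unreachable
}

// appendToDeletionLog stands in for writing the piece into an hourly/daily
// per-node log that the node can fetch when it comes back online.
func appendToDeletionLog(nodeID, pieceID string) {
	fmt.Printf("queued %s for offline node %s\n", pieceID, nodeID)
}

// deletePiece covers tiers 1 and 2; tier 3 (the periodic Bloom filter, e.g.
// every 31 days) is not shown and would sweep up whatever both tiers missed.
func deletePiece(nodeID, pieceID string) {
	if err := dialAndDelete(nodeID, pieceID); err == nil {
		fmt.Printf("deleted %s on %s directly\n", pieceID, nodeID)
		return
	}
	appendToDeletionLog(nodeID, pieceID)
}

func main() {
	deletePiece("node-1", "piece-A")
}
```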

Here is my vision of how I think the process should look:

  1. The customer wants to delete a file. They send a deletion request to the satellite and receive a list of IPs where the pieces are stored.
  2. The customer’s uplink sends out deletion requests to one or more nodes in parallel.
  3. Nodes delete their piece on receiving the deletion signal, acknowledge it to the customer’s uplink, and send a deletion confirmation to the satellite so the satellite can track progress.
  4. For the customer, the file is considered “deleted” as soon as the first acknowledgement is received, while the process continues in the background. So the customer is happy, as deletions appear to be real-time.
  5. In the background, the ongoing deletion could be done in two ways:
  • The satellite could take care of the process and tell the nodes what to delete. This could be done by sending the node a request to delete a specific piece, or by sending log files. I have the idea that you could reserve some tiny space for each node on the Storj network where the satellite could upload such information and the node could download it from. This way the satellite would not have to send out multiple retries.
  • The other idea would be to do deletions peer to peer, where the deletion request gets passed from one node to the next set of nodes (see the sketch after this list). I don’t know what this concept is called, but it works like this: the first node has the full list of pieces and IPs. It deletes what it has, divides the list into two lists, and sends them out to two other nodes. These do the same (deleting, dividing, sending), and this gets repeated again and again. The idea is a geometric progression (1, 2, 4, 8, 16, 32, …, powers of 2 or any other base) of nodes doing their deletions, so that even large files with a huge number of distributed pieces could be deleted quickly without putting the burden of contacting all nodes on one single node.
  6. During that process, the satellite should receive a deletion confirmation for every completed deletion (similar to the orders we have today) so it can keep track of which pieces are gone and which pieces are left for whatever reason.
  7. At the end, a Bloom filter is still created and sent out periodically, i.e. every 31 days, to clear out anything that is still left.
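
And here is a toy model of that fan-out idea from point 5. It is entirely a sketch, not an existing protocol; in a real system each recursive call would be an RPC to the first node of the forwarded half-list rather than a local function call:

```go
package main

import "fmt"

type target struct {
	nodeAddr string
	pieceID  string
}

// fanOutDelete simulates the fan-out locally: the first node in the list
// deletes its own piece, then splits the remainder in two and forwards each
// half, so the number of working nodes roughly doubles with every hop.
func fanOutDelete(targets []target, hop int) {
	if len(targets) == 0 {
		return
	}
	// The receiving node deletes its own piece first...
	fmt.Printf("hop %d: %s deletes %s\n", hop, targets[0].nodeAddr, targets[0].pieceID)

	// ...then splits the rest of the list and forwards each half.
	rest := targets[1:]
	mid := len(rest) / 2
	fanOutDelete(rest[:mid], hop+1)
	fanOutDelete(rest[mid:], hop+1)
}

func main() {
	var ts []target
	for i := 1; i <= 7; i++ {
		ts = append(ts, target{fmt.Sprintf("node-%d", i), fmt.Sprintf("piece-%d", i)})
	}
	fanOutDelete(ts, 0)
}
```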

When I see all the problems on my nodes and in the forum, and imagine that nodes will grow even larger in the future, I am very much convinced that using Bloom filters for regular deletes is not the best choice. It seems to cause more problems than it solves.

The reason I am bringing this up is that I see my nodes constantly loaded processing a single Bloom filter, deletions can take a week or more while the disks work 24/7, and I still see things like

ls config/retain
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1722448799995125000.pb  ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1722880799999530000.pb
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1722802182043487000.pb  v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1722967199982862000.pb
qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa-1723053598309773000.pb

or

ls /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa
2024-07-05  2024-07-13  2024-07-19  2024-07-25  2024-07-31  2024-08-06 2024-08-12
2024-07-06  2024-07-14  2024-07-20  2024-07-26  2024-08-01  2024-08-07 2024-08-13
2024-07-07  2024-07-15  2024-07-21  2024-07-27  2024-08-02  2024-08-08
2024-07-08  2024-07-16  2024-07-22  2024-07-28  2024-08-03  2024-08-09
2024-07-09  2024-07-17  2024-07-23  2024-07-29  2024-08-04  2024-08-10
2024-07-12  2024-07-18  2024-07-24  2024-07-30  2024-08-05  2024-08-11

where trash from almost six weeks ago is still left.

This needs to be done in any case, because it would also help the other filewalkers and overall system performance.

There are also more ideas to further improve Bloom filters and how pieces are deleted.
Of course, such optimizations should be pursued as well.

6 Likes

Here is how I think we are currently operating in a humorous tale created with the help of AI:

A Day at McDonald’s: A Tale of Chaos and Cheeseburgers

A hungry customer walked into McDonald’s, his stomach growling with anticipation. He approached the kiosk, his fingers flying across the screen as he ordered a simple cheeseburger. Little did he know, he was about to embark on a wild adventure.

As he waited, more and more customers randomly flooded in, all craving their own cheeseburgers. The kiosk beeped and booped, spitting out order after order, each one piling up on the kitchen counter like a digital snowdrift.

Meanwhile, in the kitchen, the employees were in a state of suspended animation, waiting for the signal to start cooking. The orders stacked up in hundreds, a towering monument to the power of hunger and impatience.

Then, like a starting gun, the manager yelled, “Go for it, guys!!!” The employees sprang to life, their movements a blur as they frantically prepared the orders.

But there was a catch. Instead of writing down what the customers wanted, the orders listed what they didn’t want. The customer’s order read like a culinary laundry list: “No Big Mac, no Coke, no Fries, no Quarterpounder, no Hamburger…” The employee’s eyes scanned the list, his brow furrowed in confusion, until he finally figured it out: “Ah, this customer wants a cheeseburger!”

As the kitchen erupted into a frenzy of sizzling patties and flying buns, the customer waited patiently, his stomach growling louder by the minute. The restaurant was packed, the air thick with the smell of grease and anticipation.

The employees worked at a fever pitch, their movements a choreographed dance of burger preparation. Buns were toasted, patties were flipped, cheese was melted, and pickles were applied with military precision.

Finally, after what seemed like an eternity, the customer’s cheeseburger was ready. The employee proudly presented it to him, a golden-brown masterpiece of culinary art.

The customer took a bite, his eyes widening in delight. “While it took some time, this is the best cheeseburger I’ve ever had!” he exclaimed.

The employee beamed with pride, his exhaustion forgotten in the face of a job well done. “Glad you like it, sir,” he said, wiping the sweat from his brow. As he turned to his manager, he whispered, “You know, boss, couldn’t we just run the restaurant like a normal place? You know, take orders, cook food, serve customers? This whole ‘wait for the signal’ thing is killing me.”

The manager just chuckled and patted him on the back. “That’s just the way we do things around here, kid. Now get back to work and start prepping for the lunch rush!”

This creates a high load on the satellite and also slows down the client. We moved to async deletes exactly because of that.

This is not true. Search for -gc; you will see that it uses the lazy mode, if it’s enabled.

It heavily impacted the response time of the satellite (because it had to use synchronous calls to the nodes) and of the client waiting for the confirmation that pieces were deleted. We even had timeouts on a noticeable share of requests.

Direct deletion is not practical: it offers some benefits for nodes in exchange for a large impact on clients in terms of response time, and when a timeout occurs, some objects will not be deleted.

I think this part would be a good solution: maybe the satellite just publishes a list of deletes, and the node grabs it once or twice a day, performs the deletes, and reports what was deleted. The client side can work as it does today. One problem: we would need to embed the storj client into the node so that the node can fetch files from the network itself.

1 Like

You are taking my statement out of context. I was saying the opposite. With the recent improvements I am willing to give up on direct deletes. Especially the new badger cache seems to make garbage collection fast enough on my nodes.

I don’t see what exactly causes the high load. Maybe it is a misunderstanding; let’s try to resolve that.
First of all, I am under the impression that, compared to the times you are mentioning, the satellite infrastructure has improved, so maybe it would handle it better today.
Second, it is hard to imagine that the satellites can handle millions of requests for uploads, downloads, etc., but cannot send out requests for deletion.
But even if that is the case, what I mean by direct deletes does not necessarily mean real-time. I would see that as the optimal solution, but what I want to emphasize with direct deletes is telling the node exactly which piece to delete, instead of having it scan all pieces to find the one that should not be there.
It would astound me if the satellites could not provide that information.
Third, my understanding is that creating the Bloom filters is itself a heavy process. If I remember correctly, in the past we were always told that sending Bloom filters more frequently is not possible because it is too heavy to run the required processes more often.
I can remember that there were issues with the client when direct deletes were in place; however, if you read my suggestion, it makes clear that I am totally fine with an async operation, so that the client receives the delete confirmation as quickly as possible and the process goes on in the background. So nothing would change there.

Yes, you are right. I forgot that the process on the node can make use of the lazy filewalker if it is not disabled. I have said that this is an option for async operation that should be available. But that’s not an argument against direct deletes. Direct deletes should still result in fewer Bloom filter creations, less retaining, less garbage, less trash, and faster deletion.

But why? You are basically saying the system can handle uploads and downloads in real-time but not a deletion request. This does not make sense to me, because I don’t see the difference from an upload or download request, and satellites receive millions of those and can obviously handle them.
I also don’t see why the satellites or the client would have to wait for anything; as said, direct delete does not necessarily mean real-time:
The customer sends a deletion request to the satellite. For the customer side we could stop here: as soon as the request for deletion is received, we can signal the customer that the file is deleted. The process goes on in the background, carried out by the satellite.
However, as an idea to reduce load on the satellite, the satellite could, similar to upload or download requests, provide a list of IPs and pieces to the client. With that list, the client can contact the nodes directly with the deletion request. Again, it is my understanding that this is exactly what we already do when a client wants to upload or download.
Even here I think we don’t have to wait for anything, as long as we can make sure that the process goes on in the background. It would be enough for the client to send out the deletion request and for the node to acknowledge that it has received the request and put it in a queue (the expiry database). From there it can fulfill the request independently, exactly like it deletes pieces that have their expiry stored in the expiry database. For that we also do not need any client connection.
The same goes for deletion requests that come from the satellite: we do not have to wait until the piece is finally deleted, we only transmit the request to the node.
When the node finally deletes it, it should create an order file and send that to the satellite. This serves as a receipt that the piece has been deleted; the satellite can then mark it as gone in the database.
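
As a sketch of what such an order/receipt could look like (the structure, field names, and flow are assumptions for illustration, not Storj’s actual order protocol):

```go
package main

import (
	"fmt"
	"time"
)

// DeletionReceipt is an assumed structure; the field names are illustrative only.
type DeletionReceipt struct {
	PieceID   string
	NodeID    string
	DeletedAt time.Time
	Signature []byte // node's signature over the receipt (omitted in this sketch)
}

// deleteAndReceipt deletes the piece via deleteFn and, on success, produces a
// receipt the node would send to the satellite.
func deleteAndReceipt(nodeID, pieceID string, deleteFn func(string) error) (*DeletionReceipt, error) {
	if err := deleteFn(pieceID); err != nil {
		return nil, err // piece stays queued; a later Bloom filter is the backstop
	}
	return &DeletionReceipt{
		PieceID:   pieceID,
		NodeID:    nodeID,
		DeletedAt: time.Now(),
	}, nil
}

func main() {
	r, err := deleteAndReceipt("node-1", "piece-A", func(id string) error {
		fmt.Println("removing", id, "from disk")
		return nil
	})
	if err == nil {
		fmt.Printf("send receipt to satellite: %+v\n", *r)
	}
}
```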

At least this is the way I see it.

That’s directly related to the unpaid-trash discussion. Of course, one benefit would be reducing the large amounts of unpaid garbage. Imagine Storj had to pay $10/TB for that; I am sure a different solution would be found very quickly.

No, I don’t think so. The context was the question of whether direct deletes cause the same amount of work as the usage of Bloom filters. And obviously, you have criticized in the past that Bloom filters consume a lot more IOPS. So Bloom filters seem to have that as a general disadvantage compared to direct deletes. That’s also why you are trying to mitigate it with the new cache. However, this does not change the inherent disadvantage of Bloom filters, which was the point I was trying to make.

5 Likes