Drop Bloom filters, bring back direct deletes

But they should not be the primary, let alone the only, way of deleting pieces.
Bloom filters should be used only for garbage collection, i.e. for issues like this one:

For such cases Bloom filters are indeed a good thing. For general deletions I believe they are not, and that they are inefficient. Storj even admitted as much when the test data was uploaded with a TTL, as clearing it from the network again with Bloom filters would have taken too long.

And I think it is obvious: imagine a node with 15 million pieces from which you want to delete a single piece. What is more efficient: telling the node to delete that specific piece, or scanning through 14,999,999 pieces to find the one that should not be there?
Maybe if you wanted to delete 14 million pieces from that node it would be the other way around, but I believe nodes tend to get bigger, not to delete 99 percent of their data in one go.
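
To illustrate the difference, here is a rough sketch in Go (not actual storagenode code; every type and function name here is a made-up stand-in). The direct delete touches exactly the one piece it was told about, while the garbage-collection pass has to check every stored piece against the Bloom filter, no matter how few pieces actually need to go:

```go
// Rough sketch (not actual storagenode code) contrasting the two deletion
// paths: deleting one known piece directly vs. scanning everything against
// a Bloom filter. All types and names here are hypothetical stand-ins.
package main

import "fmt"

type PieceID string

// BloomFilter stands in for the probabilistic filter the satellite sends.
type BloomFilter interface {
	Contains(PieceID) bool
}

// keepNothing is a trivial filter that retains no piece, just for the demo.
type keepNothing struct{}

func (keepNothing) Contains(PieceID) bool { return false }

// Store models a node's piece store as a simple map.
type Store struct {
	pieces map[PieceID][]byte
}

// DeleteDirect removes exactly one named piece: O(1) work, one disk operation.
func (s *Store) DeleteDirect(id PieceID) {
	delete(s.pieces, id)
}

// GarbageCollect must visit every stored piece to find the ones that are no
// longer referenced: O(n) work even if only a single piece should go.
func (s *Store) GarbageCollect(keep BloomFilter) (removed int) {
	for id := range s.pieces {
		if !keep.Contains(id) {
			delete(s.pieces, id)
			removed++
		}
	}
	return removed
}

func main() {
	s := &Store{pieces: map[PieceID][]byte{"a": nil, "b": nil, "c": nil}}
	s.DeleteDirect("a")                          // touches one entry
	fmt.Println(s.GarbageCollect(keepNothing{})) // touches all remaining entries: prints 2
}
```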

There is also no guarantee that the node's expiry database is not corrupted or has been deleted.
There is also no guarantee that a node is available when a Bloom filter gets sent out.
I think these are similar cases, and we obviously don't care about them there.

I think we don’t do that for TTL pieces or for Bloom filters. So I don’t think it is a big issue.

But how big is this issue really? Storj is all about the availability of pieces, i.e. of nodes. At any given time a customer must be able to download their files. This means that the majority of nodes are online and pieces are available at any given time; if that were not the case, the whole Storj idea would not work. If pieces are available and online, you can delete them. The minimum we see is that 29 pieces out of 80 are required, which means a good third of the pieces (29/80 ≈ 36%) would be available for immediate deletion at any given time, probably even more. So I am not sure the question of availability for immediate deletion is a good reason not to do it. And I need to repeat that the idea of indirect deletes causes a lot of secondary issues that we see throughout the forum every day.

My idea of how it should work is as follows:

  1. The primary way of deletion should always be the attempt to delete the piece from the node immediately - directly, as in “tell the node which piece to delete”, and as close to real-time as possible → this would mean you could delete a minimum of roughly 30% right away.
  2. If a node is not available when the deletion request is sent, send it a deletion log for later processing on an hourly or daily basis. This would probably catch the majority of nodes that were offline when the deletion request was issued. One idea that came to my mind: you could make the logs available for download instead of sending them, so that a node can download them once it is available again, which means you don't have to retry the deletion requests.
  3. As a last step, a Bloom filter - maybe once every 31 days (because a node must not be offline longer than 30 days) - should be issued only to catch any leftovers, like the ones from the bug mentioned above. That is the only case where indirect deletion is needed: to clear out any leftovers that weren't caught before. (A rough sketch of this three-step order follows below.)
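
To make the three steps concrete, here is a minimal sketch of that priority order, assuming a hypothetical satellite-side API (none of these types or names are actual Storj code): try the direct delete first, queue the piece in a deletion log if the node is unreachable, and keep the periodic Bloom filter only as the final safety net:

```go
// Minimal sketch of the proposed order of deletion. All types, names and
// intervals here are hypothetical, not the actual satellite API.
package main

import (
	"errors"
	"fmt"
)

type NodeID string
type PieceID string

// errNodeOffline stands in for any failure to reach a node in real time.
var errNodeOffline = errors.New("node offline")

type Satellite struct {
	// deletionLog collects pieces that could not be deleted directly; nodes
	// would fetch (or be sent) their log on an hourly or daily schedule.
	deletionLog map[NodeID][]PieceID
}

// deleteDirect would open a connection and tell the node which piece to drop.
// Here it just pretends the node was unreachable.
func (s *Satellite) deleteDirect(node NodeID, piece PieceID) error {
	return errNodeOffline
}

// Delete tries the real-time path first and falls back to the deletion log.
func (s *Satellite) Delete(node NodeID, piece PieceID) {
	if err := s.deleteDirect(node, piece); err == nil {
		return // step 1: deleted immediately
	}
	// step 2: queued for later processing by the node.
	s.deletionLog[node] = append(s.deletionLog[node], piece)
}

// sendBloomFilters is step 3: a periodic safety net (e.g. every 31 days,
// since a node must not be offline longer than 30) that catches leftovers.
func (s *Satellite) sendBloomFilters() {
	// building and sending the filters is omitted in this sketch.
}

func main() {
	s := &Satellite{deletionLog: map[NodeID][]PieceID{}}
	s.Delete("node-1", "piece-42")
	fmt.Println(s.deletionLog["node-1"]) // [piece-42]
	s.sendBloomFilters()
}
```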

Here is my vision of how I think the process should look:

  1. The customer wants to delete a file. They send a deletion request to the satellite and receive a list of IPs where the pieces are stored.
  2. The customer's uplink sends out deletion requests to one or more nodes in parallel.
  3. Nodes delete their piece upon receiving the deletion signal, acknowledge it to the customer's uplink and send a deletion confirmation to the satellite so the satellite can track progress.
  4. For the customer, the file is considered “deleted” as soon as the first acknowledgement arrives, while the process goes on in the background. So the customer is happy, as deletions appear to be real-time.
  5. In the background, the ongoing deletion could be done in 2 ways:
  • The satellite could take care of the process and tell the nodes what to delete. This could be done by sending the node a request to delete a specific piece, or by sending log files. My idea is that you could reserve some tiny space for each node on the Storj network where the satellite could upload such information and the node could download it from. This way the satellite would not have to send out multiple retries.
  • The other idea would be to do deletions peer to peer, where the deletion request gets sent from one node to the next set of nodes. I don't know what this concept is called, but it works like this: the first node has the full list of pieces and IPs. It deletes what it has, divides the list into 2 lists and sends them out to 2 other nodes. These do the same (deleting, dividing, sending), and this gets repeated over and over. The idea is a geometric progression of nodes doing their deletions (1, 2, 4, 8, 16, 32, 64, … - powers of 2, or of any other value), so that even large files with a huge number of distributed pieces could be deleted fast without putting the burden of contacting all nodes on one single node (a rough sketch of this fan-out follows after the list).
  6. During that process, the satellite should receive a deletion confirmation for every completed deletion (similar to the orders we have today), so it can keep track of which pieces are gone and which pieces are left, for whatever reason.
  7. At the end, a Bloom filter gets created and sent out periodically, e.g. every 31 days, which clears out anything that is still left.
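
For the peer-to-peer variant from the second bullet, a rough sketch of the fan-out could look like this (everything here is hypothetical, and the network forwarding is simulated with a local call): each node deletes its own piece, splits the remaining list in two and passes each half on, so the number of nodes working per round doubles:

```go
// Rough sketch of the peer-to-peer fan-out: a node deletes its own piece,
// splits the remaining list in two and forwards each half to another node.
// Everything here is hypothetical; forwarding is simulated locally.
package main

import "fmt"

// Target pairs a node address with the piece it should delete.
type Target struct {
	Addr    string
	PieceID string
}

// forward stands in for sending a sub-list to the node at its head;
// in this sketch the "remote" node just runs fanOut locally.
func forward(targets []Target, depth int) {
	fanOut(targets, depth)
}

// fanOut lets the first node in the list delete its piece, then splits the
// rest into two halves and hands each half to another node. After k rounds
// up to 2^k nodes are deleting at the same time.
func fanOut(targets []Target, depth int) {
	if len(targets) == 0 {
		return
	}
	me, rest := targets[0], targets[1:]
	fmt.Printf("round %d: %s deletes %s\n", depth, me.Addr, me.PieceID)

	if len(rest) == 0 {
		return
	}
	half := len(rest) / 2
	forward(rest[:half], depth+1) // one half goes to one node
	forward(rest[half:], depth+1) // the other half to another node
}

func main() {
	var targets []Target
	for i := 0; i < 8; i++ {
		targets = append(targets, Target{
			Addr:    fmt.Sprintf("node-%d", i),
			PieceID: fmt.Sprintf("piece-%d", i),
		})
	}
	fanOut(targets, 0)
}
```

With the 8 example targets every node shows up exactly once, and the printed round number grows roughly like log2 of the list length, which is what makes this attractive for files with a huge number of pieces.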

When I see all the problems on my nodes and in the forum, and imagine that nodes will grow even larger in the future, I am very convinced that using Bloom filters for regular deletes is not the best choice. It seems to cause more problems.

The reason I am bringing this up is that I see my nodes constantly loaded: processing a single Bloom filter and its deletions can take up to a week or more while the disks are working 24/7, and still I see things like

ls config/retain
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1722448799995125000.pb  ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1722880799999530000.pb
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1722802182043487000.pb  v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1722967199982862000.pb
qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa-1723053598309773000.pb

or

ls /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa
2024-07-05  2024-07-13  2024-07-19  2024-07-25  2024-07-31  2024-08-06 2024-08-12
2024-07-06  2024-07-14  2024-07-20  2024-07-26  2024-08-01  2024-08-07 2024-08-13
2024-07-07  2024-07-15  2024-07-21  2024-07-27  2024-08-02  2024-08-08
2024-07-08  2024-07-16  2024-07-22  2024-07-28  2024-08-03  2024-08-09
2024-07-09  2024-07-17  2024-07-23  2024-07-29  2024-08-04  2024-08-10
2024-07-12  2024-07-18  2024-07-24  2024-07-30  2024-08-05  2024-08-11

where there is still trash left from almost 6 weeks ago.

This needs to be done in any case, because it would also help the other filewalkers and overall system performance.

There are also further ideas for improving Bloom filters and how pieces get deleted.
Of course, those optimizations should be pursued as well.
