For direct deletes we would need to keep a record of the pieceIDs that got removed from the network, and we would need to keep that entire list somewhere. That's a giant object. My last bloom filter deleted 1.3 million pieces on US1 alone. I don't know what size a single pieceID has, but a list with 1.3 million of them gets expensive.
A bloom filter does the opposite. Instead of tracking the pieces that need to get cleaned up, it tracks the pieces that the node has to keep. That doesn't need any extra data; it can work with the existing data in the database. On top of that, the bloom filter is just 2 MB in size in my case.
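To put rough numbers on that (my own back-of-the-envelope sketch, not anything from the satellite code): pieceIDs are 32 bytes, so a flat list of the 1.3 million deleted pieces would already be around 40 MB, while the textbook bloom filter sizing formula lands right around the 2 MB I'm seeing. The kept-piece count below is a hypothetical I picked to make the numbers line up:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Direct deletes: a flat list of removed pieceIDs (32 bytes each).
	deleted := 1_300_000.0
	fmt.Printf("flat list: %.1f MB\n", deleted*32/1e6) // ~41.6 MB

	// Bloom filter: textbook sizing m = -n*ln(p)/(ln 2)^2 bits for
	// n kept pieces at false-positive rate p.
	p := 0.10        // the ~10% match rate discussed later in this thread
	n := 3_500_000.0 // hypothetical count of pieces to keep, my assumption
	bits := -n * math.Log(p) / (math.Ln2 * math.Ln2)
	fmt.Printf("bloom filter: %.1f MB\n", bits/8/1e6) // ~2.1 MB
}
```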
Don't get me wrong, there is an argument for direct deletes, but more as an addition to bloom filters and never as a replacement. And that argument only works until server-side copy enters the picture. So with server-side copy in mind, bloom filters are the best and cheapest option. I am more than happy to store these server-side copy files on my storage node. More usage = more payout. I don't see the point in removing these files from the network just to make things work without bloom filters. First priority should be to make the customers happy, and on my node I will tolerate some trade-offs for that goal.
I think I missed something…
The bloom filter contains a list of 90% of the pieces that should be stored on the node.
The node receives it, retains those pieces, and the rest is sent to trash. But doesn't this trash also include the remaining 10% of good pieces?
So the nodes can delete 10% of good pieces too? What did I get wrong?
Bloom filters don't contain IDs. They are a kind of mathematical trick that enables incredible amounts of compression. The bloom filter is then matched against the pieces found on your node using that clever math, and it will by definition match 100% of the pieces you should keep. The downside is that it also matches about 10% of the pieces you shouldn't keep, but that 10% is what allows for the small size. So no, your node doesn't delete 10% of the pieces you should keep; it keeps all of those. But it also keeps about 10% of the pieces that actually should be deleted. Hope that helps. I feel like I'm not explaining it very well.
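For anyone who wants to see the "clever math" in action, here is a minimal sketch of how a bloom filter behaves. This is my simplified illustration, not the actual Storj implementation: the filter size, hash count, and salted-FNV hashing below are all assumptions for demonstration purposes.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	filterBits = 1 << 24 // ~2 MiB of bits; an assumed size, for illustration
	numHashes  = 3       // assumed hash count, also for illustration
)

type bloomFilter struct{ bits []byte }

func newBloomFilter() *bloomFilter {
	return &bloomFilter{bits: make([]byte, filterBits/8)}
}

// positions derives numHashes bit positions from a piece ID by salting
// a simple FNV hash; the real implementation uses different hashing.
func positions(pieceID []byte) []uint32 {
	pos := make([]uint32, numHashes)
	for i := range pos {
		h := fnv.New32a()
		h.Write([]byte{byte(i)}) // salt each hash differently
		h.Write(pieceID)
		pos[i] = h.Sum32() % filterBits
	}
	return pos
}

// Add marks all bit positions for a piece the node must keep.
func (f *bloomFilter) Add(pieceID []byte) {
	for _, p := range positions(pieceID) {
		f.bits[p/8] |= 1 << (p % 8)
	}
}

// MightContain never returns false for an added piece, but it can
// return true for a piece that was never added: that is the ~10% of
// garbage the node keeps until a later filter catches it.
func (f *bloomFilter) MightContain(pieceID []byte) bool {
	for _, p := range positions(pieceID) {
		if f.bits[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloomFilter()
	f.Add([]byte("piece-to-keep"))
	fmt.Println(f.MightContain([]byte("piece-to-keep"))) // always true
	fmt.Println(f.MightContain([]byte("deleted-piece"))) // almost always false
}
```

The key property is in MightContain: a piece that was added can never be reported as missing, so pieces you should keep are always safe. Only the reverse error, keeping some garbage, is possible.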
A database of all existing pieces. They shouldn’t need to store information on removed pieces.
32 bytes.
Each bloom filter has a chance of garbage-collecting each unused piece. So after a piece is removed, if it slips through one bloom filter (roughly a 10% chance per filter), the next one will most likely do the job.
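To quantify that (my arithmetic, assuming ~10% false positives per filter and independence between successive filters):

```go
package main

import "fmt"

func main() {
	// A removed piece survives k successive filters with probability
	// p^k, assuming p = 0.10 false positives and independent filters.
	p := 0.10
	survive := 1.0
	for k := 1; k <= 4; k++ {
		survive *= p
		fmt.Printf("after %d filter(s): %.4f%% of garbage remains\n", k, survive*100)
	}
}
```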
Bloom filters do not store information on which pieces should be deleted. Please refer to my post linked above for explanation.
In most scenarios, especially ones where you do not have enough memory to cache file metadata, batch removals are faster. Your node only needs to scan each subdirectory once, as opposed to walking them randomly.
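As a sketch of what that single pass looks like (my illustration, not the actual storagenode code): every blob subdirectory is read exactly once, and each piece is checked against the filter as it is encountered. The retain function and its mightContain callback are hypothetical names.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// retain walks the blob directory exactly once and reports every piece
// file that does not match the filter. The real node would decode the
// pieceID from the file and move non-matching pieces into the trash
// directory instead of printing.
func retain(blobDir string, mightContain func(path string) bool) error {
	return filepath.WalkDir(blobDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		if !mightContain(path) {
			fmt.Println("would trash:", path)
		}
		return nil
	})
}

func main() {
	// Keep nothing: every piece found under ./blobs would be trashed.
	_ = retain("./blobs", func(string) bool { return false })
}
```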
Well, this might be unfortunate timing, but they are working on upgrading the power connection to my building today and I'm going to have to take my nodes offline for a few hours. Hopefully I won't miss the resending of the bloom filters. But I guess there's always next week if I do.
Never mind, they didn't prepare the work well and had to postpone. I will stay online and keep an eye on whether I receive a bloom filter.
Both of which are entirely unreasonable for Storj. But let’s keep this thread on topic. We’re discussing important impactful issues here and Storj Labs is paying attention to this issue. I don’t want to waste their time by making them wade through off-topic comments and risk them losing interest in this thread.
So it looks like it started to work.
Trash down by 1 TB… Haven't checked the old 23 TB node yet; will do later today (again, the VPN between my locations isn't working :P).
@BrightSilence Fix is getting deployed now. We don’t have the time to resend the old bloom filter. The next bloom filter generation should finish later today. I will watch my node as well.
Not that much use in sending the old bloom filter then anyway. Does that mean nodes should also receive it later today? If so, I’ll check the logs again later today or tomorrow and report back.