How bloom filters work

Alexey · January 28, 2024, 5:42am

This is a side effect, not the goal. Technically you can remove it, but if there would be need to restore it - your node will be disqualified.

snorkel · January 28, 2024, 6:03am

What’s the difference if the piece remains in place or is moved to trash? It is still there and can be used. If the sat needs it, the node will find it, remove the expiration time and the piece remains.
For the payment part… the node would know to make the difference; it will count the good pieces in “storage used” and marked pieces in “trash”. So it will show you on the dashboard both, like now. The satellites know what to pay for, like now. You won’t see them in different folders like now, but who realy stais and counts files in Trash anyways? And even now we have a big difference between sats and nodes, and we are not payed for the difference?
So what will change in the payment scheme? Nothing!

Alexey · January 28, 2024, 6:17am

You need to store the expiration date somewhere. The moving to the trash is an easy one - you may just check the last modification date and compare with the current date, fast and simple.
Of course, you may also use a header of each piece, but this is mean, that each piece must be modified, but since it has a signature - this invalidate it. The other way is to store in the database… Ok, and if the database is corrupted, these pieces will be never removed.
So, the chore for the Trash is another trade off between the durability and speed.

Toyoo · January 28, 2024, 10:54pm

In memory. When a node receives a bloom filter, the node is supposed to deal with it with no delay. And as soon as pieces were scanned and moved to trash, the bloom filter is no longer useful. The issue is that this design did not take into account slow storage, and hence the ticket quoted by @jammerdan.

Moving to trash is exactly a way of marking “for deletion”. It has the nice features of (1) being simple, (2) being guaranteed to work atomically, (3) disabling access to the file, (4) with no performance impact on download requests, (5) being guaranted to work on all file systems. It is indeed kinda slow though.

You cannot assume inodes have a way to store additional information (usually called extended attributes), because not all file systems support them. And of those file systems that do, sometimes they come with performance overhead on each file access, not just trash handling.

But there are some other possibilities which could be explored by courageous storage node operators with coding skills.

pangolin · May 25, 2024, 4:22pm

I am still not sure what information the bloom filter holds. Is it segments or pieces? If it is segments, doesn’t this require to read all piece headers to check against the filter?

BrightSilence · May 25, 2024, 4:58pm

It doesn’t contain either, but the filter is matched against pieces.

Mitsos · May 25, 2024, 5:07pm

In simple terms it’s something like this:
Your node stores pieces 1-100.
Bloom filter says: “Keep every piece starting with 1, 2 and/or 3.”
Your node keeps pieces 1,2,3,10-39, 100. It deletes (moves to trash actually) every piece 40-99 + 4-9.

Edit: too much coffee today

pangolin · May 25, 2024, 5:15pm

I am knowing how a bloom filter works in general. But my question is still not answered.

What data is used to build the storj bloom filter? What is 1…100? Is it piece IDs?

Mitsos · May 25, 2024, 5:18pm

Yes, piece IDs. The satellite sends a list of pieces your node should keep. The downside is that it needs to lean on the safe side, so the bloom filter contains more files than should be kept, otherwise you run the risk of deleting something that you shouldn’t have.
This gets better with the new bigger blooms.

pangolin · May 25, 2024, 5:26pm

Piece IDs obviously making sense.

But there is another thread where a storjling said satellites don’t have a database of piece IDs…

Toyoo · May 25, 2024, 5:36pm

Piece IDs are constructed from segment IDs in a simple deterministic way, so they don’t need to be stored explicitly.

pangolin · May 25, 2024, 5:49pm

This was the missing piece in the puzzle.

jammerdan · June 9, 2024, 11:53am

I cannot guarantee this is correct, because this is what an AI-bot told me about Bloom filters:

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It’s a clever way to quickly determine whether an element is present in a large dataset without having to store the entire dataset.

Here’s how it works:

A Bloom filter consists of an array of bits, initially set to 0.

When an element is added to the filter, it’s hashed multiple times (using different hash functions) to generate multiple indices in the array.

The bits at these indices are set to 1.

When querying the filter to see if an element is present, the same hash functions are used to generate indices.

If all the bits at these indices are 1, it’s likely that the element is present in the set (but there’s a small chance of false positives).

If any of the bits are 0, the element is definitely not present in the set.

Bloom filters have several advantages:

They’re very space-efficient, as they only require a small amount of memory to store the bit array.

They’re fast, with constant-time lookup and insertion operations.

They’re probabilistic, meaning they can be tuned to trade off between false positives and space efficiency.

However, Bloom filters also have some limitations:

They can’t remove elements from the set (once an element is added, it’s there forever).

They can’t tell you whether an element is definitely present or not (only whether it’s likely present or definitely not present).

Bloom filters are commonly used in various applications, such as:

Caching: to quickly determine whether a cache hit or miss

Network routing: to quickly determine whether a packet should be forwarded or dropped

Database query optimization: to quickly determine whether a query can be answered from a cache or needs to access the underlying database