How Bloom filters work

Here's a simple explanation of Bloom filters that may help. Note that the following is not an exact description of Bloom filters as implemented in Storj, but rather a generic description with some guesses as to how it applies to the garbage collection process.

Imagine you have data about English words. Each file is an entry about one word: its definition, example usages, etymology, etc. My local dictionary lists 102,401 words like yodeling, lofts, particularly, fatigue, standing, contour, etc., and a single node may keep data about tens of words at a time, which is already a long list. Imagine calling a SNO over the phone and reading them 30 words.

Instead, you call your SNO and tell them they are supposed to have data only about words built from the letters “abcdeghiklmnoprstuvy”. Now you need to tell them at most 26 letters, regardless of how many words the SNO is expected to store! This is your Bloom filter. yodeling is fine, standing is also fine—these are words you expect them to have. But if, let's say, the SNO had the word fatigue, then because it contains the letter f, the SNO can remove it: you no longer expect that SNO to have any words with f in them. On the other hand, your SNO will think they still need to store the words particularly and contour. Last Friday you tried to call them to say you want those removed, but their phone was turned off.
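To make the analogy concrete, here's a tiny sketch of that “letter filter” in Go (just the analogy, not Storj code): the filter is the set of allowed letters, and a word is kept only if every one of its letters is in that set.

```go
package main

import (
	"fmt"
	"strings"
)

// allowed is the "filter": the letters used by the words the
// satellite still expects this node to keep (example from above).
const allowed = "abcdeghiklmnoprstuvy"

// keep reports whether the filter says the node may keep the word.
// A word passes only if every one of its letters appears in allowed.
func keep(word string) bool {
	for _, r := range word {
		if !strings.ContainsRune(allowed, r) {
			return false
		}
	}
	return true
}

func main() {
	for _, w := range []string{"yodeling", "standing", "fatigue", "particularly", "contour"} {
		fmt.Printf("%-13s keep=%v\n", w, keep(w))
	}
	// fatigue fails (no 'f' in the filter) and can be deleted.
	// particularly and contour pass even though they should be gone:
	// these are the false positives.
}
```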

Now, after a week, the SNO is asked to add three new words and remove two old ones. The list of letters is now “abcdeghiklmopqrstuvy”. contour no longer qualifies (the letter n disappeared), but particularly still fits. Since the list of letters didn't change much, the false positives will likely stay mostly the same.

Bloom filters are like that, except that instead of the letters of a word they use bins derived from cryptographic hashes, so you can have far more bins (not just 26 letters; more like thousands or even millions of different letters—imagine Chinese characters ;-)). The more bins, the fewer mistakes, but also the more data to send each time, so it's a trade-off. Storj chose the trade-off so that the rate of mistakes (false positives) is around 10%. Besides, your node is expected to stay online most of the time, so if there's garbage, it's only because the node was offline when the satellite tried to tell it to remove those pieces.
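Here's a rough sketch of how that looks with hashes instead of letters (my own illustration, not Storj's actual code): the satellite hashes each expected piece ID into one of many bins and sends the node the occupied bins; the node may only delete pieces whose bin is empty. The piece IDs and the bin count below are made up.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// numBins plays the role of the alphabet size: instead of 26 letters we
// use many hash-derived bins. This value is made up for the example.
const numBins = 1 << 20

// bin maps a piece ID to one bin, using a cryptographic hash so IDs are
// spread evenly. Real Bloom filters usually set several bins per element
// (see the next snippet); one is enough to show the idea.
func bin(pieceID string) uint32 {
	sum := sha256.Sum256([]byte(pieceID))
	return binary.BigEndian.Uint32(sum[:4]) % numBins
}

func main() {
	// The satellite builds the filter from the pieces the node should have...
	expected := []string{"piece-A", "piece-B", "piece-C"}
	filter := make([]bool, numBins)
	for _, id := range expected {
		filter[bin(id)] = true
	}

	// ...and the node checks every piece it actually holds against it.
	held := []string{"piece-A", "piece-B", "piece-C", "piece-X", "piece-Y"}
	for _, id := range held {
		if filter[bin(id)] {
			fmt.Println(id, "keep (expected, or a false positive)")
		} else {
			fmt.Println(id, "delete: definitely not expected anymore")
		}
	}
}
```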

What is specifically relevant to this thread is that if the SNO missed the deletion request for contour, this word will only get garbage-collected once the filter loses at least one of contour's letters. So if the SNO were expected to keep tour and constraint for the next ten years, they would never learn that contour was supposed to be removed.

There are some tricks to quickly compute which bins are required, and the node can store information on which bins correspond to which files so that removal is fast. There's also the trick of using multiple hashes instead of a single one. But I don't know enough details of Storj's implementation of Bloom filters, so I can't comment on that.
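For what it's worth, the textbook sizing for a Bloom filter with multiple hashes—again, not necessarily what Storj uses—comes from two formulas: m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions, for n elements and a target false-positive rate p. A quick back-of-the-envelope calculation for a hypothetical node:

```go
package main

import (
	"fmt"
	"math"
)

// size returns the textbook Bloom filter parameters for n elements and
// target false-positive rate p:
//   bits   m = -n * ln(p) / (ln 2)^2
//   hashes k = (m / n) * ln 2
func size(n int, p float64) (m, k int) {
	mf := -float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)
	kf := mf / float64(n) * math.Ln2
	return int(math.Ceil(mf)), int(math.Round(kf))
}

func main() {
	// Hypothetical example: a node expected to hold 10 million pieces,
	// with the ~10% false-positive rate mentioned above.
	m, k := size(10_000_000, 0.1)
	fmt.Printf("bits: %d (~%.1f MiB), hash functions: %d\n",
		m, float64(m)/8/1024/1024, k)
}
```

With those numbers the filter comes out to roughly 6 MiB and 3 hash functions, which gives a feel for why a higher false-positive rate keeps the filter small enough to send around.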
