By the calculator: the satellite expects 5M pieces, so it will generate a filter with 24M bits (3 MB), which is well within the current maximum size of a bloom filter. Then, for each unexpected piece tested, there is a 10% chance that it will still survive, so on average you would expect 1.5M pieces to survive despite not being expected. The next bloom filter will reduce this number down to 150k pieces.
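If you want to reproduce that arithmetic, here is a short Python sketch using the textbook bloom filter sizing formulas; the 5M piece count and the 10% false-positive rate are the numbers quoted above, everything else is just the standard math, not anything taken from Storj's code:

```python
import math

n = 5_000_000   # pieces the satellite expects the node to keep
p = 0.1         # target false-positive rate (chance an unexpected piece survives)

# Standard bloom filter sizing: m bits and k hash functions
m = -n * math.log(p) / math.log(2) ** 2
k = (m / n) * math.log(2)

print(f"bits: {m / 1e6:.1f}M, size: {m / 8 / 1e6:.1f} MB, hashes: {k:.1f}")
# -> roughly 24.0M bits, 3.0 MB, about 3 hash functions

# Each pass keeps roughly p of the remaining unexpected pieces,
# so a second filter keeps roughly p**2 of them, and so on.
```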
Consider a simple exercise: you have a dictionary of random English words, and you want to construct a “bloom” filter that only selects the following words: abstract, aghast, brass, chart, grammar, mascara, scarf, straw, swatch, tasty, tray, warm, wrath, yacht. So you make the following filter: keep only words that do not contain any of the letters eiklnpou. Each letter is a yes/no question to test for, and you only keep a word if it doesn’t fail any of the tests.
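Purely as an illustration (my own sketch, not anything Storj does), this is how such a letter filter could be derived and applied in Python; note that eiklnpou is just a subset of the letters that none of the keep-words use:

```python
keep = ["abstract", "aghast", "brass", "chart", "grammar", "mascara", "scarf",
        "straw", "swatch", "tasty", "tray", "warm", "wrath", "yacht"]

# Letters that appear in none of the keep-words; each one is a potential
# yes/no question ("does the word contain this letter?") that every
# keep-word is guaranteed to pass.
unused = sorted(set("abcdefghijklmnopqrstuvwxyz") - set("".join(keep)))
print("".join(unused))        # deijklnopquvxz -- 'eiklnpou' is a subset of these

def survives(word, forbidden="eiklnpou"):
    # A word is removed as soon as it fails any single question.
    return not any(letter in word for letter in forbidden)

print(all(survives(w) for w in keep))   # True: every keep-word survives
```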
The larger the dictionary you start from, the more words you remove, because any random word outside of the list of words to keep is likely to fail at least one of the questions, i.e. to contain at least one of the letters eiklnpou. You will obviously sometimes find an unexpected word that by chance also does not contain any of them, but on average the chance is the same for each word, whether you start with a longer or shorter list.
My local /usr/share/dict/american-english has 104334 entries, and only 1727 would pass all these tests:
% wc -l /usr/share/dict/american-english
104334
% </usr/share/dict/american-english grep -v '[eiklnpou]' | wc -l
1727
If I had a smaller dictionary with, let’s say, 10433 random words (10%), then I get proportionally fewer words matching the tests as well:
% shuf /usr/share/dict/american-english | head -10433 | grep -v '[eiklnpou]' | wc -l
179
(I’m getting values from 160 to 196)
Ten percent of the full dictionary’s 1727 matches is about 173, right in that range, so I would also fully expect a bigger dictionary to have proportionally more words matching.
Now, any observation that strays from this theoretical exposition would mean Storj’s implementation is at fault. For example, not having the right questions to ask. If you wanted to keep a list of the one thousand most frequently used English words, a naïve letter-based filter would not be enough, because you cannot construct a yes/no question that other words would fail: the words you want to keep use all 26 letters commonly used in English. You need a better, bigger set of characteristics to filter by (maybe syllables, or trigrams).
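As a sketch of that idea (purely illustrative, with hypothetical file names), the same construction generalizes from single letters to any feature set, for example trigrams: the questions become “does the word contain a trigram that appears in no keep-word?”

```python
def trigrams(word):
    # All 3-letter substrings, e.g. "chart" -> {"cha", "har", "art"}
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_filter(keep, features=trigrams):
    # The questions are all features that appear in no keep-word;
    # a word survives only if it contains none of them.
    allowed = set().union(*(features(w) for w in keep))
    return lambda word: features(word) <= allowed

# Hypothetical usage: "common-words.txt" stands in for a list of the
# one thousand most common words, one per line.
# keep = open("common-words.txt").read().split()
# survives = build_filter(keep)
# kept = [w for w in open("/usr/share/dict/american-english").read().split()
#         if survives(w)]
```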
Bloom filters are just a way of constructing those questions with good mathematical properties (like being very particular about which pieces they choose) while keeping the filter small. But the principle is the same: the satellite sends a list of yes/no questions, and any piece that fails at least one of the questions is to be removed. The bigger the filter, the better the questions it can ask.
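To make that concrete, here is a minimal, self-contained bloom filter in Python. It is an illustration of the general technique, not Storj’s actual implementation: `add` plays the satellite’s role of marking every expected piece, and `might_contain` is the node-side list of yes/no questions, where a single unset bit means the piece fails and gets deleted.

```python
import hashlib, math

class BloomFilter:
    def __init__(self, n_expected, false_positive_rate):
        # Size the bit array and hash count for the expected load.
        self.m = max(1, round(-n_expected * math.log(false_positive_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_expected * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from hashes of the item.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes):            # satellite: mark every expected piece
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes):  # node: one failed question -> delete the piece
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter(n_expected=1000, false_positive_rate=0.1)
for i in range(1000):
    bf.add(f"expected-piece-{i}".encode())

# No false negatives: every expected piece survives.
print(all(bf.might_contain(f"expected-piece-{i}".encode()) for i in range(1000)))
# Unexpected pieces survive at roughly the false-positive rate (~0.1).
print(sum(bf.might_contain(f"garbage-{i}".encode()) for i in range(10_000)) / 10_000)
```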