And I am also considering creating a YouTube video to explain it.
But until then, here is some more context:
The Satellite (the metadata server) and the Storagenodes should agree on which pieces are stored.
There are multiple solutions for this. For example, in Apache Hadoop HDFS / Ozone the Storagenodes (they have different names there, but I use the Storj names here) report the stored pieces back to the metadata server.
This has hard scalability issues, as the reports can be very large. To fix this, they implemented incremental reports, which have their own problems…
Storj uses the opposite direction: the Satellite sends the list of the stored pieces to the Storagenodes, and all the pieces which are not in the list can be deleted.
But the full list would still be huge (like gigabytes). Instead of a huge list, Storj uses bloom filters, which are a probabilistic data structure.
It can categorize each piece as either:
surely can be deleted
definitely should be kept
The bloom filter is very small (like 1-2 MB), but in exchange it may miss some deletes (it is never wrong about the pieces which should be kept). But eventually all the garbage will be deleted: a piece missed by one filter is very likely caught by a later one. (0.5-1.5% overhead is possible, but seeing 7 TB used space vs 4 TB space reported by the satellite is a bug.)
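To make the direction of the guarantee a bit more concrete, here is a minimal sketch in Go (not the real Storj implementation; the filter size, hash count, and piece IDs are made up for illustration) of how a storage node could apply a Satellite-provided bloom filter: anything that does not match the filter is surely garbage and can be deleted, anything that matches is kept, including the occasional false positive that only goes away in a later run.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomFilter is a simple bit array with k hash functions (double hashing).
type bloomFilter struct {
	bits []byte
	m    uint64 // number of bits
	k    uint64 // number of hash functions
}

func newBloomFilter(mBits, k uint64) *bloomFilter {
	return &bloomFilter{bits: make([]byte, (mBits+7)/8), m: mBits, k: k}
}

// positions derives k bit positions for a piece ID from two FNV hashes.
func (b *bloomFilter) positions(pieceID string) []uint64 {
	h1 := fnv.New64a()
	h1.Write([]byte(pieceID))
	h2 := fnv.New64()
	h2.Write([]byte(pieceID))
	x, y := h1.Sum64(), h2.Sum64()|1 // odd step so the k positions differ
	pos := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		pos[i] = (x + i*y) % b.m
	}
	return pos
}

func (b *bloomFilter) add(pieceID string) {
	for _, p := range b.positions(pieceID) {
		b.bits[p/8] |= 1 << (p % 8)
	}
}

// contains can return true for a piece that was never added (false positive),
// but never returns false for a piece that was added (no false negatives).
func (b *bloomFilter) contains(pieceID string) bool {
	for _, p := range b.positions(pieceID) {
		if b.bits[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// Satellite side: build the filter from the pieces the node should keep.
	keep := newBloomFilter(1<<20, 7) // ~128 KB, 7 hashes -- illustrative numbers only
	for _, id := range []string{"piece-1", "piece-2", "piece-3"} {
		keep.add(id)
	}

	// Storagenode side: walk the locally stored pieces and delete everything
	// that is definitely not in the filter.
	for _, id := range []string{"piece-1", "piece-2", "piece-3", "old-piece-42"} {
		if keep.contains(id) {
			fmt.Println("keep  ", id) // never deletes a piece the Satellite wants kept
		} else {
			fmt.Println("delete", id) // surely garbage
		}
	}
}
```

As a rule of thumb, a bloom filter needs roughly 4.8 bits per entry for a ~10% false-positive rate (about double that for 1%), which is why a filter covering millions of pieces still fits into a couple of megabytes instead of gigabytes.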
An issue was just created to double-check the current behavior / parameters of the full bloom filter.