Disk usage discrepancy?

This command will work only if you add all untrusted satellites to the untrusted list, as specified in the article. If you did not, you need to provide them directly in the command, like this:

You should not disable the filewalker, only the lazy mode.

Okay that seems to work for me:

I have enabled the filewalker again and hope that will help with the file count.

I guess I will have to keep watching to see if the used space decreases.

I know :wink:

I am also considering creating a YouTube video to explain it.

But until then, here is some more context:

The Satellite (metadata server) and the Storagenodes should agree on which pieces are stored.

There are multiple solutions for this. For example, with Apache Hadoop HDFS / Ozone, the Storagenodes (they have different names there, but I use the Storj names here) report the stored pieces back to the metadata server.

This has hard scalability issues, as the reports can be very large. To fix this, they implemented incremental reports, which have their own problems…

Storj uses the opposite direction: the Satellite sends the list of stored pieces to the Storagenodes, and all pieces which are not in the list can be deleted.

But the full list would still be huge (gigabytes). Instead of a huge list, Storj uses Bloom filters, which are a probabilistic data structure.

It can categorize each piece as:

  • surely can be deleted (the piece is definitely not in the filter)
  • should be kept (the piece matches the filter, which may occasionally be a false positive)

The Bloom filter is very small (around 1-2 MB), but in exchange it may miss some deletes (it is never wrong about the files which should be kept). Eventually, though, all the deleted files will be removed. (0.5-1.5% overhead is possible, but seeing 7 TB used space vs 4 TB reported by the satellite is a bug.)
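Roughly, the garbage-collection decision on the node looks like this. The sketch below is illustrative only: a toy Bloom filter built on salted FNV hashing with made-up piece IDs, not the actual storagenode code.

```go
// Illustrative sketch only: a toy Bloom-filter-based GC decision, not the
// actual storagenode code. Piece IDs are plain strings here for simplicity.
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomFilter is a tiny Bloom filter using k salted FNV-1a hash functions.
type bloomFilter struct {
	bits []bool
	k    int
}

func newBloomFilter(mBits, k int) *bloomFilter {
	return &bloomFilter{bits: make([]bool, mBits), k: k}
}

// positions derives k bit positions for a piece ID by salting the hash input.
func (b *bloomFilter) positions(pieceID string) []int {
	pos := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d:%s", i, pieceID)
		pos[i] = int(h.Sum64() % uint64(len(b.bits)))
	}
	return pos
}

// Add marks a piece the satellite still knows about ("keep this").
func (b *bloomFilter) Add(pieceID string) {
	for _, p := range b.positions(pieceID) {
		b.bits[p] = true
	}
}

// MightContain returns false only when the piece is definitely not in the
// filter; a true result may be a false positive.
func (b *bloomFilter) MightContain(pieceID string) bool {
	for _, p := range b.positions(pieceID) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	// The satellite builds the filter from the pieces it still tracks ...
	keep := newBloomFilter(1<<20, 7)
	keep.Add("piece-A")
	keep.Add("piece-B")

	// ... and the node walks its stored pieces: anything not matching the
	// filter is definitely no longer needed and can go to the trash.
	for _, stored := range []string{"piece-A", "piece-B", "piece-deleted-by-customer"} {
		if keep.MightContain(stored) {
			fmt.Println("keep (possibly a false positive):", stored)
		} else {
			fmt.Println("move to trash:", stored)
		}
	}
}
```

The key property is that MightContain can return a false positive (a deletable piece survives one more round), but never a false negative, which is why a piece that should be kept is never deleted by mistake.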

An issue was just created to double-check the current behavior / parameters of the full Bloom filter.

@Alexey

It needs to pass several GC filewalker runs, so it may take several weeks.

Just to understand: after having run the forget-satellite command, do I then need to wait a week for GC to come and clean it up? I still have not regained a good amount of space, so this is still an issue :slight_smile:

Hello elek!

We all know that satellites have performance problems.
What the reason is - the Storj architecture, the desire to save satellite resources, or both - unfortunately, this topic has not been covered.
Yes, this probably does not matter for us node operators, because before the Bloom filter there was no such thing.

To delete or not to delete - that is the question.
For the node operator this does not amount to 1%; in my case, for example, it is 15-20% on old nodes.

So saving satellite resources with the Bloom filter is good, but it seems that node operators do not receive payment for those extra pieces they store.

There are also a few more points that are not entirely clear to me.

  1. Why, after shutting down the satellites, did Storj not simply release a new version of the software that would remove any traces of the decommissioned satellites? As it is, it is monkey work for operators to go and check whether the directories have been deleted and to execute additional commands.

  2. Why does the filewalker always start walking from scratch - why can't it store its progress per satellite and resume not from the beginning, but from where it previously stopped?

  3. I think the Bloom filter has not passed the reality check, and we need to return to the previous deletion scheme, perhaps sending not single deletions but whole batches of deletions in one command. And it should be a transactional model, not a probabilistic one. Otherwise, unaccounted data on the nodes will only grow.

Unaccounted data is the word.
Do I understand correctly: a customer deletes data. Instead of it being deleted immediately and the space on the SNO's node being freed, it takes a week until the Bloom filter has been prepared and sent out. Then the data is moved to the trash, where it resides for another week. And additionally there is still data left to delete, because the Bloom filter does not cover it all, so it needs several passes, i.e. weeks, until the data is finally cleared? And all of that unpaid for the SNO?

The Bloom filter does not delete transactionally - guaranteed and documented - but probabilistically: maybe it deletes, maybe it does not. These are two big differences.

I would rather avoid this kind of generic statement. Satellites are working well, thanks to continuous improvements. Usage is also increasing, therefore newer and newer challenges have to be solved.

The reason behind the async deletes can be found in the design doc.

Because of decentralization, especially federation, which is a form of decentralization. What you describe is possible, but the Storj software tries to be generic and usable with any satellite, depending on the decision of the Storage Node Operator.

Again, I am a technical person; my main focus is solving technical problems and helping to understand technical questions. Yours is a very generic statement, and based on the technical facts, I have a slightly different opinion.

But I understand your unhappiness: you are affected by a bug and are trying to help with the fix.

From a technical point of view, Bloom filters are working well for most of the nodes, but nodes with a huge number of segments might be affected by a bug (which became a bug only thanks to the expansion of the network). Please follow the linked issue for the progress of the fix. (Thanks to your contribution of the piece list, it is now easy to test any changes.)

Sorry, I didn't get it completely.
So if our nodes have a disk usage discrepancy, the problem will resolve itself by running the node for several weeks, so GC and the filewalker can do their work?

Yes, just stay tuned.

We are aware of one problem, and working on fixes.

  • Small discrepancies are possible (due to the architecture, and because it's a highly distributed system: we couldn't have one snapshot view from the same moment).
  • Only big (>15M pieces per Satellite) Storagenodes are affected by the (known) bug.
  • If you have small Storagenodes and large discrepancies → check whether you deleted the data of the legacy Satellites.
  • If you have a small Storagenode and large discrepancies for active Satellites: let me know your numbers (I believe there is no such case).
I have read the short discussion on GitHub regarding the 45M/25M node.

Even if a larger Bloom filter reduced the number of pieces to around 25M, it would eventually grow again to around 45M legitimate pieces, assuming of course they stay the same size. In that case the filter needs to be large enough to still be effective at 45M pieces that should be kept.
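To get a feel for the numbers: the classic Bloom filter sizing formula is m = -n·ln(p)/(ln 2)² bits, where n is the number of kept pieces and p the false-positive rate. The small sketch below computes the filter size for 25M vs 45M pieces; the 10% false-positive target is an assumption for illustration, not a confirmed satellite parameter.

```go
package main

import (
	"fmt"
	"math"
)

// bloomSizeBytes returns the optimal Bloom filter size in bytes for n items
// at false-positive probability p, using m = -n*ln(p)/(ln 2)^2 bits.
func bloomSizeBytes(n, p float64) float64 {
	mBits := -n * math.Log(p) / (math.Ln2 * math.Ln2)
	return mBits / 8
}

func main() {
	// A 10% false-positive target is an assumption for illustration only.
	const p = 0.10
	for _, n := range []float64{25e6, 45e6} {
		fmt.Printf("%.0fM pieces -> ~%.1f MB filter\n", n/1e6, bloomSizeBytes(n, p)/1e6)
	}
}
```

Under these assumptions that comes out to roughly 15 MB for 25M pieces and 27 MB for 45M pieces - well above the 1-2 MB mentioned earlier, which would be consistent with why very large nodes run into trouble.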

Even larger nodes probably will exist.

I can’t even imagine how many files could be on a full node running one of those new 22TB drives?

(maybe nobody has managed to fill one yet?)

I would expect at least 60-70M files.
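(A rough sanity check, assuming an average piece size of about 330 KB - an assumption, not a measured network value: 22 TB ÷ 330 KB ≈ 66 million pieces, so 60-70M files is plausible.)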

Well, I could count for you - I've got a shy 8 TB and am slowly filling a 10 TB and a 14 TB drive. But I need a command for the Windows GUI, please, anyone - a PowerShell one that produces stats for each satellite folder - because I have a sneaking suspicion that the problem exists on smaller nodes as well, and I want to help find some clues.

So, until a new Bloom filter design is out and we see the results, I would speculate that under the current conditions it is better to run multiple small nodes on one big drive than a single one?
For example, we should split a 22 TB drive into 3 nodes of 7 TB each?
I don't know how the performance of the entire cluster of nodes would be affected, but this seems the logical conclusion.

The node from my earlier post recently got 500 GB marked as trash, as seen above.

I see this as a huge step forward.

Still a bad idea: one drive can't handle 3 nodes - not even with caching, I guess.