Thanks for all the feedback (especially the constructive parts).
There are many misunderstandings (and some outright false statements) in this thread. Let me share my view:
- Solving / improving GC (garbage collection) generation is still a high priority for us.
- BFs (Bloom filters) were generated during the last month, and they have been sent out (most recently 3 days ago).
- We had almost solved the majority of the discrepancy problems when the recent new load re-opened them. Earlier a node had at most 20-35M pieces; now it’s possible to have 150M+.
- We are bumping the max BF size more and more (most recently, 1-2 weeks ago, it was bumped to 25MB). You may receive a smaller BF if you don’t have that many pieces.
- A bigger max size requires more memory, and an OOM may happen if we don’t estimate the memory well. In that case, we have to restart the 2-3 day long generation process. (See the Bloom-filter sizing sketch at the end of this post.)
- We are separating BF generation per satellite (originally we generated BFs for all the satellites on the same machine, in rotation). This is WIP and will be done very soon. This is one reason why you couldn’t see very regular intervals. It will help you receive BFs more frequently.
- We can bump the BF size further, but then the problem is on your nodes: there are lots of nodes which can’t walk fast enough to delete the pieces within one day. That’s the biggest problem right now.
- There is active research to make this better. One very experimental fs stat cache is already committed. It’s not very well tested, so I don’t recommend using it yet (but for me it was an 8x speed improvement). The stat-cache sketch at the end of this post illustrates the idea.
- There are other experiments with different piece store backends. One approach uses Badger (where the metadata is stored in an LSM tree, but the values are in log files).
- The Badger-based approach is promising. It makes the walking process lightning fast (1 minute for 2.5TB), since a walk only touches the LSM tree (see the Badger iteration sketch at the end of this post). But it’s far from a final solution: due to the architecture of Badger (value log files), the write and space amplification are always higher (e.g. it may store 1.2-1.8x more data).
- But this is a bigger effort. I don’t expect different backends in the near future (at least not as a stable option).
- I have limited confidence in the reported numbers. I am aware of one issue (“`piece_spaced_used.db` database contains space-usage related to old satellites after forget-satellite”, Issue #7014 · storj/storj · GitHub) which may show increased used space for old satellites, even if their pieces are already deleted.
- It would be better to check the number of piece files in each satellite subdirectory and report the discrepancy based on that (see the piece-counting sketch at the end of this post). It’s very hard to evaluate shared numbers when I can’t be sure what the source of the numbers is and how precise they are…
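
To make the size/memory trade-off concrete, here is a back-of-envelope Bloom-filter sizing sketch in Go. The 10% false-positive rate is my assumption for the example only (not necessarily the exact rate the satellite uses); the point is that the filter grows linearly with the number of pieces, so a 25MB cap covers only so many pieces.

```go
package main

import (
	"fmt"
	"math"
)

// bloomFilterBytes returns the approximate size in bytes of an optimal
// Bloom filter for n elements at false-positive probability p:
//   m = n * ln(1/p) / (ln 2)^2   bits
func bloomFilterBytes(n, p float64) float64 {
	bits := n * math.Log(1/p) / (math.Ln2 * math.Ln2)
	return bits / 8
}

func main() {
	const p = 0.10 // assumed false-positive rate, for illustration only
	for _, pieces := range []float64{35e6, 150e6} {
		mib := bloomFilterBytes(pieces, p) / (1 << 20)
		fmt.Printf("%4.0fM pieces at p=%.2f -> ~%.0f MiB filter\n", pieces/1e6, p, mib)
	}
}
```

The flip side: at a fixed size cap, more pieces means a higher effective false-positive rate, i.e. more garbage survives each GC round.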
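
On the fs stat cache: the walkers spend most of their time asking the filesystem for per-file metadata (size, modification time) of millions of small files, so caching those answers helps a lot. The committed cache is persistent and more involved than this; the following is only a toy in-memory sketch of the idea.

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"os"
	"sync"
)

// statCache memoizes os.Stat results so repeated walks over the same
// pieces don't have to hit the filesystem again. Toy illustration only:
// unlike the experimental cache mentioned above, it is not persistent.
type statCache struct {
	mu    sync.Mutex
	infos map[string]fs.FileInfo
}

func newStatCache() *statCache {
	return &statCache{infos: make(map[string]fs.FileInfo)}
}

func (c *statCache) Stat(path string) (fs.FileInfo, error) {
	c.mu.Lock()
	info, ok := c.infos[path]
	c.mu.Unlock()
	if ok {
		return info, nil
	}
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.infos[path] = info
	c.mu.Unlock()
	return info, nil
}

func main() {
	cache := newStatCache()
	info, err := cache.Stat(".") // a repeated Stat for "." is served from memory
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(info.Name(), info.ModTime())
}
```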
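
Why is walking so fast with Badger? Keys (and small per-key metadata) live in the LSM tree, so a walk can iterate them without ever reading the value log where the piece contents would sit. A minimal sketch with the Badger library (not Storj's actual backend code, just the mechanism; the path is an example):

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Open (or create) a Badger store; the path is just an example.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	count := 0
	err = db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		// Keys-only iteration: never touches the value log, so the
		// "walk" is just an LSM-tree scan.
		opts.PrefetchValues = false
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			count++
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pieces seen:", count)
}
```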
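
For the per-satellite piece count, a sketch like this is what I have in mind. It assumes the usual blobs/&lt;satellite dir&gt;/&lt;prefix&gt;/&lt;piece file&gt; layout on disk; adjust the (example) path to your own storage directory:

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Adjust this to your node's storage directory.
	blobs := "/path/to/storagenode/storage/blobs"

	satellites, err := os.ReadDir(blobs)
	if err != nil {
		log.Fatal(err)
	}
	for _, sat := range satellites {
		if !sat.IsDir() {
			continue
		}
		count := 0
		root := filepath.Join(blobs, sat.Name())
		err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
			if err != nil {
				return err
			}
			if !d.IsDir() {
				count++ // each regular file is one piece
			}
			return nil
		})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d pieces\n", sat.Name(), count)
	}
}
```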