A new logo for storagenode

I am very much looking forward to version 1.104 getting deployed. I hope it happens soon and that it solves a lot. So what you are writing about is not even the (main) issue for me.

Look: one of my nodes was 7.5TB full. Today it is at 1.5TB after all the cleanup that has taken place. And I am even happy that it can take ingress again now.
But the simple question is: why were things implemented the way they were in the first place?

Today we know that the bloom filter size was not sufficient. Because of that, data was not deleted from the nodes. Some people don’t understand that it is normal to ask: why did you choose such a small size, and why was it not monitored, as nodes grew, whether the size was still sufficient? I am absolutely fine with starting with a small size for an easy and cheap implementation, but then I would expect constant monitoring of whether the size still fits, especially since the move towards larger nodes was foreseeable after the price cuts and the growing customer base.
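The size question can be made concrete. For a bloom filter of m bits with k hash functions tracking n pieces, the standard false-positive approximation is (1 − e^(−kn/m))^k, and since a garbage piece that falsely matches the filter is treated as "keep", that rate is roughly the fraction of garbage that survives a GC run. A small illustration (the numbers are made up for the sketch, not Storj’s actual parameters):

```python
import math

def bloom_false_positive_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Standard approximation of a bloom filter's false-positive rate.

    A garbage piece that falsely matches the filter is kept, so this is
    roughly the fraction of garbage that survives one GC run.
    """
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# A filter sized at ~10 bits per piece (about 1% false positives) for 10M
# pieces degrades badly once the node grows to 100M pieces: nearly all
# garbage then matches the filter and is never collected.
m = 100_000_000  # 100M bits, ~12.5 MB (illustrative size only)
for n in (10_000_000, 50_000_000, 100_000_000):
    print(f"{n} pieces: ~{bloom_false_positive_rate(m, 7, n):.1%} of garbage survives")
```

The point of the sketch: the filter does not fail loudly when it becomes too small, it just silently stops deleting, which is exactly why ongoing monitoring of the size/piece-count ratio would have caught this.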

But there were also other implementation choices that, independently of the bloom filter size, prevented deletions. Until now the bloom filter was kept in RAM only, so as soon as a node restarted, the filter was lost and the remaining garbage was not collected.
Here again the natural question: why was such an implementation chosen when we know that nodes restart frequently, at minimum every 2 weeks for updates, and even more often when, for example, you change the assigned storage space? And with nodes getting larger (even larger because garbage flagged by the bloom filter could not be collected), they were not able to finish a run before the next restart. I am not saying such an implementation has to be avoided at all cost, but at least it should have been monitored whether it still worked as intended as nodes grew and restarted.
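The restart-friendly alternative that later versions moved to can be sketched as: persist the filter to disk the moment it arrives from the satellite, reload it at startup, and delete it only once the GC walk has completed. A minimal sketch, with a hypothetical file path and function names (this is not Storj’s actual code, which is written in Go):

```python
import os
from typing import Optional

FILTER_PATH = "gc/bloomfilter.bin"  # hypothetical location inside the storage dir

def save_filter(raw: bytes) -> None:
    """Persist the satellite-sent bloom filter before GC starts,
    so a node restart does not discard it."""
    os.makedirs(os.path.dirname(FILTER_PATH), exist_ok=True)
    tmp = FILTER_PATH + ".tmp"
    with open(tmp, "wb") as f:
        f.write(raw)
    os.replace(tmp, FILTER_PATH)  # atomic: never leaves a half-written filter

def load_pending_filter() -> Optional[bytes]:
    """On startup: if a filter is still on disk, the previous GC run was
    interrupted and should be resumed rather than skipped."""
    if os.path.exists(FILTER_PATH):
        with open(FILTER_PATH, "rb") as f:
            return f.read()
    return None

def finish_gc() -> None:
    """Remove the on-disk filter only once the walk has fully completed."""
    os.remove(FILTER_PATH)
```

The design point is simply that the filter’s lifetime is tied to the completion of the GC run, not to the lifetime of the process.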

And the third one was the implementation of the trash collector, which collected candidates during its run and moved the pieces to trash only after it had finished, meaning that if it got interrupted, nothing got deleted.
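The difference between the two behaviors is easy to show side by side. In this sketch (hypothetical helper names, not Storj’s code), the batch variant loses all its work when interrupted, while the incremental variant keeps everything it has already moved:

```python
def collect_batch(pieces, keep, move_to_trash, interrupt_after=None):
    """Old-style behavior: identify candidates first, move them only at
    the end. If interrupted mid-scan, the candidate list is lost and
    nothing is trashed."""
    candidates = []
    for i, piece in enumerate(pieces):
        if interrupt_after is not None and i >= interrupt_after:
            return  # simulated restart: all collected candidates are lost
        if piece not in keep:
            candidates.append(piece)
    for piece in candidates:
        move_to_trash(piece)

def collect_incremental(pieces, keep, move_to_trash, interrupt_after=None):
    """Fixed behavior: trash each piece as soon as it is identified, so an
    interruption only loses the not-yet-visited remainder."""
    for i, piece in enumerate(pieces):
        if interrupt_after is not None and i >= interrupt_after:
            return  # simulated restart: earlier moves are already done
        if piece not in keep:
            move_to_trash(piece)
```

Running both over the same piece list with a simulated interruption shows the batch variant trashing nothing at all, while the incremental variant has already banked part of the work.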

And the fourth is the case of the never-ending used-space filewalker, which could not update the databases correctly when it did not finish before the next restart. Same question: why choose such an implementation when we know that nodes are getting larger, filewalker runs are taking longer, and restarts are frequent due to forced updates?
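A resumable walk of the kind described can be sketched as: after each completed piece-ID prefix, persist the running total and the position, so that after a restart the walk continues where it left off instead of starting over. All names here are hypothetical, for illustration only:

```python
import json
import os

CHECKPOINT = "usedspace.checkpoint.json"  # hypothetical checkpoint file

def walk_used_space(prefixes, size_of_prefix):
    """Sum used space over piece-ID prefixes, checkpointing after each
    one so an interrupted run resumes instead of starting from scratch."""
    done, total = set(), 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
        done, total = set(state["done"]), state["total"]
    for prefix in prefixes:
        if prefix in done:
            continue  # already counted before the restart
        total += size_of_prefix(prefix)  # the expensive directory scan
        done.add(prefix)
        with open(CHECKPOINT, "w") as f:
            json.dump({"done": sorted(done), "total": total}, f)
    os.remove(CHECKPOINT)  # finished cleanly; checkpoint no longer needed
    return total
```

With this shape, a restart costs at most one prefix of repeated work instead of the whole run, which is exactly the property the earlier filewalker lacked.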

The question for all these implementations is: why was it done this way, despite nodes foreseeably getting larger, filewalker runs taking longer, and nodes getting restarted frequently?
The fixes we see today are larger bloom filters; bloom filters that get stored on disk, with filewalkers that pick them up and resume; collectors that move pieces into the trash immediately instead of waiting until the run is finished; and even used-space filewalkers, hopefully deployed soon, that can resume their runs where they left off. But I hope we don’t see the same mistakes again.

So these fixes are all great. But they were also badly needed, and the flaws should have been detected sooner; ideally, things should have been implemented from the beginning the way we see them today.
And this is my main issue: why was it not implemented like that in the first place? It could and should have been anticipated that it is not a good idea to interrupt long-running processes and throw away whatever they are working on, or to make them start over after a restart while at the same time forcing frequent restarts on them. If you had asked me, I would have said that trashing the bloom filter when the node restarts, for example, does not sound like a good idea. But if you go that route, then at least some monitoring and telemetry checks should be put in place so that we can see when it starts to fail and a different implementation is needed. Implementations like the ones we see today, which take into account that nodes need more time for their tasks and that we are better off storing and resuming work instead of throwing it away, seem to be the better ones.
And this is also why the issues I mentioned are not just simple bugs to me. They were working as intended. Deleting the RAM-only bloom filter on restart, for example, is not a bug; it worked exactly as it was told to. But it was not the right way to solve the problem.
