A new logo for storagenode

You are mixing issues here. The terabytes of customer data were paid data. Storj realized it and decided to pay for the data a bit longer and focus on the customer side first.

I don’t get what this thread is about. If you want to leave the network, do it. Why do we have to talk about all this for so long? You know, I could spend my time on something more useful, like further performance improvements! So if you don’t have any further questions, I am going to leave this thread alone.

3 Likes

The whole thread is quite pointless.

  • All software has bugs.
  • Resources are limited, and should be allocated to fixing bugs and other activities according to some priority metric.
  • Priority is decided by Storj, to best serve the company’s interests.
  • SNO interests are different, and therefore in their view the company will always be spending time on some nonsense instead of fixing this one bug that affects SNOs but nobody else.

There are two possible courses of action for the SNO:

  • exit the project: nobody needs another source of stress
  • stay in the project: with the understanding that the company that started it, has kept it running for quite a while, and is the whole reason SNOs are a thing in the first place kind of knows what it is doing, and will make decisions in the best interest of the project, which ultimately bring SNOs more money.

Personally, I don’t think half the data being unpaid trash is such a problem. There is plenty of space on the network anyway, and Storj payments are pure free money, so complaining about receiving a smaller amount of free money is sort of ridiculous.

6 Likes

Mind sharing which code fixes you mean specifically? Let us go over the code together and you can show me what you mean by incompetence.

4 Likes

No one wants bugs, especially in production. However, SNOs are not paying customers. You are exposed to an accelerated release schedule, so fixes and updates happen sooner than they would for customers. You should expect some bumps in the road.

Someone like LittleSkunk logs on to the forum and shares details with everyone here as a courtesy. Nobody asked him to do this. He could just keep working and avoid some of the negative comments directed at him while he is trying to provide positive news.

When the forum is hostile and argumentative, people don’t want to come in here and post. It isn’t inviting or friendly. That doesn’t mean you shouldn’t discuss matters of concern, but there are different tones one can choose and words matter. You catch more flies with honey, as they say.

For the record, Alexey, Heunland, Bre and I bring your concerns up to the staff all the time. Every day. Many an Engineer jumps in to answer questions and assist. And Executives have come in here and answered questions; you may just not know who they all are, or you may want something more specific than what you have read.

I for one read every single message posted and responded to in this forum. It is a lot of content on any given day. And areas of concern are passed along, all the time. The Admins and Support staff advocate for you, all the time. And the fast responses and off-hour activity of people like LittleSkunk and others show that they care as well.

Try and remember that it is a small company: not everything is the way you would want it to be, and it isn’t always the way the company wants it to be either. But as Storage Node Operators, you should consider that the Engineers are trusting you to understand the tradeoffs they have to make at times while they work out issues with changes to the software, and that the primary focus is on paying customers and future sales, so the company can grow and more and better processes can be put into place to assist with troubleshooting for all aspects of the platform.

If you feel yourself getting too worked up, step away for a while and come back more clear-headed. Let’s work together to solve problems, not against one another.

8 Likes

I am very much looking forward to version 1.104 getting deployed. I hope it will come soon and that it will solve a lot. So what you are writing is not even the (main) issue for me.

Look: one of my nodes was 7.5TB full. Today it is at 1.5TB after all the cleansing that has taken place. And I am even happy that it can take ingress again now.
But the simple question is: why were things implemented the way they were in the first place?

Today we know that the bloom filter size was not sufficient, and data was not deleted from the nodes because of it. Some people don’t understand that it is normal to ask: why was such a small size chosen, and why was it not monitored, as nodes grew, whether the size was still sufficient? I am absolutely fine with starting with a small size for an easy and cheap implementation, but then I would expect constant monitoring of whether the size still fits, all the more because the move towards larger nodes was foreseeable after the price cuts and the growing customer base.
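For context on why a fixed filter size stops being enough: the standard bloom filter sizing formula ties the filter size to the number of stored pieces and the acceptable false-positive rate. A small sketch (generic bloom filter math with an example target rate, not Storj’s actual parameters):

```go
package main

import (
	"fmt"
	"math"
)

// requiredBits returns the bloom filter size (in bits) needed to keep the
// false-positive rate at or below p for n stored pieces, using the standard
// formula m = -n*ln(p) / (ln 2)^2.
func requiredBits(n int, p float64) float64 {
	return -float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)
}

func main() {
	p := 0.1 // example false-positive target; the real satellite value may differ
	for _, pieces := range []int{1_000_000, 10_000_000, 40_000_000} {
		bits := requiredBits(pieces, p)
		fmt.Printf("%9d pieces -> ~%.1f MiB filter\n", pieces, bits/8/1024/1024)
	}
}
```

Roughly, ten times more pieces needs a ten-times-larger filter for the same false-positive rate, which is why a size that was fine for small nodes silently stops working as nodes grow.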

But there were also other implementation choices that, independently of the bloom filter size, prevented deletions. The bloom filter was kept in RAM only, so as soon as a node restarted, the filter was gone and the remaining garbage was not collected.
Again, the natural question: why was such an implementation chosen when we know that nodes get restarted frequently, at minimum every two weeks for updates, but often more frequently, for example when you change the assigned storage space? With nodes getting larger (and getting even larger because the garbage could not be collected), they were not able to finish a run before the next restart. I am not saying such an implementation has to be avoided at all cost, but at least it should be monitored whether it still works as it should when nodes grow and restart.
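To illustrate the alternative being asked for here: the received filter could be written to disk before the garbage collection walk starts and reloaded on startup, so a restart does not throw it away. A minimal sketch with made-up names, not the actual storagenode code:

```go
package gc

import (
	"os"
	"path/filepath"
)

// SaveRetainFilter persists a received bloom filter to disk before garbage
// collection starts, so a restart does not throw the work away.
// Names and layout are illustrative, not the real storagenode implementation.
func SaveRetainFilter(dir, satelliteID string, filter []byte) error {
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return err
	}
	tmp := filepath.Join(dir, satelliteID+".tmp")
	if err := os.WriteFile(tmp, filter, 0o600); err != nil {
		return err
	}
	// Atomic rename so a crash mid-write never leaves a half-written filter.
	return os.Rename(tmp, filepath.Join(dir, satelliteID+".bloom"))
}

// LoadRetainFilter is called on startup: if a saved filter exists, garbage
// collection can resume instead of waiting for the next filter to arrive.
func LoadRetainFilter(dir, satelliteID string) ([]byte, error) {
	return os.ReadFile(filepath.Join(dir, satelliteID+".bloom"))
}
```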

And the third was the implementation of the trash collector that collected during its run and moved the pieces to trash only after it had finished, meaning that if it got interrupted, nothing got deleted.
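The difference between the two collector strategies fits in a few lines; a simplified sketch, not the real code:

```go
package collector

// Two simplified collector strategies; illustrative only.

// collectThenMove gathers every unreferenced piece first and moves it to
// trash only at the very end, so an interrupted run deletes nothing.
func collectThenMove(pieces []string, keep func(string) bool, moveToTrash func(string)) {
	var toTrash []string
	for _, id := range pieces {
		if !keep(id) {
			toTrash = append(toTrash, id)
		}
	}
	for _, id := range toTrash {
		moveToTrash(id)
	}
}

// moveImmediately trashes each piece as soon as it is identified as garbage,
// so an interrupted run still makes progress.
func moveImmediately(pieces []string, keep func(string) bool, moveToTrash func(string)) {
	for _, id := range pieces {
		if !keep(id) {
			moveToTrash(id)
		}
	}
}
```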

And the fourth is the case of the never-ending used-space filewalker that was not able to update the databases correctly when it did not finish before the next restart. Same question: why choose such an implementation when we know the nodes are getting larger, filewalkers are running longer, and we have frequent restarts due to forced updates?
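As an illustration of the resumable approach the newer releases take, a used-space walk can periodically persist a small checkpoint (a cursor plus running totals) and pick up from there after a restart. A rough sketch with hypothetical names, not the actual implementation:

```go
package walker

import (
	"encoding/json"
	"os"
)

// Checkpoint is the state a used-space walk needs in order to resume:
// the last piece prefix fully scanned and the bytes counted so far.
type Checkpoint struct {
	LastPrefix string `json:"last_prefix"`
	UsedBytes  int64  `json:"used_bytes"`
}

// Save writes the checkpoint atomically so a crash never leaves it corrupt.
func (c Checkpoint) Save(path string) error {
	data, err := json.Marshal(c)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// Load returns the saved checkpoint, or a zero value to start from scratch.
func Load(path string) Checkpoint {
	var c Checkpoint
	if data, err := os.ReadFile(path); err == nil {
		_ = json.Unmarshal(data, &c)
	}
	return c
}
```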

The question for all of these implementations is why it was chosen to do it this way even though nodes were foreseeably getting larger, filewalker runs were taking longer, and nodes were getting restarted frequently.
The fixes we see today are larger bloom filters, bloom filters that get stored on disk and filewalkers that pick them up and resume, collectors that move pieces into the trash immediately instead of waiting until the run is finished, and even used-space filewalkers (hopefully deployed soon) that can resume their runs and start where they left off. But I hope we don’t see the same mistake again.

So these fixes are all great. But they were also greatly needed. And the flaws should have been detected sooner; ideally, things should have been implemented from the beginning the way we see them today.
And this is my main issue: why was it not implemented like that in the first place? I think it could and should have been anticipated that it is not a good idea to interrupt long-running processes and throw away whatever they are working on, or make them start from the beginning after a restart, while also forcing frequent restarts on them. At least if you had asked me, I would have said that trashing the bloom filter when the node restarts, for example, does not sound like a good idea. But if you go that route, then at least some monitoring and telemetry checks should be put in place so that we can see when it starts to fail and a different implementation is needed. Implementations like the ones we see today, which take into account that nodes need more time to do their tasks and that we are better off storing tasks and resuming them instead of throwing them away, seem to be the better ones.
And this is also why the issues I mentioned are not just simple bugs to me. They were working as intended. Deleting the RAM-only bloom filter on restart, for example, is not a bug; it was working exactly as it was designed to. But it was not the right solution.

5 Likes

It’s very, very simple. When designing a system like this, you really want to err on the side of caution.

It’s infinitely better to accidentally keep something that could have been deleted than delete something that should have been kept.

Because a screw-up in the former case results in a few upset SNOs; in the latter, it results in the demise of the company.

So all the focus is on not losing data. Cleaning out trash is just the cherry on the cake: nice to have, but the lowest possible priority, both in terms of spending development time and in terms of runtime on the node. You’d rather your node spend resources serving customers than sorting out trash.

That’s it. That’s the whole reason.

Obviously, resumable anything is better, if the labor cost is zero.

Implementing resume takes development time. You would need to track progress, save and restore state, validate state, etc. And for what? What’s the cost of not doing it? Some trash on some nodes will persist until the next bloom filter. Big deal. It’s nothing. It’s not worth spending development time on.

What I do agree with: there should be a switch to disable collecting local stats. Wasting IOPS on databases to draw ugly plots in the dashboard is a 100% unnecessary gimmick.

I would implement the switch, make it enabled by default (i.e., suppress local stats), and then monitor how many operators disable it to see stats. If that is less than 10%, nuke the code out. Less code means less work to maintain it and fewer bugs.
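Such a switch could be little more than a config flag that short-circuits the stats writes. A hypothetical sketch (the flag name is made up, not an existing storagenode option):

```go
package console

import "flag"

// suppressLocalStats is a hypothetical flag, enabled by default as proposed
// above, that skips writing bandwidth/usage rollups to the local databases.
var suppressLocalStats = flag.Bool(
	"console.suppress-local-stats",
	true,
	"skip writing local dashboard statistics to the node databases",
)

// recordUsage would be called wherever the node currently updates its local
// stats databases; with the switch on, it becomes a no-op and saves the IOPS.
func recordUsage(writeToDB func() error) error {
	if *suppressLocalStats {
		return nil
	}
	return writeToDB()
}
```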

6 Likes

I have advocated for removing the databases since two years ago. We keep babysitting them and moving them around, and they are not even essential for running a node.
But, as history has shown, they help with spotting bugs and wrong software choices.
I don’t know what the best option would be, but it seems a very low priority for the moment.
The certainty is that we need stats. We need to see what’s happening with our nodes, and we need it without third-party software.
We need the dashboard, and I don’t know how the dashboard would work without databases. Maybe there are better solutions, text files, etc., but I’m not an expert on this.
I understand jammerdam’s position; I am in the same position… all my nodes were filling up quickly because of undeleted garbage, and they filled up. I didn’t know the main cause at the time, and I bought 8 new 22TB drives to expand. After half a year and lots of TB of trash, my first drives still haven’t filled up. So if the nodes had been working correctly, maybe I would have delayed the purchase by 6-12 months.
Do I regret it? Not really. It was an unplanned expense, I had to gather the money quickly, but now I’m ready with all new nodes vetted for big customers; still, it was a rushed situation that could have been avoided.
Like me and jammer, there are many SNOs, so I understand why some are angry.
Am I angry with the Storj dev team? Until now, not a bit!
I understand how creating and improving software works: you have infinite options for doing something, and you have to choose what seems at that moment to be the best one, or the one with the best balance between resources spent and results, or the one that satisfies the most parties involved. And as time goes by and feedback comes in, you adapt, you improve, you change priorities, the resources allocated, etc.
As long as I see the involvement of the entire Storj team - coders, management, forum moderators, etc. - in listening to our needs, SNOs and customers alike, and in improving the software and the entire experience, I’m happy.
We, as non-coders, want everything done now.
But there are steps that must be taken to ensure the changes are working and make things better rather than worse.
The way Storj has dealt with problems has worked pretty well until now, and the software is pretty stable. They even increased the speed of new releases to stop our whining, but I would advise caution not to fall into the other extreme: releases that are too fast, with more bugs introduced than solved.
So have some patience and let them come up with solutions in a proper manner.

8 Likes

Because implementing the changes required to fix these bugs would mostly benefit SNOs, not the network itself, and would provide no added value to the company.
There is a ton of free space. Who cares if some nodes have more garbage on them than they should?
Storj made a judgement call that developer resources should be allocated somewhere else, because fixing the problems you mentioned would, at that time, have brought negligible improvement to the network.

I think the key is that this community spotted the issues and reported them and Storj acknowledged it. We have done our bit (arguably, people like you and quite a few tech-savvy SNOs have gone above and beyond by spotting problems and suggesting solutions to them).

What they do with that information then becomes a strategic decision for the company bosses to make and we, as SNOs, can either go along with it or exit the project.

I think that whilst it is necessary to always aim for improvements and bug fixes, it is not reasonable for us to try to dictate the company’s direction of travel or resource allocation based exclusively on technical aspects, no matter how frustrating the bugs may be and how much it annoys people who want the software to be perfect ASAP.

4 Likes