A new logo for storagenode

jammerdan · May 12, 2024, 7:14am

Ok so, sorry about the tone, but I am really angry about it. All I can say is that I find the current situation really unsatisfying. Honestly speaking, quite a disaster has unfolded and management is silent. I get the uneasy sense that they are detached from the reality of the nodes.

Yes, technical team members are on the threads actively trying to straighten things out, but they also make clear that they speak for themselves and not for the company. And they don’t speak for management. Where is the “head of something” to make a statement about the various issues we are plagued with? Or to say that the progress on fixing things is closely monitored at the top level and that steps are being introduced to prevent such a thing from happening again? That would be management leadership.
I have been with Storj for some time now and I cannot remember a situation like this before, where there are so many issues that you ask yourself if Storj is up to par or even capable of handling its own code.

There’s that issue with a fundamental core functionality of the network, where it failed to delete hundreds of terabytes of customer data from the nodes. This could affect customers who trust that their data was erased, and could even pose legal issues. But of course node earnings are also impacted when nodes are full but not getting paid for the space used and, additionally, cannot receive ingress because they are considered full. In the end even the network cannot ingest data as expected when nodes report as full while in reality there is plenty of space, because the nodes are filled with garbage.
This was caused not by a single implementation but by several that proved inadequate, which makes it appear as if they were built on a completely different impression of how the network behaves than what it does in reality. It appears there was not enough understanding, anticipation, testing and monitoring.

Now, when observing how things are getting fixed, I am seeing new issues surfacing left and right. Nodes were unintentionally downgraded, which has created new risks, as downgrades usually are not tested. Thankfully the next disaster, a tsunami of broken nodes because of that, did not happen, but it seems we were just lucky.
And with the fixes thrown at us, new issues arrive, some of them even introducing the same kind of problems again, like:

Filewalkers not updating databases, so ingress stops again for some nodes
Total mess with trash date folders leading to data not getting deleted from the nodes again and no ingress for nodes again
Bloomfilters getting deleted again instead of getting stored and resumed, preventing deletions and no ingress again
Resuming used-space filewalker not implemented yet, which means for some nodes the used-data garbage gets deleted but used space does not get corrected so still no ingress.

And SNOs must find workarounds for these new issues left and right.

Of course management can choose what to comment on and pretend that’s how it should be but that is just sad. This is not the quality that I have felt Storj was delivering in the past and if I was a customer it is certainly not the quality that I would entrust to store my encryption passphrases for me as Storj plans to do for their customers. So for me I can say this situation does not feel normal and costs reputation. And I ask myself

Is management aware of all of this and of the core functionalities of the network failing on multiple levels? Do they even consider this normal?
How was it possible that the code was not prepared for growing and larger nodes, why was this not anticipated and constantly monitored and tested with the nodes growing?
Is there generally enough testing? Does Storj run enough nodes themselves with different sizes, ages, speeds and hard- and software to test properly?
Is monitoring sufficient? Why is it always the node operators who detect the new issues?
Are coding processes up to par or do they need refinement?
Are developer resources sufficiently allocated?
Is there a roadmap for node software with clear operational goals and plans for what performance should be achieved and when?

Well, and if they agree that the current situation is not normal, not what it should be, and needs improvement, then I ask: what are their plans to prevent such a case in the future and regain reputation?
External audits for code and processes?
Hackathons?
Add developer resources?
Provide better tools for devs and SNOs so issues can be detected, investigated and reported better and quicker? (First reports that something must be wrong with the used space were posted 1 year ago in the forum)

But it is not my responsibility, nor of course do I have the insight into the current mess to make adequate suggestions and introduce the required changes. That’s the duty of management. All I can say is, it does not feel the way it should be.

And before anybody can get me wrong: I am seeing the work of the devs and their attempts to fix things and at least in theory some things that were pushed out look promising. But to put it with the words of another SNO:

Alexey · May 12, 2024, 9:17am

I would hope that version v1.104.x will fix all the performance issues.
However, there is a lot of work, of course. And management is aware, but they trust our engineers, because they can solve any issue; they just need time, a resource which is unfortunately tight. But we are working on them!

littleskunk · May 12, 2024, 9:44am

The reality is engineers have limited time. Sure, we can fix all the storage node issues almost as soon as they show up. But at what cost? It does mean there will be less time for other issues. So the alternative would be a world with no storage node issues but also no paid space on disk. You can’t get both.

So what is more important for your storage node? All bugs fixed, even if that means no customer data, or as much customer data as possible, even if that means there are some bugs to deal with?

bre · May 12, 2024, 10:45am

Thank you. I really appreciate you taking the time to explain all that you have been thinking on this. I could tell there was more behind your comment which is why I asked.

I can only echo what Alexey and Littleskunk have said regarding technical issues, as they are closer to the inner workings of what you are asking. Management is aware and plans are in place according to all available time and capacity of efforts.

Regarding hackathons and such, it was discussed and explored in the past however things of this nature were put on the back burner due to the limitations of time and necessities of prioritization. It’s just where things are in the current growth stages. However these are things we can revisit in the future. I know the Community would deliver astounding hackathon entries, and speaking for myself I hope we get to do something like this with a lot of lead time so everyone can show off their best ideas.

Alexey · May 12, 2024, 10:50am

And I want to visit a hackathon in Germany (I have relatives there also). I love this country, even if the internet there is crap…

Roxor · May 12, 2024, 11:06am

As another perspective: I see a Storj node as an app that just kinda does its own thing, that I don’t really have to do maintenance on and can ignore. It upgrades itself. I get email if it goes offline. Payouts arrive promptly every month, and they’re increasing as the disk fills. If there’s a “disaster” in there… I must have missed it.

I don’t care if stats aren’t accurate to the precise bit: if a node is Status:Online and QUIC:OK and I’m winning 95+% of upload/download races I’m happy. If extra trash has been hanging around that’s no great hardship: especially if Storj reps in this forum are telling us what’s being worked on (right down to exposing specific code commits).

So I hope the Storj team knows their efforts are appreciated! For every HDD you fill I’ll bring a fresh one online… for as long as being a SNO remains easy money

jammerdan · May 12, 2024, 11:48am

Yes I think you did:
Not deleting customer data is the first disaster.

The second becomes clear if you put yourself in the shoes of some other SNO. Let’s say someone with a 10TB HDD. Like you, he watches the node grow. Used space is going up. Soon the node hits 10TB of used space. Based on how quickly this went, the SNO purchases a 20TB disk to accommodate the expected growth. And then a message arrives telling him that Storj is terribly sorry that the deletion code did not work as expected, and that 6TB of his now 12TB on the new HDD are garbage and will get deleted.
And I am sure somebody will mention the “don’t invest” mantra. But even if this SNO had not bought the larger disk and had stuck to his existing disk, a full 10TB disk where 6TB are unpaid, non-deleting garbage, instead of receiving the ingress it would be capable of, is a disaster.
I don’t have another word for it.

Alexey · May 12, 2024, 11:54am

This is not easy to implement. Remember: everything is encrypted, including metadata. So we can only mark these segments as expired (this is implemented), and then the satellite will remove all expired segments (actually it will send this to the storagenodes, of course). Then, after a while, all data from the expired accounts will be deleted and the satellite will be able to remove the now empty account…

littleskunk · May 12, 2024, 1:52pm

Everything seems to be a disaster for you. How about you shut down your node to stay away from any future disaster?

Roxor · May 12, 2024, 2:01pm

They may not have time to decommission their node.

This morning they were short on milk for their breakfast cereal: so had to drive to the store to address that calamity. And they hit every single red light on the way: which was a major setback. Then there was the catastrophe of having less than a half a tank of gas. And the fiasco of the radio station not playing their favorite song. Don’t even mention the debacle of trying to find a parking space! And in the end… tragically… the milk wasn’t on sale.

Who has time to deal with a node… when the world is falling apart…?

jammerdan · May 12, 2024, 2:02pm

Yes, of course as SNOs we want our drives filled up as fast as possible, but Storj created a situation where they filled them up until they were full and then failed to delete data accordingly, for multiple different reasons.
Is this what SNOs want? Probably not.
If you think these are funny little bugs SNOs should not care about, then I cannot help you. If it is so funny and minor then pay me for the space you were occupying.

littleskunk · May 12, 2024, 2:07pm

I am running a storage node. I don’t need any help keeping it running. I am happy with my current payout. I might call some decisions bad, but I have to admit the outcome speaks for itself. Time and time again my manager told me that the fix I am waiting for isn’t the highest priority, and time and time again that was correct. My node didn’t run garbage collection for 2 months. Sure, the cleanup at the end was a bit painful, but my payout keeps increasing. I don’t care about some painful side effects as long as my payout keeps increasing.

Alexey · May 12, 2024, 2:18pm

This is funny, but I believe it shows some disrespect for the interlocutor, sorry.
I do run my nodes too… But perhaps I’m too tolerant of the issues… because I do not watch for them all the time…

jammerdan · May 12, 2024, 2:20pm

For me deleting data from the network is a fundamental core functionality that should just work. If that fails it is a huge issue. If it fails on several different levels for a long time and remains unnoticed then it’s a disaster for me. There were hundreds of terabytes of customer data that did not get deleted for multiple reasons and Storj did not realize or notice. I would be very much interested in the reaction of customers if they get told that their data does not really get deleted when they press delete, but “some” time in the future. Maybe after 7 days maybe even longer. I doubt this is what they would expect.

How about Storj paying me for the space they really occupy on my nodes and abstain from creating future disasters?

Alexey · May 12, 2024, 2:22pm

We are working on fixes.

jammerdan · May 12, 2024, 2:32pm

Great.

Ok. Not so great.

Great for you.

ACarneiro · May 12, 2024, 2:34pm

Errrr…. I thought the deletes were dealt with by bloom filters?

Haven’t really noticed any major dramas over the last few months but then again I have loads of free space available left in my nodes so even if there is currently an issue it will not impact me.

If there is indeed leftover garbage will it be deleted at a later date or do I have to take some action to manually delete it?

Errrr…. deleting files is not easy to implement? Isn’t that core functionality? Or did I misunderstand your post?

jammerdan · May 12, 2024, 2:34pm

No, I did not see payment for the data that remained on the nodes because the bloom filters were too small. I did not see payment for garbage that remained on nodes because bloom filters got deleted at a node restart. And I did not see payment when garbage remained because it was not moved to trash, since the collector restarted before it finished.
As said: Pay me for what Storj really occupies and you can have as many bugs or failures by design like that without me complaining a single bit.
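
For readers wondering why an undersized bloom filter leaves garbage behind, here is a minimal sketch. It assumes the general scheme discussed in this thread: the filter describes the pieces that should be kept, and the node moves everything that does not match to trash. The `BloomFilter` class, the `leftover_garbage` helper and the sizes are made up for illustration and are not Storj's actual code. A bloom filter never reports a live piece as missing, but when it is too small for the number of pieces, garbage pieces match by accident and stay on disk.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter (illustrative only, not the storagenode implementation)."""

    def __init__(self, size_bits, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def leftover_garbage(live, garbage, size_bits):
    """Return the garbage pieces that wrongly match a 'keep' filter of size_bits."""
    keep = BloomFilter(size_bits)
    for piece in live:
        keep.add(piece)
    # Pieces that match the filter are kept; a garbage piece that matches
    # is a false positive and stays on disk, unpaid.
    return [piece for piece in garbage if piece in keep]


live = [f"live-{i}" for i in range(1000)]
garbage = [f"garbage-{i}" for i in range(1000)]
print("tiny filter leaves:", len(leftover_garbage(live, garbage, 64)))
print("large filter leaves:", len(leftover_garbage(live, garbage, 100_000)))
```

The larger the node, the larger the filter has to be to keep the false-positive rate low, which is one way growing nodes could end up holding more and more undeleted data.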

Alexey · May 12, 2024, 2:40pm

They are as far as I know:

no, just wait for the latest update.

no, the deletion itself is easy. But we need to have a safety buffer: if we introduced a bug, in the current situation we would have a week to resolve it before it became a catastrophe for everyone.
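
A rough sketch of that buffer idea, assuming trash is kept in folders named by the date the pieces were trashed and purged only after a week. The folder naming, the `purgeable` helper and the `RETENTION` constant are illustrative assumptions, not the real storagenode layout or API.

```python
from datetime import date, timedelta

# Assumption: a one-week safety window, as mentioned in this thread.
RETENTION = timedelta(days=7)


def purgeable(trash_folders, today, retention=RETENTION):
    """Return the date-named trash folders old enough to be deleted for good."""
    result = []
    for name in trash_folders:
        trashed_on = date.fromisoformat(name)  # folder name like "2024-05-01"
        if today - trashed_on >= retention:
            result.append(name)
    return sorted(result)


folders = ["2024-05-01", "2024-05-06", "2024-05-11"]
print(purgeable(folders, today=date(2024, 5, 12)))
```

Anything younger than the retention window stays restorable, so a buggy garbage collection run could still be undone within that week.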

Alexey · May 12, 2024, 2:43pm

ok. I corrected my response: not all of it is paid, I agree. However, it could lead to a DQ if you deleted it yourself.