Two weeks working for free in the waste storage business :-(

For those using the calculator: you should update to v1.107.3 and let it run the used-space file walker. The bug where TTL deletes were not updating the used space correctly was only recently fixed. In addition, trash deletes are only reflected at the end of a successful cleanup, not in real time.

The calculator is a good tool, but people may be misled into thinking it is an authoritative source of truth.
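
If you want to sanity-check the numbers yourself once the used-space file walker has finished, something along these lines should do. This is a rough sketch only: the default dashboard port 14002 and the docker-style storage path are assumptions, so adjust both to your setup (jq is just for readability).

# What the node itself thinks it is storing (dashboard API)
curl -s http://localhost:14002/api/sno/ | jq '.diskSpace'

# What the OS actually sees on disk
du -sh /mnt/storagenode/storage/blobs /mnt/storagenode/storage/trash

If the two keep drifting apart even after a full used-space scan, that is worth reporting.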

4 Likes

That’s why there are those among us who have been saying for a while that something is wrong. Take me, for example (as I have said in another thread): I’ve been noticing that while my reported used space, taken together with the OS-reported used space, has been going up (very important to combine the two, do not quote me out of context on this), payouts have been going down every single month. That means some data is getting left behind month after month.

I’m not doing anything “special” with my nodes, so this affects every SNO out there as well.

Edit: note I don’t use the earnings script.

It’s not a matter of listening or not, bro; they are busy as F, doubling and tripling their efforts backstage, which You cannot see directly. (I can! I’m psychic, I’m scanning the energy field of Storj every day and I can sense it! ;3 )

@everyone, Mitsos,
Did You freakin’ see the amount of changes under every version?
Can You keep up with that? They have a “million” small things to keep up with and test from version to version, like, vital ones. In the big picture, it doesn’t matter if half of the SNOs are pissed off right now, or even leave, because they need to get things right for the new environment (adaptation), and they are probably computing to the max of every available human brain cell. And there is an infinity of things to compute on in a clockwork-complex system like Storj’s network.

You just can’t have it all in a short period of time. Some changes are being addressed, but the code is sitting there waiting to roll out to the nodes, and this alone can add weeks of delay! The solution is out there! But You won’t see it for weeks! Because that’s the way things are! For a development operation on this scale, look for example at how long it took Ethereum to change. It’s not just developing a prototype, where nobody is hurt if something goes wrong; they are in operation, and it’s like changing the engine of a car that is speeding down the highway.

I regret how they write back on the forum sometimes, but they don’t have much to say; they are busy doing.

Let me F’n show You this:
we are at the part where the balls look like they are moving chaotically and pointlessly, but they are not:

it only looks like that over a short period of time, but over the length of 3 minutes You can see that they didn’t descend into chaos; they are still in order, still doing what they were meant to do.

You would have to be part of the development process to understand that.
But You are not. Go apply to Storj. Contribute on GitHub, I heard it’s open source.
Oh, You’re not a programmer? Oh, then sh** up. Oh, I’m sorry, I mean: be understanding.
The big upload test began only about a month ago, a drastic change to the situation that forces tests and upgrades to all systems. Testing and observing changes made in the code takes time, no matter how good You are. You can’t just fast-forward time. That means, if You are an SNO, get ready for 2-3 months of things looking like chaos until they settle down. It’s probably not humanly possible to make it quicker under current conditions. If someone thinks otherwise, why don’t You jump in and help program all this? :smiley: Developing complex code like this is the ultimate human endeavor. :smiley:

3 Likes

These discrepancies might be overblown.
Satellite reporting has been unreliable for quite some time now, and a week ago ~20% of EU data was purged from the network, but there has been no EU BF AFAICT. On some nodes that might be TBs of data being reported as uncollected trash.
So everyone just chill out; in a month or two that data will be gone from your nodes as well.

@Ruskiem I don’t agree with the statement that “too many things are changing, so we can’t keep track of everything”.

In every normal rollout, changes are first deployed to staging systems, just to make sure that nothing breaks, then to testing systems (which verify that indeed nothing broke), and then to production systems.

This has not been happening for weeks. People are acting as if these issues are just a day old and we are all patiently waiting for a fix next week.
Lazy trash-cleanup: broken on release. Not tested.
TTL data deletion: broken on release (which reminds me, is this the first time TTL data has been uploaded to the network?). Not tested.
Bloom filters: broken since day 1 of their release (they only worked for nodes up to ~8TB). Not tested.

You see they are overworked. I see a fundamental flaw in the deployment of updates. Both can be fixed.

2 Likes

Sending BFs on a regular basis doesn’t need any further development. So why isn’t it done?

Are You a programmer? Can You help fix that?
I would kindly like to remind You that, for example, the famous Ledger (Nano S) was a complete piece of :poop: for 2-3 years until they finally managed to make it civil, in terms of the process and the way the device presents and handles sending on the screen. I could not believe how it could go without upgrades like that for sooo many years, but it did. How much bigger a name is Ledger than Storj? Yet they couldn’t fast-forward either; things have a natural order in which they fall into place. And to understand the reasons why they couldn’t be faster, You would have to be deeply involved in the process, and we here are all NOT. So humility is the key word.

No, I’m not. But last time I checked, the world hasn’t run out of programmers. I’m sure a job posting would attract the right people; two interviews later, a month of getting to know things, and a month of implementing the changes they suggest would get things fixed a lot faster than me trying to learn to program.

WRT the rest of the reply: That’s not how CI works. You make a change to the code, the code is automatically deployed and tested on staging/testing systems. If the tests fail, then the code rollout is stopped. Are you telling me that the code was tested and put into production even though it was broken? Of course not. It was simply not tested. Can it be tested so this doesn’t happen again? Yes it can.
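
To make that concrete, here is a minimal sketch of the kind of gate I mean, written as plain shell steps a pipeline could run. The deploy-staging.sh, smoke-test.sh and promote-to-production.sh helpers are hypothetical placeholders, not Storj’s actual tooling.

#!/bin/sh
set -e                        # stop the rollout at the first failing step

go vet ./...                  # static checks
go test ./... -count=1        # unit tests must pass before anything ships

./deploy-staging.sh           # hypothetical: deploy the build to a staging node
./smoke-test.sh staging       # hypothetical: run integration/smoke tests there

./promote-to-production.sh    # only reached if everything above succeeded

Nothing fancy; the point is simply that a broken trash cleanup or TTL delete would fail a test here and never reach the nodes.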

1 Like

Yeah, and how long would it take for a new person to get familiar with all this complexity and understand how it all works and why? 2-3 months? Yeah, great, contract someone, or a few someones, which will eat the existing team’s resources on onboarding new people, so it will actually slow things down even more :smiley:

1 Like

Then we should focus on working harder instead of working smarter.

1 Like

@Mitsos , I’m going to kindly ask you to stop or move this discussion to another topic. You’ve already made @littleskunk stop following this topic. And I can’t really blame him either. Can’t you see this is hurting more than it helps?

Let’s keep topics about specific issues constructive, providing information that helps solve the problem, and take the complaints to the existing topics that already discuss the range of open issues. I thank you for your consideration.

6 Likes

Oh yeah, from my experience, if You were to test for all possibilities, or even just test really thoroughly, You would have to make things a lot slower. What I presume is that, in the current state, things cannot be tested other than on real nodes, no matter how good a test setup You have backstage, because it is too diverse. So many different configurations. What if they did test, and everything was OK on the test side? The real tests are in real life, here. Again, we are not involved deeply enough in the process to understand the reasons why things in this development are the way they are, but You can’t accuse Storj of not wanting success, can You?
They are the ones who care the most about whether they succeed, so they are probably doing everything in their power.

Part of me fears… that somewhere in a dark corner of Storj’s test environment… is a full 8TB SMR drive plugged into the USB2 port of an RPi. And it just… never… finishes… the filewalker… before a new build is loaded.

Its whole life… toiling away… 100% IOWAIT, until the drive eventually dies.

Actually who am I kidding: I’m sure their dev systems are all-flash! :wink:

1 Like

It’s not that we’re doubting Storj wants success. These doubts are of a different nature.

1 Like

I don’t think so. Real life is… you make a change to the code, you run some very basic tests at best, the code is rolled out, and the customer does the real testing.

1 Like

Two of my replies before the one you quoted were directly related to the topic (uncollected garbage), and two others were suggestions to use CI to improve code deployment.

But duly noted: if you think I’m going off topic, ask someone to move my replies out.

1 Like

I DID receive a few BFs in the last few days. But it looks like all these BFs are just re-sends of old BFs that my nodes already processed back in June:

|2024-07-07T12:51:07+03:00|INFO|retain|Prepared to run a Retain request.|{cachePath: C:\\Program Files\\Storj\\Storage Node/retain, Created Before: 2024-06-09T20:59:59+03:00, Filter Size: 6918625, Satellite ID: 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE}|
|2024-07-07T12:51:07+03:00|INFO|retain|Prepared to run a Retain request.|{cachePath: C:\\Program Files\\Storj\\Storage Node/retain, Created Before: 2024-06-13T20:59:59+03:00, Filter Size: 8959048, Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S}|
|2024-07-07T12:51:07+03:00|INFO|retain|Prepared to run a Retain request.|{cachePath: C:\\Program Files\\Storj\\Storage Node/retain, Created Before: 2024-06-18T20:59:59+03:00, Filter Size: 942802, Satellite ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs}|

They are still scanning /blobs/ right now, but they are NOT moving anything to /trash/, simply because they have already been processed once before and all files matching the filter have already been moved to the trash. Moreover, those files have even been permanently deleted already (because it has been more than 7 days since they were collected).
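
If anyone wants to check the same thing on their node, a rough way is to grep the retain summary lines next to the requests. The log path is an assumption and the exact message wording may differ between versions.

grep "Prepared to run a Retain request" /path/to/storagenode.log | tail -n 5
grep "Moved pieces to trash during retain" /path/to/storagenode.log | tail -n 5

If the "moved" counts for these re-sent filters are close to zero, that matches what I describe above.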

Yes, it works

root@9fb28e5cf48a:/app# wget -O - http://localhost:45311/mon/ps
...
[4760270949382279994,5864136913576511724] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 229h20m23.1369893s, orphaned)
 [7692339406170535857,5864136913576511724] storj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite() (elapsed: 229h20m23.136719s)
  [7256014115353804075,5864136913576511724] storj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite() (elapsed: 65h26m28.1387957s)
   [8722048343747932007,5864136913576511724] storj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run() (elapsed: 65h26m28.1387384s)
...
1 Like

Interesting, my node received a more recent BF

$ grep "Prepared to run a Retain request" /mnt/w/storagenode5/storagenode.log
2024-07-05T20:51:20Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-06-29T17:59:59Z", "Filter Size": 950195, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-06T07:54:28Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-07-02T17:59:59Z", "Filter Size": 95294, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}

but only for two satellites so far.

Did you restart the node in between these gaps? Because yesterday I killed the TTL collector on one of my nodes just by a simple restart.

I have already created a bug report for this on GitHub: Storagenode TTL piece collector stops working completely if its previous run was interrupted by node restart · Issue #7042 · storj/storj · GitHub
Please check your logs. Were there errors (even a simple “context canceled”) due to the interruption of the TTL collector by a node restart, similar to those described in the GH issue?
And if so, add your logs there too; a quick grep sketch follows below.
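
Something like this should surface the relevant entries quickly (the log path is an assumption and the wording of the messages may vary between versions):

# interruptions and errors from the TTL collector
grep -i "collector" /path/to/storagenode.log | grep -iE "error|context canceled" | tail -n 20

# recent collector activity, to see whether it is still deleting expired pieces
grep -i "collector" /path/to/storagenode.log | grep -ivE "error" | tail -n 20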

If not, then most likely there were no stops in the collector’s work and it is still working, just VERY slowly.
If you look at your log more closely, you can see that the size of the “gaps” is clearly proportional to the number of deleted files. And the next instance of the collector did not start simply because the previous one had not completed its work yet.