SNOs are service providers: we don’t pay Storj like customers do, we get paid. And payouts have been correct month after month. If we don’t feel we’re being paid adequately, we stop providing our service (and find another project).
I have the same problem: 71 TB in trash and 166 TB used. Ext4, 2 nodes on a single HDD (I know that in the “new Storj reality” we shouldn’t run more than 1 node per HDD, but we’ve been here for years and we have what we have). It’s been a while since my nodes last completed GC on the US satellite, which is why I have so much data that should have been removed. What I really don’t understand is why the devs waited until yesterday to roll out version 1.105.4 to Docker (this version fixes a critical bug where retain files were removed when the node restarted). It’s obvious that many SNOs restarted their nodes while the tests were running, to adapt to the increased data flow, and lost their retain files because of that bug. If someone like me has a lot of uncollected garbage (counted as used space, not in the trash directory) and hasn’t manually updated to 1.105.4 (I did a couple of weeks ago), it will be another month before their nodes clean themselves up. If I understood @littleskunk correctly, this may decrease network throughput.
The satellite payout data is actually quite useful, especially now when we are being asked to bring new capacities online.
That is probably the only reliable source of income predictions, which in turn enables SNOs to make purchasing decisions.
Take me, for example: I’m now flying totally blind, as calculations predicting income based solely on used space are quite useless, especially with the current GC issues, where unpaid data on nodes might not be trashed for months, skewing the income predictions dramatically.
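Just to illustrate how much this can skew things (the $1.50/TB-month storage rate here is only an example figure, check the current payout terms for the real one): if a node reports 20 TB used but 5 TB of that is uncollected garbage that will never be paid, the naive estimate comes out about 33% too high.

```
naive estimate:    20 TB × $1.50/TB-month = $30.00
actually payable:  (20 − 5) TB × $1.50/TB-month = $22.50
```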
Now there is another unknown with the Saltlake test data, how much of that is actually being accounted for?
So I guess this should be one of the priorities: making this data reliable.
I like the project, and some criticism isn’t bad. I just wanted to know the things I mentioned. Maybe it’s a fault on my end that I don’t quite understand it.
I like the project and will continue to support it. Maybe it sounded more aggressive than it should have. Sorry if that was the case.
Maybe you should slow down a bit in defending every developer’s failure. This is getting ridiculous. Stop being a fanatic.
It’s actually the Old Way of Doing Things™, see https://www.storj.io/legal/supplier-terms-conditions:
4.1.4.1. Have a minimum of one (1) hard drive and one (1) processor core dedicated to each Storage Node;
The last GCs from all the satellites ran a couple of weeks back. Your GCs can’t finish because you are overloading your nodes, against all advice.
Since we recently had successful GCs on all satellites, and the bloom filters have been made larger because big nodes couldn’t keep up with the deletes (pieces were missed because the blooms were too small), I don’t think we’ll see “months” of untrashed data again. A week, sure, but that’s nothing in the grand scheme of things.
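For anyone wondering why an undersized bloom filter misses deletes, here is a rough sketch (the numbers and parameters are made up for illustration, they are not Storj’s actual ones). The standard false-positive estimate for a filter with m bits, n elements and k hash functions is (1 - e^(-kn/m))^k, and in GC terms a false positive is a garbage piece that still matches the filter and therefore never gets trashed:

```go
package main

import (
	"fmt"
	"math"
)

// Standard Bloom-filter false-positive estimate for m bits, n elements
// and k hash functions: (1 - e^(-k*n/m))^k. For GC, a false positive is
// a garbage piece that still matches the filter and is NOT trashed.
func falsePositiveRate(mBits, n, k float64) float64 {
	return math.Pow(1-math.Exp(-k*n/mBits), k)
}

func main() {
	// Illustrative numbers only: a filter sized for 10M pieces at ~10% FPR
	// using k = 3 hashes needs roughly 48M bits.
	mBits, k := 48e6, 3.0

	fmt.Printf("10M pieces: %.0f%% of garbage survives\n", 100*falsePositiveRate(mBits, 10e6, k))
	fmt.Printf("30M pieces: %.0f%% of garbage survives\n", 100*falsePositiveRate(mBits, 30e6, k))
}
```

With the same filter stretched over three times as many pieces, roughly 60% of the garbage matches by accident and stays on disk, which is why filters had to grow along with the nodes.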
All, except a single 10x64MB test.
My GCs can’t finish because:
- We didn’t have the “save-state-resume feature for GC filewalker” until May (April?) 2024.
- The “save-state-resume feature for GC filewalker” actually wasn’t working until version 1.105.4, because of a bug that removed retain files on restart.
I am pretty sure my nodes will be okay now that these bugs have been fixed.
I wouldn’t say it’s a developer failure. I would more likely say it didn’t have that priority. They did a great job increasing performance and so on. But the metrics weren’t fixed for about 2 (or 3) months. And the problem with wrong data is not unknown.
Then I don’t know what the deal is.
One node is showing 4.56 TB of SLC data, while the satellite reports only 3.07 TB of SLC data stored on that particular node.
Another node shows 7.66 TB of SLC data (after some of it was removed today), while the satellite reports 5.62 TB on that node.
But maybe the satellite tallies lag behind.
Replying to myself: 10am EST ingress burst starting again - going on 4 days now. Hooray for test data!
I can see GC deleting massive amounts of 1-2 week old data from SLC. This data should have a 30-day TTL, so how is this possible?
My guess is leftovers from lost races.
I am not 100% sure yet, but it looks like the uploads didn’t have enough randomness in the filenames and some uploads got overwritten. We fixed that yesterday, and the dashboard that previously wasn’t reflecting the expected increase is finally showing some increase. Tomorrow the dashboard will give me the final answer. We might have more than one problem, so I wouldn’t rule out additional bugs in bloom filter creation etc. Maybe watch for audit errors and track down whether the corresponding piece was deleted by garbage collection. That would be a good indicator that we have a bug in that area as well.
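In case it helps to picture the kind of fix described above, here is a minimal sketch of adding randomness to generated object keys so concurrent test uploads can’t land on the same key and overwrite each other (the key layout and names are hypothetical, not the actual test tooling):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomSuffix returns n random bytes, hex-encoded. Appending this to a
// generated object key makes key collisions (and therefore silent
// overwrites) practically impossible.
func randomSuffix(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

func main() {
	// Hypothetical key layout for a load-test uploader.
	key := fmt.Sprintf("load-test/worker-42/%s", randomSuffix(8))
	fmt.Println(key) // e.g. load-test/worker-42/3fa1c09e55d2b7a4
}
```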
For now I will hope for the best. The best outcome in this situation would be that there was only one bug and we have already fixed it: tomorrow and on all the following days the dashboard shows the increase we need and everything is on track. In that case, all we have to do is correct the payout a bit to compensate for this setback. I am not the person to decide that, but I will bring up the topic in the next meeting. Getting back with a decision will take a moment; it isn’t the highest priority.
This could be the case only in one situation - you have deleted the databases. Otherwise, TTL-expired data will not be collected by GC.
- the filewalkers didn’t finish successfully
- the databases are corrupted or locked when an update is required
- Average Disk Space Used This Month is not updated by the satellites (but you may also compare the actual used space on the disk with the pie chart of the usage)
I’m sorry, but for that you need to be a customer. You are a supplier, I’m sorry!
Do you want to help? We are hiring (see Careers), or you may submit a PR - our devs are happy to accept community contributions!
Unfortunately, @Roxor is right. They know how it works. You don’t need to be a fanatic, just a respectful community member, to post suggestions.
Unfortunately, @Mitsos is right too: your GC has failed on your setup. So we need to fix your setup to let it finish.
You are right. The SLC satellite has issues finishing the tally in time. However, the gap will be fixed before the payout period starts.
Submit a PR for what? The devs have had a version with the fix for a few weeks but, for some reason, forgot to roll it out to the nodes. Really?