Two weeks working for free in the waste storage business :-(

There might be additional reasons too:

Not really, no. Yes, there have been other issues, and some haven’t been picked up with much priority. The big difference is that those issues were either being actively worked on or affected a much smaller subset of nodes, with far lower impact. This issue affects all nodes and has caused immense space overuse from garbage data for everyone.

I appreciate you keeping them sharp on other things, though. But I don’t share your general view that these issues are being mismanaged, which is why my post says this is uncharacteristically bad.

2 Likes

Where can I get this script?

1 Like

Thanks for the fast reply;
Here we go guys:
–node1:

July 2024 (Version: 14.0.0)                                             [snapshot: 2024-07-07 12:56:51Z]
REPORTED BY     TYPE      METRIC                PRICE                     DISK  BANDWIDTH        PAYOUT
Node            Ingress   Upload                -not paid-                      705.68 GB
Node            Ingress   Upload Repair         -not paid-                       19.91 GB
Node            Egress    Download              $  2.00 / TB (avg)               50.43 GB       $  0.10
Node            Egress    Download Repair       $  2.00 / TB (avg)               29.72 GB       $  0.06
Node            Egress    Download Audit        $  2.00 / TB (avg)               17.44 MB       $  0.00
Node            Storage   Disk Current Total    -not paid-             6.13 TB
Node            Storage              ├ Blobs    -not paid-             4.87 TB
Node            Storage              └ Trash  ┐ -not paid-             1.26 TB
Node+Sat. Calc. Storage   Uncollected Garbage ┤ -not paid-           870.91 GB
Node+Sat. Calc. Storage   Total Unpaid Data <─┘ -not paid-             2.13 TB
Satellite       Storage   Disk Last Report      -not paid-             4.00 TB
Satellite       Storage   Disk Average So Far   -not paid-             3.99 TB
Satellite       Storage   Disk Usage Month      $  1.49 / TBm (avg)  866.34 GBm                 $  1.29
________________________________________________________________________________________________________+

–node2:

July 2024 (Version: 14.0.0)                                             [snapshot: 2024-07-07 13:00:16Z]
REPORTED BY     TYPE      METRIC                PRICE                     DISK  BANDWIDTH        PAYOUT
Node            Ingress   Upload                -not paid-                      714.00 GB
Node            Ingress   Upload Repair         -not paid-                       19.87 GB
Node            Egress    Download              $  2.00 / TB (avg)               50.33 GB       $  0.10
Node            Egress    Download Repair       $  2.00 / TB (avg)               18.04 GB       $  0.04
Node            Egress    Download Audit        $  2.00 / TB (avg)                6.41 MB       $  0.00
Node            Storage   Disk Current Total    -not paid-             4.39 TB
Node            Storage              ├ Blobs    -not paid-             4.13 TB
Node            Storage              └ Trash  ┐ -not paid-           255.45 GB
Node+Sat. Calc. Storage   Uncollected Garbage ┤ -not paid-           591.25 GB
Node+Sat. Calc. Storage   Total Unpaid Data <─┘ -not paid-           846.70 GB
Satellite       Storage   Disk Last Report      -not paid-             3.54 TB
Satellite       Storage   Disk Average So Far   -not paid-             3.53 TB
Satellite       Storage   Disk Usage Month      $  1.49 / TBm (avg)  733.07 GBm                 $  1.09
________________________________________________________________________________________________________+
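
In case it helps to read these two reports: the calculated rows look like simple arithmetic on the lines above them (my reading of the output, not an official definition):

Uncollected Garbage ≈ Blobs - Disk Last Report      (node1: 4.87 TB - 4.00 TB ≈ 870.91 GB)
Total Unpaid Data   = Trash + Uncollected Garbage   (node1: 1.26 TB + 870.91 GB ≈ 2.13 TB)

Node2 works out the same way: 4.13 TB - 3.54 TB ≈ 591.25 GB of uncollected garbage, and 255.45 GB + 591.25 GB = 846.70 GB unpaid.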

I didn’t know that we might pause BF generation like this. Perhaps there is a bug. I pinged the team as well.

My nodes are “only” about half garbage as well: 5.27 TB out of 9.28 TB, to be precise… :man_facepalming:

5 Likes

@Alexey - I want to avoid any negativity - but this is what we have been calling out for weeks now, having ruled out all other causes.

Why are we only now approaching developers?

Thanks
CC

Even with no bloom filter, SLC should still delete the pieces by TTL.
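
If you want to sanity-check the TTL backlog yourself, something along these lines should work against the piece expiration database (a rough sketch, not an official procedure: the path, table, and column names are from memory and may differ per version, and querying a database that a running node is using can hit locks, so ideally run it against a copy):

# count pieces whose TTL has already passed but that are still tracked
sqlite3 /path/to/storage/piece_expiration.db \
  "SELECT count(*) FROM piece_expirations WHERE piece_expiration < datetime('now');"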

Can you maybe check /mon/ps output? I noticed that some of my nodes are hours behind. For example this one:

[792702298643105108,590377817016879825] storj.io/storj/storagenode/collector.(*Service).Collect() (elapsed: 9h5m52.90537853s)
 [995026780269330392,590377817016879825] storj.io/storj/storagenode/pieces.(*Store).GetExpired() (elapsed: 9h5m52.905361937s)
  [1197351261895555675,590377817016879825] storj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired() (elapsed: 9h5m52.905363011s)
   [8327577142405913580,590377817016879825] storj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0() (elapsed: 110.590232ms)
    [8529901624032138863,590377817016879825] storj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat() (elapsed: 110.576307ms)
     [8732226105658364147,590377817016879825] storj.io/storj/storagenode/blobstore/filestore.(*Dir).Stat() (elapsed: 110.573696ms)

Because there are a few very loud people in the community who call every bug highest priority. If everything gets highest priority, then nothing gets fixed. It is as simple as that. It would help if we could downgrade a few bugs and just tolerate them for now, to make sure the developers have more time to work on the important problems. At some point it is just too many context switches.

The other problem is that the community is becoming a hostile place. We are fixing bugs, but the same loud people keep insisting that the developers are doing something wrong. What do you think is going to happen? Every human being will just stop reading such demotivating speech, so the developers will simply stay away from the forum at some point, and we lose the healthy communication line we had before.

8 Likes

I think the devs are doing a great job! :heart_eyes:

When a disk fills: I expand. If some of that disk is trash: don’t care: that’s a problem that will be fixed: still expanding.

2 Likes

We could still spend time on filing a good bug report with as many details as possible. There is some middle ground.

1 Like

And that is exactly why I pick my battles and don’t turn one larger issue into “everything is horrible and you suck”. I don’t think that’s the case at all and I recognize and see the progress being made. But I think this is one worth fighting.

I don’t currently have my debug port open in the docker container and I don’t want to restart my node right now. But here is what I can tell you: my node is hard at work deleting expired data (I still have debug logging on, which is probably a bad idea for IO right now, but that’s how I know). I also know this node already had 5.5 TB of uncollected garbage before the TTL cleanup even kicked in. So the vast majority of it is not from being behind on TTL cleanup, but up to about 2.5 TB might be.

It also shows quite significant gaps where the collector doesn’t run. Is this expected behavior?

2024-06-30T02:21:22Z    INFO    collector       collect {"Process": "storagenode", "count": 1910}
2024-06-30T03:25:55Z    INFO    collector       collect {"Process": "storagenode", "count": 1915}
2024-06-30T04:17:29Z    INFO    collector       collect {"Process": "storagenode", "count": 942}
2024-06-30T07:21:49Z    INFO    collector       collect {"Process": "storagenode", "count": 1676}
2024-06-30T08:14:58Z    INFO    collector       collect {"Process": "storagenode", "count": 474}
2024-06-30T09:36:13Z    INFO    collector       collect {"Process": "storagenode", "count": 7556}
2024-06-30T10:49:50Z    INFO    collector       collect {"Process": "storagenode", "count": 12996}
2024-06-30T11:43:41Z    INFO    collector       collect {"Process": "storagenode", "count": 5698}
2024-06-30T12:38:14Z    INFO    collector       collect {"Process": "storagenode", "count": 6844}
2024-06-30T13:35:05Z    INFO    collector       collect {"Process": "storagenode", "count": 7032}
2024-06-30T14:37:16Z    INFO    collector       collect {"Process": "storagenode", "count": 7437}
2024-06-30T19:33:53Z    INFO    collector       collect {"Process": "storagenode", "count": 80966}
2024-07-01T14:57:10Z    INFO    collector       collect {"Process": "storagenode", "count": 333967}
2024-07-05T13:08:21Z    INFO    collector       collect {"Process": "storagenode", "count": 1453569}
2024-07-05T14:37:29Z    INFO    collector       collect {"Process": "storagenode", "count": 144170}
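
For reference, a list like the one above can be pulled with something along these lines (assuming a docker node named storagenode and the default log format; adjust the container name or log path to your setup):

# show the collector's completed runs and how many expired pieces each removed
docker logs storagenode 2>&1 | grep collector | grep '"count"'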

By default the storage node will open a random debug port. I don’t run a docker node. Is it possible to exec into it? That way you can still get the output even with no port forwarding.
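
For a docker node, something like this should work without a restart or a published port (a sketch, assuming the image ships a shell and wget, and that you can find the random port the debug server picked, e.g. from the startup log; if not, you can pin a fixed port via the debug address config, debug.addr if I recall the key correctly, but that needs a restart):

# PORT is a placeholder for whatever port the debug listener picked inside the container
docker exec -it storagenode /bin/sh -c 'wget -qO- http://127.0.0.1:PORT/mon/ps'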

@littleskunk - I totally get what you’re saying; I run a dev team. I don’t think I have been loud, hostile, or demotivating, in case that’s what you feel - all my posts have been respectful.

Equally - this may not be a bug, but a lack of bloom filters, right? Either way, this is pretty crucial to the whole Storj mantra, or reason for being, i.e. usable disk space. It also weakens any upgrade or expansion discussion if space is being knowingly wasted.

It kinda feels important to at least triage, don’t you think? I am at your disposal to help too; just tell me what other steps or information I can furnish you with.

Thanks
CC

1 Like

Bugs/features that have a direct effect on customers are the highest priority. You can’t feed the company with no customers.
Bugs/features that have a direct effect on SNOs are the 2nd highest priority. You can’t get more customers if there are no SNOs to serve them.
Bugs/features that have an indirect effect on customers are the 3rd highest priority. Sure, it would be nice to fix them as soon as possible, but we can kick the logo change down the line a bit.
Bugs/features that have an indirect effect on SNOs are the 4th highest priority. Good to fix them, not directly affecting the network performance/operation.
Every bug/feature that doesn’t fall into any of those categories (e.g. the satellite reporting how many active nodes there are) goes to the end of the line.

Currently this isn’t happening, which is why the loud people are getting louder and louder. Every single thing we have reported in the past year gets a “low priority because we have other things to deal with”. At some point one does have to wonder: if none of this is being worked on, what exactly is being worked on?

Allow me to correct you there. You are fixing bugs that should have been fixed a long time ago; that’s not the same as actively fixing bugs. This, as I said, is why communication is breaking down. SNOs will simply give up on reporting bugs if everything gets pushed down the line more and more, and just let the network collapse, with Storj wondering “why is the network full?”.

Sorting the GitHub open issues page by oldest shows that the oldest open bug is from 2019. In 5 years nobody could find the time to either close it or work on it?

But I have the bad habit of giving people the benefit of the doubt. Let’s say the devs are so busy that they barely even find the time to sleep. Doesn’t that raise a few red flags? It sounds to me like more devs should be hired, but that’s what I would personally do; I’m not saying you should do it.

3 Likes

Time to stop following this thread myself. Have fun complaining.

2 Likes

See ya around, take care.

2 Likes

What is the actual solution?

The actual solution is to get the storagenode part working as it should, then get the satellites to report the correct data (with respect to tracked pieces), which should fix timely bloom filter generation, which loops back to the storagenode part. Uncollected garbage starts getting collected as it should, space is freed and tracked, and everyone is happy again.

My $0.02. I’m sure others will disagree on the order of these events, and I’m cool with that.

1 Like

Need a fix. Soon. People around here are starting to think that you are not able to run the storage node side of the business… word is spreading :slight_smile:
It is not acceptable, and it compromises the credibility of the project.