On drive space issues for SNOs

We are aware of multiple issues related to drive space inconsistencies. The engineers are currently heads down working on them.

A fix for the TTL issue, where expired data is cleared but ingress does not resume, is being tested now.

It looks like the test uploads will be paused over the holiday to allow garbage collection to catch up. As I understand it, they are discussing this tomorrow.

Various other conversations are going on around work and ideas to address the additional issues that SNOs are seeing.

The team cares about these issues and is working to resolve them as soon as possible. We ask for your patience, as testing and changes take time to make sure they don’t introduce other unforeseen issues.

If you have any general concerns about drive space being freed up, garbage collection, or other delete processes, please comment here, especially if you feel there are areas the team is not aware of.

Thank you for being SNOs and for participating and working with us as we stress test parts of Storj.


Garbage collection doesn’t benefit from an upload pause. Repair is what competes with the current uploads: pausing the uploads would allow the repair worker to drain the repair queue faster.

Edit: The rest of your statement is correct. It will be discussed tomorrow. I would give it a 50/50 chance.


Right on. Thanks for the correction.

So… some nodes don’t have space, but think they do. And other nodes do have space but think they don’t?


Thank you for the update; I share the frustrations.

Hopefully the complaints include:

  1. Large amounts of data considered deleted by the satellite but still counted as “used” by our nodes, aka “uncollected rubbish”. Speculated to be a problem with bloom filters that are too small or not sent often enough (see the sketch after this list).

  2. Filewalkers failing on many setups (I’ve seen it with just “context canceled” as the only error message).

  3. More of a design choice than a bug, but trash remains on the node too damn long. Between the delay before it is garbage collected plus a week of retention in trash, it seems excessive.
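
On point 1, some context on why filter sizing matters: garbage collection trashes only the pieces the bloom filter does not match, so every false positive is a deleted piece the node keeps. Here is a quick sketch using the standard false-positive formula; this is generic bloom filter math with made-up numbers, not Storj’s actual filter parameters:

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate is the classic bloom filter estimate
// p ≈ (1 - e^(-k*n/m))^k for n inserted items, m bits and k hash functions.
func falsePositiveRate(n, m, k float64) float64 {
	return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
	// Illustrative only: a filter sized for ~10M pieces (10 bits/piece,
	// 7 hashes) that has to represent a node holding 10M, 20M or 40M pieces.
	const mBits = 100_000_000.0
	const kHashes = 7.0
	for _, pieces := range []float64{10e6, 20e6, 40e6} {
		p := falsePositiveRate(pieces, mBits, kHashes)
		fmt.Printf("pieces=%2.0fM  kept-garbage rate ≈ %4.1f%%\n", pieces/1e6, p*100)
	}
}
```

At the intended size roughly 1% of garbage survives each pass, but around 64% survives if the node holds four times what the filter was sized for, which would look exactly like large amounts of uncollected rubbish.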

I’d just like to add that I have one node that shows up smaller on the multinode page than it does on the node page - Windows 10 Pro x64 - node v1.105.4.

It’s a 14TB drive solely for the node.
It shows 1.83TB free of 12.7TB in Windows Explorer (the 12.7TB total is most likely just decimal vs binary units: 14 × 10¹² bytes ≈ 12.7 TiB, which Explorer labels “TB”).
The node page and the multinode page for the node each show different figures, neither of which corresponds to the actual data in use.

The only oddity with the config I can see is that, up until an hour ago, it had 12.7TB (i.e. 100%) allocated, which I have since corrected.

The config.yaml file could be an old version, as it does NOT include

storage2.piece-scan-on-startup: true

and if I add it to the config, the node refuses to start.
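
For comparison, in a config.yaml that does recognize the option, it sits as a flat dotted key at the top level like every other storagenode setting; leading whitespace would turn it into nested YAML and could be one reason the node refuses to start (that is a guess, the actual startup error would tell). A minimal excerpt, with only the last line being a real setting:

```yaml
# excerpt of a storagenode config.yaml (comments are mine, for illustration)
# options are flat dotted keys starting at column 0
storage2.piece-scan-on-startup: true
```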

  • Roll the fixes out faster to the Docker nodes. Make storagenode-only versions if necessary. It should not take weeks for fixes to reach them.

  • If fixes require a full used-space filewalker run to correct the numbers, nodes may not be able to do so until the save-state-and-resume feature has been deployed: https://review.dev.storj.io/c/storj/storj/+/12806 . So don’t delay this feature any further (rough sketch of the idea after this list).
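
For anyone unfamiliar with that change, the idea is simply that the used-space walk periodically persists how far it got, so a restart resumes instead of rescanning everything. A minimal sketch of the general idea, with a hypothetical checkpoint file and helper names, not the linked change’s actual implementation:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

const checkpointFile = "walker.checkpoint" // hypothetical name

// loadCheckpoint returns the last fully counted path and the running total,
// or zero values if no checkpoint exists yet.
func loadCheckpoint() (last string, total int64) {
	b, err := os.ReadFile(checkpointFile)
	if err != nil {
		return "", 0
	}
	fmt.Sscanf(string(b), "%s %d", &last, &total) // sketch: breaks on paths with spaces
	return last, total
}

// walkUsedSpace sums file sizes under root, resuming from the checkpoint.
// filepath.WalkDir visits entries in lexical order, so the resume point
// is reached deterministically across runs.
func walkUsedSpace(root string) (int64, error) {
	last, total := loadCheckpoint()
	seen := last == "" // no checkpoint: count everything
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		if !seen {
			seen = path == last // skip files counted before the interruption
			return nil
		}
		info, ierr := d.Info()
		if ierr != nil {
			return ierr
		}
		total += info.Size()
		// persist progress; a real walker would batch this, not write per file
		return os.WriteFile(checkpointFile, fmt.Appendf(nil, "%s %d", path, total), 0o644)
	})
	return total, err
}
```

With something like this deployed, a node restarted mid-scan would continue from the checkpoint instead of starting the multi-hour walk over from zero.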


This one is explained there: