Another issue with trash

Now what is going on here?

Node version: 1.102.3
/storage/trash/.trash-uses-day-dirs-indicator is present

But no date directory:

/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2a

at least on that satellite.

Same node other satellite:
/storage/trash/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/2024-04-23/2a
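(For anyone who wants to check this across satellites, here is a rough sketch, assuming the same /storage/trash layout as above and GNU find on the host; adjust the path for your own setup:)

    # list satellite trash folders that have no date-named day directories yet
    for sat in /storage/trash/*/; do
        if ! find "$sat" -mindepth 1 -maxdepth 1 -type d -name '20??-??-??' | grep -q .; then
            echo "no day directories under: $sat"
        fi
    done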

Has this node been downgraded to 1.99 in between?

2 Likes

I cannot answer that. I don’t have that data and I don’t remember what version it had when I checked before.
It might be that it got restarted at some point, was involuntarily downgraded, and then upgraded again.

I would suggest checking it again after the upgrade to 1.104.x.

I have the same problem on all nodes with the trash folder for us1. 2 of the nodes are on 1.104.5 and it didn’t change.

The trash folder for this satellite used to have a mix of old and new style folders, but now it only has the old style folders. On one node it’s completely empty.

Is that version supposed to heal that problem?

Doesn’t look like it. I have 2 nodes on v1.104.5 which still have the problem.

It might be possible that the node downgraded at some point. I would suggest you move the trash for the satellite to a new folder named with the current date (see the consolidated sketch after the steps below):

  1. rename the satellite trash folder to a temp name

    mv /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-2024-05-17
    
  2. create a new folder for the satellite in the trash:

    mkdir -p /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa
    
  3. move the temp folder into the new satellite folder as a day directory named with the current date:

    mv /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-2024-05-17/ /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-05-17/
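
(For reference, the same three steps as one rough sketch; SATID and TODAY are just placeholder variables here, and you would substitute the satellite folder and current date for your own node:)

    SATID=ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa
    TODAY=$(date +%F)                                             # e.g. 2024-05-17
    mv /storage/trash/$SATID /storage/trash/$SATID-$TODAY         # step 1: rename to a temp name
    mkdir -p /storage/trash/$SATID                                # step 2: recreate the satellite folder
    mv /storage/trash/$SATID-$TODAY /storage/trash/$SATID/$TODAY  # step 3: move it back in as a day directory

Same effect as steps 1-3, just less typing if you have to repeat it for more than one satellite or node.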
    
2 Likes

For those of us running multiple nodes, this fix is too much manual work.
This needs to be fixed automatically.

1 Like

@clement
Here is my automated fixing solution (a rough node-side sketch follows the list):

  • Pause creation and sending of bloomfilters for 7 days.
  • After 7 days wipe the trash completely from all nodes
  • Create new folders per satellite on the nodes in the trash directory
  • Start over with creation and sending of bloomfilters.
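
(If it came to that, the node-side part of the second and third bullets could probably be scripted; a rough sketch, assuming the node is stopped first and /storage/trash is the trash root - not an official procedure:)

    # empty every satellite trash folder but keep the folders themselves
    # (the .trash-uses-day-dirs-indicator file at the trash root is untouched)
    for sat in /storage/trash/*/; do
        find "$sat" -mindepth 1 -delete
    done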
3 Likes

I’m gonna have to agree with this, @clement . If the downgrades are indeed what caused this, it can’t be up to SNOs to fix it. And it really sucks that despite warnings from us (including me) about the wrong minimum version on versions.storj.io very early in that upgrade cycle, you did nothing to prevent those downgrades. And in fact, the same issue happened again on the next rollout. Btw, if it happens again when rolling out 1.105, downgrading nodes from 1.104 will instantly break them due to the database migration. So I certainly hope this is now permanently fixed.

The downgrades weren’t caused by SNOs. And for every SNO passionate enough to pay attention and post here, there will be at least 5-10 more not paying attention and left with trash that will never be removed. So this cleanup is up to you now. Though please find an approach that doesn’t involve stopping bloom filters. We still need those badly, as many nodes still have large amounts of uncollected trash.

9 Likes

Many SNOs don’t speak up, but we noticed the issues jammerdan raised, and personally I agree with him.

2 Likes

The problem I see - and this is why I had suggested it - is that it all started with the botched migration. We have so many different constellations of subfolders at various levels: some migrations partly successful, others not at all, mixing here and mixing there, so it would take months to find a coding solution for that mess.
But the migration is a one-time thing. Once all nodes are on 1.104 or later, they will all be doing it the new way.
So it might mean fewer headaches and less coding to halt the bloomfilters, wait out the 7-day deletion grace period, and then delete everything that is in the trash and start fresh.

So instead of having this cleared up in months, maybe it could be done in a couple of weeks.

Is there any work in progress to clear out the mess in the trash folders, with their various mixed stages of the old and new trash folder layouts?
I have just discovered another node with such a mixed situation that is accumulating trash and can’t delete it. Because of this, the node cannot receive any ingress, as it is full.

I cannot see a ticket for this being worked on on GitHub, so I wonder what the status of the progress is.

Although I fully agree with everything you said, why not stop bloomfilters until all nodes are on a safe version (i.e. one using the new trash directories), as suggested? We’ve gone months without BFs before. Most of the uncollected trash on the nodes would have gone unnoticed if we hadn’t increased the BF size anyway.

Personally, I don’t get all this “it must be done and it must be done now!” urgency/attitude. I think everybody should relax and take a (very) deep breath. The network isn’t going anywhere. It will still be here tomorrow, it will still be here next week, and it will still be here next month.

I still have nodes on 1.104.1 that have not finished used-space. One of these was restarted 1 week 5 days ago. It can’t finish used space (lazy) because it keeps getting hammered with GC/trash-cleanup/bandwidth testing. I had to go in and manually renice the used-space in order for it to finish before the next update cycle.
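
(For illustration only, renicing it by hand might look roughly like this, assuming the lazy filewalker shows up as a separate used-space-filewalker subprocess and you have root; exact process names can differ per setup:)

    # raise CPU and IO priority of the running used-space filewalker subprocesses
    for pid in $(pgrep -f used-space-filewalker); do
        sudo renice -n 0 -p "$pid"
        sudo ionice -c 2 -n 0 -p "$pid"
    done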

The prospective clients haven’t signed up yet => stop pumping out 2 GC per satellite per week.
Nodes haven’t finished their MANDATORY used-space when upgrading to 1.104 => leave them be and give them some time to work on it.

I have 27TB of trash that is growing daily. The nodes can’t keep up with deleting. Some nodes are still working on 2024-05-03. Yes, the third of May. => this ties back to GC.

I would suggest that we give the network a bit of time to relax, but last time I suggested this the answer was somewhere along the lines of “if we don’t do it NOW, aliens will invade and enslave us all. The dead will walk and a deadly plague will sweep over the land”.

PS
For anyone about to say “potato nodes”: don’t even go there.

Yes, but we didn’t have the large amounts of deletes during those months. It’s been piling up recently. And it also doesn’t go unnoticed anymore.
Though to be fair, most of it has been cleaned up by now.

At the same time, testing requires that old data be cleaned up. So I understand the urgency. I’m not seeing the impact you mention from the GC filewalker, but then again, I’m also not using the lazy one (I can’t on Synology).

I guess it’s not possible to run a node on less than the current minimum version, which is 1.104.5: https://version.storj.io/
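(You can check the published minimum and suggested versions yourself; for example, something like this should dump the version JSON from that endpoint:)

    curl -s https://version.storj.io/ | python3 -m json.tool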

The nodes are running fine; that’s not the issue. The issue is that if the node deletes (to pull numbers out of thin air) 1TB per day while trash is growing by 2TB per day, then yes, trash will be collected eventually, but by the time that is done, the next update cycle will need to start. That wastes 2 weeks of used-space running for nothing, since it will have to start over after the update.

The other scenario is me saying “I don’t care about customer experience, I only care about my payout, so I’ll disable lazy filewalkers and the customers get any leftover scraps of IO”.

I think once the bug with the trash stats not updating is fixed, the used-space filewalker will be needed less frequently, unless you have issues with a database or you use the same storage for your own data.

Wait, what???
1.104.1 had this storagenode/pieces: update used space on trash-lazyfilewalker completion · storj/storj@d68abcf · GitHub

Are you saying that the issue is still not fixed? I’m seeing correct(ish) space reported after used-space completes.