Trash used space too big

wildraven · June 23, 2020, 4:33pm

On one node the trash is 160 GB and on another it is 120 GB. What can we do about that data? Can we delete it?

littleskunk · June 23, 2020, 4:35pm

We can just wait 7 days and then the trash will get cleaned up by the storage node.

node1 · June 23, 2020, 7:39pm

Why is the time so long? 7 days of frozen space = losing the ability to get new data.

littleskunk · June 23, 2020, 8:53pm

Worst case we would need 7 days to fix garbage collection and recover the data. Without that safety feature, garbage collection has the potential to destroy the entire network with just one bad bloom filter.

Garbage collection is plan B. If you are online, the space gets cleaned up immediately. The interesting question is why you have a larger amount in the trash folder. My trash folder on a full 6 TB node is currently 1.69 KB. On a second node that is still collecting new data the trash folder is 7 GB. That is a side effect of high uptime.

JoshGarza · June 23, 2020, 9:07pm

The garbage collection launched a few days ago filled the trash folder. Mine is over 1 TB at the moment.
Next Saturday it will be empty, so don’t worry.

node1 · June 23, 2020, 9:16pm

Mine is 407 GB on a 6 TB node.

donald.m.motsinger · June 23, 2020, 9:20pm

TBH my suspicion is that stefan-benten doesn’t delete the data the normal way. My oldest node had a lot of data from that satellite with 4.9TB filled and 1.1TB in trash this morning. And this server is online virtually 100%.

I noticed today, that all subfolders in stefan-benten’s blobs folder starting with a number had ~1GB each in them, then from aa down to xz only a few MB, from ya to yz some had ~1GB and some a few MB and again from za to zz all had ~1GB.

You can’t tell me that these are leftovers from normal client delete operations. It looks like the satellite just takes a shortcut and deletes pieces in alphabetical order of their pieceIDs. I have ~250GB in the blobs folder and 830GB in the corresponding trash folder for stefan-benten. My server is online almost 100% of the time, with the odd maintenance shutdown/reboot. It’s impossible that this many delete operations failed.
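For anyone who wants to repeat the per-subfolder comparison described above, here is a minimal sketch. The function name `subfolder_sizes` and the idea of pointing it at a satellite's blobs or trash folder are assumptions for illustration; it is not part of the storage node software, and the actual folder path depends on your setup.

```python
import os


def subfolder_sizes(root: str) -> dict[str, int]:
    """Return the total size in bytes of the files under each immediate subfolder of root."""
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            # Only look at real directories (e.g. the two-character prefix folders).
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    total += os.path.getsize(os.path.join(dirpath, name))
            sizes[entry.name] = total
    return sizes
```

Sorting the result by folder name makes a pattern like the one above (large folders for some prefixes, nearly empty ones for others) easy to spot.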

JoshGarza · June 23, 2020, 9:22pm

It was a data wipe, of course. 1TB of zombie segments is not possible.

I have a single folder in /trash with 1TB filled with aa… xz folders

litori · June 23, 2020, 9:28pm

Is trash paid or unpaid during the 7 days it sits in the trash folder?

litori · June 23, 2020, 10:12pm

I am going to have to assume that we are not being paid for the trash sitting in the trash folder since Storj is not getting paid by the customer once they request the file to be deleted from the nodes.

From reading the Garbage Collection design, it seems that the bloom filter is not 100% accurate and can generate false positives. So Storj is making SNOs store trash for up to 7 days in the hope that later bloom filters catch any false positives and restore any files that were moved to trash by accident.

Correct me if I’m wrong, but it seems like SNOs are on the short end of the stick here because of an implementation that is not 100%.

Mark · June 24, 2020, 5:45am

I once heard it’s the opposite. I believe the bloom filter tells the node which files it should keep, and a false positive results in trash being kept instead of going to the trash folder. The missed files are cleaned during the next garbage collection, when a new bloom filter is sent out, assuming they don’t become a false positive again which is unlikely.
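To make the "filter says what to keep" idea concrete, here is a minimal sketch. It is not the Storj implementation: the class `BloomFilter`, the function `collect_garbage`, the filter size and the SHA-256 based hashing are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter; size and hashing scheme are arbitrary."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, piece_id: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{piece_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, piece_id: str) -> None:
        for pos in self._positions(piece_id):
            self.bits[pos] = 1

    def might_contain(self, piece_id: str) -> bool:
        # True for every added ID; occasionally True for an ID never added.
        return all(self.bits[pos] for pos in self._positions(piece_id))


def collect_garbage(stored_pieces, keep_filter):
    """Split stored pieces into (kept, trashed) according to the keep filter."""
    kept, trashed = [], []
    for piece_id in stored_pieces:
        if keep_filter.might_contain(piece_id):
            kept.append(piece_id)
        else:
            trashed.append(piece_id)
    return kept, trashed
```

Because a Bloom filter never reports an added ID as missing, a correctly built filter cannot send a live piece to trash. A false positive only means that a garbage piece survives until a later filter catches it.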

littleskunk · June 24, 2020, 9:17am

The risk with the bloomfilter is not the false positive rate. The risk has to do with the way the satellite generates the bloomfilter. The satellite goes through all pointers and adds the pieceIDs to a bloomfilter. We call that part the metainfo loop. We use the same metainfo loop for the repair checker, accounting, audit reservoir sampling and a few more. Having only one metainfo loop is great for performance, but the downside is additional complexity: every time we touch the metainfo loop we risk breaking the bloomfilter creation.

Worst case, the satellite will send out an empty bloomfilter. We had a bug like that a few months ago on the master branch, but we noticed it before deploying to production. Next time we might not be so lucky and it gets deployed to production. An empty bloomfilter would mean all nodes delete all pieces, and by the time we notice it, it is already too late to stop it. We could close the business and go home.

The 7 days are needed to mitigate that risk. They give us 7 days to send a rollback command to all storage nodes, disable garbage collection for the moment, and roll out a bugfix one release later. The 7 days are just there to have an escape plan for the worst case.
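The worst case described above can be sketched as a toy model. The names `Node`, `garbage_collect`, `restore_trash` and `empty_trash` are hypothetical, and a plain `keep` predicate stands in for the bloomfilter; the only detail taken from the thread is the 7-day window.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # retention window mentioned in the thread


class Node:
    """Toy storage node: live pieces in blobs, collected pieces in trash."""

    def __init__(self, pieces):
        self.blobs = set(pieces)
        self.trash = {}  # piece_id -> time it was moved to trash

    def garbage_collect(self, keep, now):
        """Move every piece for which keep(piece_id) is False into trash."""
        for piece_id in sorted(self.blobs):
            if not keep(piece_id):
                self.trash[piece_id] = now
        self.blobs -= set(self.trash)

    def restore_trash(self):
        """Hypothetical rollback: put everything still in trash back."""
        self.blobs |= set(self.trash)
        self.trash.clear()

    def empty_trash(self, now):
        """Permanently drop pieces that have been in trash for the full window."""
        expired = [p for p, t in self.trash.items() if now - t >= RETENTION]
        for piece_id in expired:
            del self.trash[piece_id]
        return expired


# An "empty filter" keeps nothing, so every piece lands in trash.
t0 = datetime(2020, 6, 23)
node = Node({"a", "b", "c"})
node.garbage_collect(lambda piece_id: False, t0)
# Six days later nothing has been destroyed yet, so a rollback still works.
node.empty_trash(t0 + timedelta(days=6))
node.restore_trash()
```

Without the retention window, the first call would have been an immediate, unrecoverable delete of every piece on the node.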

BrightSilence · June 24, 2020, 9:32am

It is if it was never cleaned up before. This satellite was used for a LOT of testing with random noise. I can imagine that upload errors wouldn’t be such a big deal, since nobody cares about the specific data anyway. Ignore those long enough and don’t clean the zombie segments and you end up with this.
A while ago this satellite went through tons of normal deletes, so most of the data was cleaned up the normal way. Most of what remained after that was apparently zombie segments and stuff. I can see that happening.

No, but there normally shouldn’t be a large amount in there anyway. This is kind of an outlier situation that probably won’t happen again. If you look at it another way, you’ve been paid for zombie data for many months. I’ll forgive them these 7 days as a trade off.

Sasha · June 24, 2020, 10:27am

Hi @littleskunk, can I point you to this thread… I have 850GB of trash
ERROR pieces:trash emptying trash failed - #22 by donald.m.motsinger

Storgeez · June 24, 2020, 1:03pm

I have a lot of trash as well (700GB+). We cannot all have been offline long enough to generate this many missed deletes.

Sasha · June 25, 2020, 4:50am

I think the assumption that SNOs were offline is flawed; otherwise this would be reflected in the uptime dashboard metric. Also, you would be suspended very quickly if this were the case.

There has to be a more logical explanation and it might point to the stefan b satellite deletes possibly not performing the correct delete procedure OR this new thing they’ve discovered “zombie segments”, OR something else. But it’s definitely not the SNO being offline when they’re online.

Alexey · June 25, 2020, 6:54am

You are not. Please read the blueprint for suspension:

github.com/storj/storj/blob/091b49b921060f42b6fd4b2914d0e2aa834a0ae4/docs/blueprints/audit-suspend.md

# Storagenode "Suspension" State Blueprint

## Introduction

Currently, when a storagenode is audited for an erasure share, there are five possible outcomes:

1. Success: The node responds with the correct data
2. Failure: The node responds with incorrect data
3. Offline: The node cannot be contacted
4. Contained: The node can be contacted, but the connection times out before all the data can be received by the satellite
5. Unknown: The node responds with any other error

Only cases 1 and 2 directly affect a node's audit reputation, which can cause disqualification.

When the [downtime tracking service](./storage-node-downtime-tracking.md) is fully implemented, case 3 can indirectly cause a disqualification.

Case 4 can also indirectly cause disqualification, since a node placed in containment mode will be re-audited at some point with the same 5 potential outcomes.

Case 5 is the only situation where there is currently no potential penalty for responding to an audit with some type of error. Fortunately, having this case has allowed us to find, diagnose, and fix several problems with storagenodes, increasing network durability. Unfortunately, it allows us to perceive nodes that consistently respond to audits with unknown errors as "healthy", giving us an inflated view of durability.

wildraven · June 25, 2020, 8:19pm

Well, my trash just dropped from over 160 GB to almost nothing on all my nodes…

node1 · June 25, 2020, 8:42pm

Mine still has 400 GB…

Storgeez · June 25, 2020, 8:50pm

Mine got deleted earlier today. Congratulations on having a minimum of 20 characters to post.