The trash is unpaid?

Not even coming anywhere near that 10% rate based on my own nodes, otherwise I wouldn’t have 10TB stuck in trash with no signs of it getting reduced anytime soon.

When all this ruckus about the test data started I said that 50% of the data will be deleted and I firmly stand by my estimation. My nodes haven’t grown at all, just replaced data with trash for the past month.

They aren’t moved in the sense of physically taking a file and copying it to a different location. Only the file’s metadata is updated; the data itself stays intact where it is. It’s effectively marked for later deletion.
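For illustration, a minimal sketch of what such a metadata-only move could look like (the paths, file names, and layout here are invented, not the actual storagenode implementation). On the same filesystem, a rename only rewrites directory entries, so even a multi-gigabyte piece "moves" instantly:

```python
import os
import pathlib
import time

def move_to_trash(piece_path: str, trash_root: str) -> str:
    """Move a piece into a per-day trash folder via rename.

    On the same filesystem, os.rename is a metadata-only operation:
    the file's data blocks are not touched, only the directory
    entries change. (Illustrative sketch, not Storj's actual layout.)
    """
    day = time.strftime("%Y-%m-%d")
    dest_dir = pathlib.Path(trash_root) / day
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / pathlib.Path(piece_path).name
    os.rename(piece_path, dest)  # fails across filesystems; trash must live on the same disk
    return str(dest)
```

This is also why the trash folder has to live on the same disk as the pieces: a rename across filesystems would degrade into a full copy plus delete.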

3 Likes

Ok, so? Same difference. You can call it anything you want, and rationalize it any way you want; it still doesn’t change the fact that it is just data sitting in a SNO’s storage, waiting to be possibly retrieved, but unpaid because of its special label: “Trash”.

1 Like

You are barking up the wrong tree, I’m on your side :slight_smile:

1 Like

Absolutely, that is one way to solve it. The other is to just not have a trash folder and have either paid data, or no data.

It’s never only a week. Since bloom filters are made from database backups, the minimum is more like 10 days. The theoretical maximum is ∞ because of the bloom filter’s false positive rate.
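The false-positive point can be made concrete with the standard bloom filter formula. Garbage collection deletes only pieces that do NOT match the filter, so a garbage piece survives a run whenever the filter falsely reports "keep". A sketch (the constants below are illustrative, not Storj's actual parameters):

```python
import math

def bloom_false_positive_rate(n: int, m: int, k: int) -> float:
    """Probability that a piece NOT in the filter still matches,
    i.e. garbage that survives a GC run.

    n: pieces actually kept (inserted into the filter)
    m: filter size in bits
    k: number of hash functions

    Standard approximation: p = (1 - e^(-k*n/m))^k
    """
    return (1.0 - math.exp(-k * n / m)) ** k

# Illustrative numbers: ~4.8 bits per kept piece with 3 hash
# functions lands near the 10% rate mentioned in this thread.
p = bloom_false_positive_rate(n=1_000_000, m=4_800_000, k=3)
```

If successive filters use fresh hash seeds, a given garbage piece survives r consecutive runs with probability roughly p^r, which approaches zero but never reaches it, hence the theoretical maximum of ∞.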

2 Likes

I would expect that it’s implemented, since this parameter is not removed and is not marked as obsolete.

@snorkel could you please confirm: is it used, or is the data deleted right away?
I see that BFs have been sent recently:

2024-06-14T07:35:41Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 82, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 44959, "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Duration": "1m31.6300996s", "Retain Status": "enabled"}
2024-06-14T17:15:59Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 666468, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 1719943, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "9h12m47.9765852s", "Retain Status": "enabled"}
2024-06-19T16:18:41Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 141689, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 1765986, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "1h30m10.7027254s", "Retain Status": "enabled"}
2024-06-20T01:48:40Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 7162, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 216616, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "6m54.754044s", "Retain Status": "enabled"}
2024-06-21T15:00:20Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 1205, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 210036, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "2m50.3943145s", "Retain Status": "enabled"}
2024-06-22T13:55:14Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 230, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 209148, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "1m9.0478738s", "Retain Status": "enabled"}

interesting, seems the satellite 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs (EU1) sent three BFs :thinking:

Yes, keeping the garbage for 2-3 weeks was not planned. There were several issues; many of them should be fixed now or in the rolling release.
We are now sending bigger BFs, so they should reach the 10% target.

I see only one problem at the moment that is somewhat related to the garbage: it’s hard for a SNO to check right now because of an issue with gaps in reports from the satellites (Avg disk space used dropped by 60-70%). All existing checks are based on comparing that data with the usage reported by the node (or even by the OS).

The second problem is that many nodes recently started to have “database is locked” issues and, as a result, did not update the usage in their databases.
The third problem is that some nodes also have filewalker failures because of timeouts from the disks, which again leaves the usage in the databases stale.
Neither is a bug as such, but we need to implement some improvements here.

4 Likes

Oof, there is quite a lot of misinformation in this topic, even some that sounds really in-depth and knowledgeable. Trash is meant to recover from ANY technical issue that caused nodes to move data to trash for the wrong reasons. The last issue I remember it being used for was when files created with server-side copy had one instance removed, and deletes were sent to nodes even though other copies remained. Database rollbacks are a possible issue, but clearly not the only one. The initial main reason was in case satellites send out wrong bloom filters for whatever reason. Either way, it is NOT an end-user-facing feature.
Large amounts of trash are NOT a signal of bloom filters not working correctly; trash is the RESULT of a garbage collection run. Clearly the bloom filter worked or the data wouldn’t be in trash. Now there is definitely also a big issue with uncollected garbage, which is a lot less visible as that amount isn’t reported anywhere and is only estimated in the earnings calculator I made. And I can tell you that on my nodes uncollected garbage is a much bigger problem than trash at the moment. GC is working very slowly to resolve this, and bloom filters have been sent unreliably, with some satellites not sending them at all for a while or using old snapshots. It’s slowly getting better from what I can tell, but some of the needed fixes are stuck in very slow node version rollouts.
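The "only estimated" part boils down to simple arithmetic: anything on disk that the satellites neither pay for nor count as trash is presumably garbage awaiting collection. A rough sketch in the spirit of that estimate (not the earnings calculator's actual code; the parameter names are invented):

```python
def estimate_uncollected_garbage(disk_used: int,
                                 satellite_reported: int,
                                 trash: int) -> int:
    """Rough estimate of uncollected garbage, in bytes.

    disk_used:           what the OS / used-space filewalker sees on disk
    satellite_reported:  what the satellites report as paid storage
    trash:               what the node has already moved to trash

    A negative result indicates reporting gaps (e.g. the missing
    satellite reports discussed in this thread), so clamp to zero.
    """
    return max(0, disk_used - satellite_reported - trash)

# e.g. 10 TB on disk, 6 TB paid, 1 TB in trash -> ~3 TB uncollected
TB = 10**12
estimate = estimate_uncollected_garbage(10 * TB, 6 * TB, 1 * TB)
```

The weakness is obvious from the inputs: if either the satellite reports or the node's own databases are stale, the estimate is off, which is exactly why the current reporting gaps make this so hard to pin down.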

I’m not happy about the current situation, let me be clear about that. But let’s try to at least get our information straight. I don’t mind storing data in trash for a week though if that’s what’s needed to recover from possible technical issues. I do mind storing uncollected garbage for much longer than that or unnecessarily storing things in trash due to issues with test data being overwritten and caught in GC instead of TTL deletes (which skip trash).

9 Likes

Awfully slow. Even just to revert the bandwidth thing.

Count me in. I am absolutely not satisfied since it was discovered that the used space is not correct at all. Since then it feels like everything is a big and ugly mess. At least on the storagenode side. Storj sales seems to do well.

This is still 1/4 of a month. A couple of days we could talk about. But I prefer to get paid for my nodes being used; this is what the agreement was all about. And again I am saying: if it costs Storj money to do so, they will limit the usage.
If they had to pay for the uncollected garbage, they would be very quick to find a solution for faster removal.
We need faster collection and then more frequent bloom filters. Normally, if a customer deletes something, it should be in the trash the very next day.

3 Likes

Sure, but consider the other side. If there is an issue, in some cases customers need to find it and report it first. Then Storj needs to investigate the cause, determine which files were wrongly deleted, and send out a signal to the nodes to restore from trash. Depending on the amount, nodes may also need some time to restore from trash. 7 days is not very long to begin with for all that.

As for paying for trash or uncollected garbage: I don’t even see a way they could do that. They aren’t even aware of what uncollected garbage your node holds, and by extension the same goes for trash. If they are going to trust the nodes to report this, I’ll add some random uncollected garbage myself.

Now I do agree that it sucks that they are not sending normal deletes to nodes anymore. Handling everything with garbage collection has led to much more unpaid data being on nodes.

3 Likes

But if we look at the original plan to send a deletion signal to the nodes the moment the customer deletes a file, there was also no safety net. A deleted piece was a deleted piece.

My memory could be wrong but I believe that process was introduced to catch the pieces left over on nodes that did not receive the original signal.

And now we are talking about a general safety net in case something goes wrong.

This sounds like a total shift of the original idea onto the backs of the SNOs. If such a safety net is required, then pay for it. And if 7 days is too short, which I agree it is (not for the SNO but for Storj to find, resolve and roll back), then make it 14 days or a month, I don’t mind. But pay for it.

5 Likes

You don’t need a safety net if you have a signed signal from the customer to delete a file. Technical issues in that case could only lead to wrongly not deleting the file.

And yes, using GC for everything now was a shortcut and quite an inefficient one as well. I believe it was done to fix the issue with deletes when server side copy was used. It would be much better if they ran some logic on the satellite to determine which segments are fully deleted and still send that list to nodes.

With that in mind, I see the argument to pay for trash. But there are practical complications with that as Storj doesn’t know what’s in your trash. I guess they could keep paying for files for 7 days after they were deleted. But they could be in uncollected garbage much longer than that. So that doesn’t provide any additional incentive for them to fix the current issues.

1 Like

I can’t remember the exact reason why this was changed. But deleting immediately still sounds like the best and most straightforward idea.

Direct deletes were painfully slow for the customer, hence the shift. You want the whole experience to be smooth like butter for the customer.

4 Likes

Pretty sure I answered that.

It was asynchronous long before they let GC take over. What happens after the customer’s delete request is handled doesn’t impact the customer anymore, whether satellites send that signal on to nodes or not. I also wouldn’t mind them logging deletions and sending less frequent delete logs to nodes, like once an hour or even once a day. GC is just a really slow way to do it for nodes.
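The delete-log idea could be as simple as queueing per-node deletes on the satellite and flushing them in batches. A hypothetical sketch (none of these class or method names exist in the real codebase; a timer would call `flush()` hourly or daily):

```python
from collections import defaultdict
from typing import Callable, List

class DeleteLogBatcher:
    """Hypothetical sketch: accumulate delete notices per node and
    deliver them in batches instead of contacting every node for
    every customer delete (the slow part of direct deletes)."""

    def __init__(self, send: Callable[[str, List[str]], None]):
        # send(node_id, piece_ids) delivers one batch to one node
        self.send = send
        self.pending = defaultdict(list)

    def record_delete(self, node_id: str, piece_id: str) -> None:
        """Called synchronously when a customer delete is processed."""
        self.pending[node_id].append(piece_id)

    def flush(self) -> None:
        """Called on a timer, e.g. hourly: push batches, then reset."""
        for node_id, piece_ids in self.pending.items():
            self.send(node_id, piece_ids)
        self.pending.clear()
```

The customer-facing path stays fast (just an append), while nodes get explicit deletes within hours instead of waiting weeks for a bloom filter.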

2 Likes

I don’t know how this was implemented or what exactly made it slow.

But it sounds like originally the deletion request was sent directly to the nodes, while today it is sent to the satellite, which signals to the customer that the files are deleted and sends out the bloom filter some days later.

So my idea would be receive and relay: the customer could still receive the signal that the files have been deleted (or marked for deletion) from the satellite, but then the satellites send out the deletion signal to the nodes on behalf of the customers.
And if that’s too much load for the satellite, maybe the signal could be distributed via the nodes.

Seems it is happening right now:

The problems I can see are more broad:

  1. Bloom filters are sent by satellites way too late.
  2. Node software is of beta rather than release quality. It must be prepared to handle all those situations when something goes wrong rather than only being ready for the happy path, including being able to resume after all kinds of failures without losing the progress made so far.
  3. Better visibility of what’s going on with the node, e.g. log the current status every hour (or something like that), rather than wondering, three days in, what a particular filewalker has achieved so far and what its ETA is.

It’s required to be sent after the fact so as not to delete what shouldn’t be removed. We should be very careful here. The minimum safe delay is 24h in my opinion. But probably we need to increase the frequency instead.

we are working on this.

  • The “retain” already handles a resume.
  • Inability of the hardware to read or write is handled too - the node will crash to protect itself from disqualification. If it’s a docker node, it will be restarted automatically. The Windows service does not restart by default, but you may configure that too. However, if the node keeps crashing, you need to fix the hardware issue or increase the timeout (which unfortunately increases the risk of disqualification, so it’s better to solve the hardware issue instead).
  • The node won’t start if the mount point is not mounted
  • The node won’t start if the SNO provided a wrong identity
  • etc.
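The resume behaviour mentioned for retain comes down to persisting a progress marker atomically and reading it back on startup. A generic sketch (file name and format invented for illustration, not the storagenode's actual mechanism):

```python
import json
import os
import tempfile

def save_checkpoint(path: str, last_piece_id: str) -> None:
    """Persist the last processed piece so a retain run can resume
    after a crash. Write-to-temp + os.replace keeps the update
    atomic: a crash mid-write leaves the old checkpoint intact."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"last_piece_id": last_piece_id}, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

def load_checkpoint(path: str):
    """Return the last processed piece id, or None for a fresh start."""
    try:
        with open(path) as f:
            return json.load(f)["last_piece_id"]
    except FileNotFoundError:
        return None
```

On restart, the walker skips everything up to and including the checkpointed piece instead of starting from zero, which is the "resume" being described.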

Please do not think that we are not working on that; we have limited resources, so SNOs are forced to wait. However, the Community may also help by contributing fixes as PRs, which will speed up the implementation of the needed features.

The Community pull requests are very welcome!

1 Like

I can’t provide every single scenario in every single reply, I’m assuming that the reader applies some general understanding about a topic I reply to. Sorry if that came out as misinformation.

I didn’t say it was a signal of blooms not working. There has been uncollected garbage piling up on my nodes (YMMV) for years. The old bloom filters fell apart around 8TB of stored data; after that the nodes could not keep up with the uncollected garbage. The stored amount went up and trash stayed the same, and the monthly payout was lower every month after 8TB. This is very important because it shows that the data stored on disk doesn’t match what is counted as stored. This was across a dozen nodes running on half a dozen different hosts. This leads us to my next point.

Currently the only activity I see from my nodes is constantly moving data into trash. That means that either the clients are constantly deleting (not the case: US1, EU1 and AP1 show a slight decrease, Saltlake is up), or there IS a lot of uncollected garbage that the bloom filters keep missing time and time again. I don’t see the 10% rate for missed data; in my case it’s 50%+. Again, YMMV. Let me be crystal clear: no matter how much data is moved to trash, no matter how much of that is actually deleted from trash, my trash value isn’t dropping below ~10TB across all my nodes. Day after day, week after week, month after month. I could say year after year as well. The used-space filewalker was run twice in the past month for every node, and I’m now running it one by one for a third time.

100% agree.

Disclaimer: your node’s behavior may or may not match my observations. It depends on a lot of different factors, including but not limited to bandwidth, disk capabilities, processing power, and distance (latency-wise) from a client and/or satellite. These tests were conducted in a controlled environment by professionals and should not be replicated at home.

2 Likes