Current situation with garbage collection

Ah, sorry, I somehow missed the number of pieces given; I just took a guess. 20 TB is a nice round number still within the recommended size of a node, and if such a node held a true random sample of all pieces in the network, that would be the number of pieces it would store.

For BrightSilence’s node, that calculator gives a false positive rate of 21.6% for 4 hashes, while for 2 hashes it drops to 19.0%. For this node the difference is small, but the effect compounds after a few iterations.

Given that the average piece size keeps getting smaller while nodes keep growing, it might still be worth making this small change in the code to re-estimate the optimal number of hashes.
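For reference, a minimal sketch of the standard Bloom-filter approximations behind these numbers: FPR ≈ (1 - e^(-kn/m))^k and optimal k ≈ (m/n)·ln 2, with n elements, m bits and k hashes. The filter size and piece count used below are simply the figures from the retain log quoted later in this thread, used purely as an illustration:

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate is the standard Bloom-filter approximation
// (1 - e^(-k*n/m))^k for n elements, m bits and k hash functions.
func falsePositiveRate(n, m float64, k int) float64 {
	return math.Pow(1-math.Exp(-float64(k)*n/m), float64(k))
}

// optimalHashes is the textbook optimum k = (m/n) * ln 2, rounded and
// clamped to at least one hash.
func optimalHashes(n, m float64) int {
	k := int(math.Round(m / n * math.Ln2))
	if k < 1 {
		k = 1
	}
	return k
}

func main() {
	// Illustration only: the 10,000,003-byte filter size and ~41.3 million
	// piece count are taken from the retain log quoted later in this thread.
	m := 10000003.0 * 8 // filter size in bits
	n := 41333620.0     // pieces held for the satellite

	for _, k := range []int{1, 2, 3, 4} {
		fmt.Printf("k=%d  FPR=%.1f%%\n", k, 100*falsePositiveRate(n, m, k))
	}
	fmt.Println("optimal k:", optimalHashes(n, m))
}
```

With these illustrative numbers the optimum already drops to a single hash.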

Finally, the new 10MB BF was processed. This is my oldest node: 40 months, Synology, 18GB RAM, Exos drive - almost 24h :blush:. Ready for the next one.

2024-05-03T01:17:10Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-04-23T17:59:59Z", "Filter Size": 10000003, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-05-04T00:42:33Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 8841137, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 41333620, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "23h25m22.955227817s", "Retain Status": "enabled"}

The Used Space finally looks good, coming down from 12.4TB.

Oh, that’s a nice number.

Going from k=3 (recommended for the optimal node size) to k=1 here reduces the false positive rate from 48.9% to 40.3%.
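A quick check of those figures against the same approximation, assuming n ≈ 41.3 million pieces and a 10 MB ≈ 8·10^7-bit filter, so n/m ≈ 0.52:

$$\mathrm{FPR}(k) \approx \left(1 - e^{-kn/m}\right)^k,\qquad \mathrm{FPR}(3) \approx \left(1 - e^{-1.55}\right)^3 \approx 48.9\%,\qquad \mathrm{FPR}(1) \approx 1 - e^{-0.52} \approx 40.3\%$$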

2 Likes

Totally agree. Fair point.

As far as I can see, the difference between the current calculation and the n/m calculation is only 1-5%, but we need those percentages…

4 Likes

My biggest nodes finished retain: 7 machines running 2 nodes each, 1 running a single node.

c21 - 1GB RAM
2024-05-03T02:38:19Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 497512, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 14971763, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "4h54m57.465579453s", "Retain Status": "enabled"}
c22 - 1GB RAM
2024-05-03T11:24:50Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 870972, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 16702227, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "6h50m56.643776581s", "Retain Status": "enabled"}
o11 - 10GB RAM
2024-05-04T11:56:39Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 8920068, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 39679616, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "38h13m18.232162572s", "Retain Status": "enabled"}
p11 - 18GB RAM
2024-05-04T00:42:33Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 8841137, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 41333620, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "23h25m22.955227817s", "Retain Status": "enabled"}
b11 - 18GB RAM
2024-05-04T07:29:39Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 10111171, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 52834911, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "32h36m50.416953082s", "Retain Status": "enabled"}
c11 - 18GB RAM
2024-05-03T20:01:26Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 8309277, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 38004048, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "19h28m55.526847405s", "Retain Status": "enabled"}
g11 - 18GB RAM
2024-05-03T19:46:26Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 9021317, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 45557230, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "24h40m13.705897155s", "Retain Status": "enabled"}
o21 - 18GB RAM
2024-05-04T00:25:33Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 9798363, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 46085859, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "27h52m2.24455936s", "Retain Status": "enabled"}
r11 - 18GB RAM, space used 11.02TB, trash 1.74TB, sat report 10.2TB
2024-05-03T23:02:28Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 9784271, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 45714626, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "28h54m40.065068586s", "Retain Status": "enabled"}

I put the dashboard data for the last one as a reference; this is now, after retain.
RAM has the biggest influence on walker speed, for both the used-space and retain walkers.
You can see the difference between o11 and p11, which have almost the same piece count and number of pieces removed: 38h with 10GB vs 23h with 18GB. The p11 machine runs 2 nodes, o11 runs one.
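To put numbers on that comparison, the throughput implied by those two retain log lines (simple arithmetic, durations rounded to whole seconds):

$$\text{o11: } \frac{39{,}679{,}616\ \text{pieces}}{137{,}598\ \text{s}} \approx 288\ \text{pieces/s} \qquad\qquad \text{p11: } \frac{41{,}333{,}620\ \text{pieces}}{84{,}323\ \text{s}} \approx 490\ \text{pieces/s}$$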

You should start sending bloom filters for the other satellites too, because the trash is empty for them and I see the discrepancy between the satellites’ average and the used space increasing by 100GB/day.
Anything else… the garbage folder is empty, temp has no old pieces, and the trash folder looks good: no old leftovers or recursive folders.
I haven’t compared the dashboard with the trash size on disk to confirm or refute what others report about trash not being updated in the database.
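For anyone who wants to check that, here is a minimal sketch that sums the trash size on disk for comparison with the dashboard; the trash path below is an assumption and needs to be adjusted to your own layout:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	// Path to the node's trash directory; this exact path is an assumption,
	// not taken from the thread - adjust it to your own setup.
	trashDir := "/storagenode/storage/trash"

	var total int64
	err := filepath.WalkDir(trashDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			info, err := d.Info()
			if err != nil {
				return err
			}
			total += info.Size()
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}
	fmt.Printf("trash on disk: %.2f GB\n", float64(total)/1e9)
}
```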

We are sending, as far as I know.

UPDATE:

We are generating 10MB bloom filters, but we didn’t send out any new ones during the weekend.

We noticed a problem: some clients reported that pieces were trashed which shouldn’t have been.

This problem looks to be related to one storagenode only; we continue to send out the 10MB filters.

The weekend was spent generating new US BFs, which is close to done (2.7B segments out of ~3B+); they will be sent out very soon (sending will start within 1 day).

We are discussing the following changes for the coming weeks:

  • Sending out only 10MB+ bloom filters, no more 4MB ones. The majority of nodes are already updated to version 1.101+, so we don’t need to wait any longer.
  • Gradually increasing the 10MB maximum to 15-18MB.
  • The BF generator will always use free capacity to generate new BFs instead of waiting for the next scheduled period (which means a 4-5 day BF period instead of 7 days).
9 Likes

Wow! That’s not good! Can you tell us more?
If it’s only one SN, the operator should be notified and asked for an explanation. Maybe he modified the storagenode software, has some strange settings, or has a virus.
Is there a way to verify that all the other nodes trashed the correct pieces?
We should be really sure, before moving forward, that we are not all affected.
This is exactly why we should not delete the trash manually. Let the node software do its thing.

3 Likes

We have an alert for that. The way it works is that the storage node treats pieces in the trash folder differently: if an audit comes in, the node will restore the piece from trash and respond to the audit, but with one additional flag. That flag signals the satellite that something is wrong, and the satellite informs us that one of the nodes has returned the alert flag.
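Roughly, the mechanism described above looks like the sketch below. This is not the actual storagenode code; the type and field names are hypothetical and only illustrate the trash fallback plus the extra flag:

```go
package main

import (
	"errors"
	"fmt"
)

// AuditResponse is a hypothetical stand-in for the node's reply to an audit;
// RestoredFromTrash is the extra flag described above.
type AuditResponse struct {
	PieceData         []byte
	RestoredFromTrash bool // tells the satellite the piece had been trashed
}

// PieceStore is a hypothetical minimal piece store with a primary blob area
// and a trash area.
type PieceStore struct {
	blobs map[string][]byte
	trash map[string][]byte
}

var errNotFound = errors.New("piece not found")

// ServeAudit looks in the normal blob area first. If the piece is only found
// in trash, it is restored and the response is flagged so the satellite can
// alert on an unexpected trashing.
func (s *PieceStore) ServeAudit(pieceID string) (AuditResponse, error) {
	if data, ok := s.blobs[pieceID]; ok {
		return AuditResponse{PieceData: data}, nil
	}
	if data, ok := s.trash[pieceID]; ok {
		// Restore the piece so it is served normally from now on.
		s.blobs[pieceID] = data
		delete(s.trash, pieceID)
		return AuditResponse{PieceData: data, RestoredFromTrash: true}, nil
	}
	return AuditResponse{}, errNotFound
}

func main() {
	store := &PieceStore{
		blobs: map[string][]byte{},
		trash: map[string][]byte{"piece-A": []byte("payload")},
	}
	resp, err := store.ServeAudit("piece-A")
	fmt.Println(resp.RestoredFromTrash, err) // true <nil>
}
```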

4 Likes

Yes, we are working on it and doing things very similar to what you suggested. That’s also why we stopped sending out BFs for the weekend, just to be extra cautious.

Audits continuously monitor the network, but in addition to that, the storagenode can signal to the client when the requested pieces were served from trash instead of the normal directory. This is where we got the alerts (clients on gateway-mt reported the issue).

That happened for one single storagenode, and only for a few dozen requests out of thousands. The affected storagenode was not part of the 10MB experiments (when I sent out BFs manually).

We are still investigating the issue (and keeping our eyes on all metrics). There could be other reasons for this as well, but some of them would affect multiple storagenodes (too-early TTL deletion? time skew? or just some specific cheating strategy…)

So there’s no need to panic, and I strongly agree with your call to be careful…

5 Likes

It seems that 99.5% of the data on my node is uncollected garbage (this has grown rapidly in the last 24 hours, from 25% to nearly 100%). Is this part of the same problem? Is there anything I can do to avoid the issue going forward?

uncollected garbage

Let it be. Once you gather 10+ TB of data, you can keep an eye on the numbers. In the GB range, just ignore it; it’s normal.

Same for me with a big node of 14TB. It seems to be a problem with the earnings script (I use version 13.4.0).

The dashboard tells me the same. You can see the drop in stored data on the graph on May 6th.

1 Like

My node was started only a week ago and isn’t even vetted yet, but it already consists mostly of garbage.

This surprises me. It doesn’t seem like a problem of the garbage collector not picking things up; it’s as if the satellites are sending garbage for storage.

Hey @Seb and @wildwaffle, please see the post below.

3 Likes

This is a separate issue. The graph shows info from the satellites: they send reports to the nodes regularly, but if the tally takes more time than expected, your node will not receive a report in time, so there will be a gap.
It usually closes over the next few days, when your node receives the missing reports.

Please see this comment: Announcement: major storage node release (potential config changes needed!) - #70 by pdeline06