Debugging space usage discrepancies

All bloom filters passed by the satellite so far have been processed successfully, as you can see in the output above.

But you're essentially saying it's not a disk usage discrepancy after all, but actually an idiosyncrasy of the satellite?

But I can see only two. Do you have only two allowed satellites?

No, but I restrict logs to 30MB and remove them on each startup.
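
For context, one way to cap container logs like that is through docker's json-file log driver options. This is only a sketch, assuming the default json-file driver, with the rest of the run command omitted:

docker run ... \
  --log-driver json-file \
  --log-opt max-size=30m \
  --log-opt max-file=1 \
  storjlabs/storagenode:latest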

But this happened:

root@STORJ6:/# docker logs storagenode 2> /dev/null | grep gc-filewalker
2024-02-01T16:24:26Z    INFO    lazyfilewalker.gc-filewalker    starting subprocess      {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-02-01T16:24:26Z    INFO    lazyfilewalker.gc-filewalker    subprocess started       {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-02-01T16:24:26Z    INFO    lazyfilewalker.gc-filewalker.subprocess Database started {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode"}
2024-02-01T16:24:26Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started     {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode", "createdBefore": "2024-01-24T17:59:59Z", "bloomFilterSize": 752311}
2024-02-01T16:24:44Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed   {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode", "piecesCount": 1344635, "piecesSkippedCount": 0}
2024-02-01T16:24:44Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully  {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-02-02T15:43:38Z    INFO    lazyfilewalker.gc-filewalker    starting subprocess      {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-02-02T15:43:38Z    INFO    lazyfilewalker.gc-filewalker    subprocess started       {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-02-02T15:43:38Z    INFO    lazyfilewalker.gc-filewalker.subprocess Database started {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "process": "storagenode"}
2024-02-02T15:43:38Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started     {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "process": "storagenode", "createdBefore": "2024-01-29T17:59:59Z", "bloomFilterSize": 40810}
2024-02-02T15:43:39Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed   {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "process": "storagenode", "piecesCount": 68969, "piecesSkippedCount": 0}
2024-02-02T15:43:39Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully  {"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-02-04T00:29:48Z    INFO    lazyfilewalker.gc-filewalker    starting subprocess      {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-02-04T00:29:48Z    INFO    lazyfilewalker.gc-filewalker    subprocess started       {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-02-04T00:29:48Z    INFO    lazyfilewalker.gc-filewalker.subprocess Database started {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode"}
2024-02-04T00:29:48Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started     {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode", "createdBefore": "2024-01-30T17:59:59Z", "bloomFilterSize": 130561}
2024-02-04T00:29:56Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed   {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "piecesCount": 316686, "piecesSkippedCount": 0, "process": "storagenode"}
2024-02-04T00:29:56Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully  {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}

So, indeed, no disk discrepancy anymore. But essentially about 30% of the node has been trashed in about three days.
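
If you want to see how much of that is sitting in trash per satellite, a quick check (just a sketch: it assumes a typical docker setup with the node's data directory mounted at /mnt/storj/storagenode, so adjust the path to your own mount; each subfolder of trash/ is the base32-encoded ID of one satellite):

du -sh /mnt/storj/storagenode/storage/trash/*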

So, the problem has shifted from a disk usage discrepancy to an idiosyncratic process affecting only some nodes? Essentially suggesting a non-random distribution of files, or something like that?


What the… I looked at this today: ingress is at most 79 GB a day! I'm confused.
Now it's starting to fill the trash with ~300 GB more.
I think the new bigger bloom filter arrived?

My other node also shows this behavior. Seems OK, for cleaning up the network :+1:

Wow, I see THREE satellites (out of 4) so far…

Maybe?

From which node version onward will the changes related to the bloom filter size apply?

It should be from 1.96 onwards, from what I understand.

Once moved to trash, if you feel lucky, you can delete the trash files. You can risk DQ for a week and get some new ingress. Or… wait one more week and let the trash empty naturally. :blush:

I'm sorry, but I don't care about the matter enough; next month it will be fixed.
That is the important part.

On the other hand, we got more coin than we deserve (to let the network grow).
You forget, this is still an early stage on the way to exabyte scale. Some overhead or trash, even at TB scale, does not bother me along the way.


It has already started to clean up. My trash grows day by day and the discrepancy is starting to shrink. Nodes don't need to be on 1.96.6 either; only the satellites do. I'm on 1.95 and it's cleaning too.


Version 1.96 has “12dd732 satellite/gc: make maximum size of the bloom filter configurable”

Is this it?

Do I need to change anything to the node configuration?

That’s it. No, you don’t.


I would suggest not sending bigger bloom filters before the trash cleanup changes are implemented.
The bigger nodes might store TBs of unaccounted data, and once all those TBs of small files are moved to trash, the nodes will struggle for weeks until the majority of the trash is gone.

I was just checking the forum to see if anyone else has seen large deletes. I have a new node that I think just moved 10% of its files to trash. Maybe this decrease in customer data is finally making it to the SNOs?


I don't know how big the bloom filters are now, but my 14TB nodes with the filewalker on, lazy mode off, and discrepancies of almost 2TB are working well and are sending pieces to trash. The trash folder has reached 700GB. No cache, no databases on other drives.

Yeah, but I feel like we're talking past each other and you don't seem to grasp the point I'm trying to make: a discrepancy of one third between actual disk usage and satellite-reported usage is quite extreme. Especially if it turns out to be due to trashing 50% of the data from one satellite (the EU satellite), which did not happen on most of my other nodes, at least five of which use the same IP address and have exactly the same uptime.

They are currently up to 2 MiB in size, which I understand is not enough to move anything to trash on bigger nodes.

Checking one node, the last US bloom filter was received on Feb 1st and it was maxed out at 2097155 bytes.
You can check this in the logs where it says bloomFilterSize:

2024-02-01T11:52:13Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started   {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "createdBefore": "2024-01-24T17:59:59Z", "bloomFilterSize": 2097155, "process": "storagenode"}
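
To pull that out for every satellite at once, something like this should work (a sketch, assuming the container is named storagenode and logs still go to the docker log driver, as in the grep earlier in the thread):

docker logs storagenode 2>&1 \
  | grep 'gc-filewalker started' \
  | grep -oE '"satelliteID": "[^"]+"|"bloomFilterSize": [0-9]+'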

This specific node has 2TB of unaccounted data, and it apparently didn't move anything to trash the last time, despite GC finishing successfully in about two hours with an LVM SSD cache. The trash on this node is at about 43GB at the moment.
The new bloom filters might be, I believe, either 4 or 5 MiB. I believe they are also planning to tune them to reduce the size needed, trading size for more false positives.
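
For a rough sense of that trade-off: the standard bloom filter sizing formula is m = -n * ln(p) / (ln 2)^2, with n the number of pieces and p the false-positive rate. If the satellite targets roughly a 10% false-positive rate (an assumption on my part, not something confirmed in this thread), a 2 MiB filter tops out at around 3.5 million pieces:

awk 'BEGIN {
  m_bits = 2 * 1024 * 1024 * 8     # 2 MiB filter, in bits
  p = 0.10                         # assumed false-positive rate
  n = m_bits * log(2)^2 / -log(p)  # pieces the filter can cover at that rate
  printf "~%.1f million pieces\n", n / 1e6
}'

That would be consistent with the observation above that a maxed-out 2 MiB filter leaves larger nodes, holding many millions of pieces, mostly untouched.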

The thing I'm afraid of is the trash folder cleanup, which is quite I/O heavy at the moment and will be even more so once it holds TBs of trashed data.


I am on fatal log mode and won't change it to info. One of my new nodes has already caught up with the sats, but they are only 1 month old. The big ones are getting close.

I've got the same problem. One of my nodes had 15% of its data sent to trash in one day. Something is happening.

50% of the data that you received from that satellite, not 50% of the whole satellite.