The trash is unpaid?

For US1 my numbers are 6M pieces per TB. How big would the bloom filter for a 24 TB node be now?

I linked the calculator, you can just change the numbers.

1 Like

If I’m using the calculator that @Toyoo provided correctly, it comes out to ~82MB
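For reference, here is a minimal sketch of the standard bloom filter sizing formula behind that estimate, assuming ~6M pieces per TB and the ~10% false-positive target mentioned later in the thread (the exact output depends on which rounding the calculator uses):

```go
package main

import (
	"fmt"
	"math"
)

// bloomFilterBytes returns the size in bytes of an optimally sized bloom
// filter for n elements at false-positive rate p, using the standard
// formula m = -n*ln(p)/(ln 2)^2 bits.
func bloomFilterBytes(n, p float64) float64 {
	bits := -n * math.Log(p) / (math.Ln2 * math.Ln2)
	return bits / 8
}

func main() {
	// Assumption: ~6M pieces per TB (from the post above) and a 10%
	// false-positive rate for the garbage-collection filter.
	pieces := 24.0 * 6e6 // a 24 TB node -> ~144M pieces
	fmt.Printf("~%.0f MB\n", bloomFilterBytes(pieces, 0.10)/1e6)
	// Prints ~86 MB, in the same ballpark as the ~82 MB quoted above.
}
```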

Yeah, that’s why I said it’s tricky to estimate. It used to be simple to translate the average segment size into the average piece size by dividing by 29. That’s no longer the case. Right now I see an average segment size of 4.29 MB, which with the old RS numbers leads to a 148 kB piece size. Then 24 TB would store 162M pieces, which doesn’t sound plausible to me. Only @littleskunk knows how to estimate the relevant numbers now.

Why shouldn’t this be plausible? My real numbers for the average file size are close to that:

| Satellite | Average file size (bytes) |
|---|---|
| SLC | 141879 |
| AP1 | 253094 |
| US1 | 164628 |
| EU1 | 677965 |
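A quick cross-check of those figures, assuming the table lists the average piece (file) size in bytes that each satellite’s data has on the node:

```go
package main

import "fmt"

func main() {
	// Assumption: values are average piece sizes in bytes, per satellite.
	avgPieceSize := map[string]float64{
		"SLC": 141879, "AP1": 253094, "US1": 164628, "EU1": 677965,
	}
	for sat, size := range avgPieceSize {
		fmt.Printf("%s: ~%.1fM pieces per TB\n", sat, 1e12/size/1e6)
	}
	// US1 works out to ~6.1M pieces per TB, in line with the "6M pieces per
	// TB" figure, and ~148 kB/piece over 24 TB gives the ~162M-piece
	// estimate from the reply above.
}
```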

2 Likes

Ok, well, indeed… that’s scary then. And disappointing. And discouraging.

And people were right all along to complain. And something should be done about it at a high priority. And thanks to the community for identifying that a critical part of the software isn’t working as expected.

Please go on, don’t stop there :slight_smile:

1 Like

Not for the customer. They basically cannot undelete these pieces; that feature is usually implemented client-side, not server-side. But you are correct, these pieces will be permanently deleted only after a week at the earliest.

Our trash folder is a technical protection against any error in server-side deletion; it’s not offered as a backup option for the customer. If it were, the customer would pay for it and thus nodes would be paid too. Unfortunately that’s not the case: this is part of the protocol, not a feature for the customer.
As @BrightSilence said:

You are correct, the garbage situation should be fixed, and as fast as possible, because keeping removed data longer than designed is not a feature, it’s a bug.
And as far as I know, the bigger BF is already implemented.
We need to check whether it is enough for your big nodes now, but I do not know how to do that, since you have

You should start by seeing whether the theoretical limits (the numbers provided by others a couple of replies back) are actually workable for the nodes out there. If theory says the bloom filters can’t reliably give a 10% fail rate at the maximum expected node size, then in practice they can’t, and we don’t need to analyze any logs or node behavior as to why. They are simply too small.

If theory says “nope, the current bloom filters are undeniably within the 10% fail rate”, then I’m sure someone out there is running a 20TB node at the info log level and can share their logs.

I could have sworn that I saw a 24TB max node size being mentioned in the documentation but can’t find it now. If my car’s manufacturer says this car can go up to 220km/h but I can only get it up to 120km/h, then something is either wrong with the car or the documentation. That should be the node target size for estimating the size of the bloom filters. There is no point in concentrating on 500GB nodes if the documentation says up to 24TB, IMHO.

Actually, even for a required 82 MB we would need to send 6 BFs of 14 MB each, I think. And perhaps we already do so; at least I saw this in my logs:

Wait, how does splitting a BF work? That means each of those filters is basically telling the node to keep everything, since if the filter is that small, you can’t have the node deleting half of its stored data in one go, or am I misunderstanding something?

I read this explanation:

and all tasks there:

I assumed that it’s possible. However, I’d prefer to get an explanation from @elek

In the first reply on the GitHub issue tracker, the satellite value is 26_000_000 (which I’m assuming is the number of pieces). I know “pieces” probably isn’t the right word and someone will come and correct me as soon as possible instead of focusing on the rest of this reply.

Ok, sounds like it checks out based on two independent reports. An 8TB node that exited Saltlake is split basically evenly between US1 and EU1 (AP1 isn’t even worth mentioning). Let’s go with 4TB * 6M = 24M pieces (which is the wrong word, I know; someone will correct me soon, as I said).

Let’s be clear on this: the old 4MB situation fell apart for every node seeing about 4TB of usage per satellite. Beyond that, you are basically storing data indefinitely with no payout. This was true up until the bloom filter expansion.
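A hedged illustration of that claim, assuming the old filters were capped at roughly 4 MB and reusing the ~24M-piece figure from the previous reply (not Storj’s actual parameters):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Why a fixed-size filter falls apart: the best achievable
	// false-positive rate for m bits over n kept pieces is roughly
	// p ≈ 0.6185^(m/n). Assumes a ~4 MB filter cap and ~24M pieces
	// (4 TB for one satellite at ~6M pieces/TB).
	const mBits = 4e6 * 8 // ~4 MB filter
	const n = 24e6        // ~24M pieces for one satellite
	p := math.Pow(0.6185, mBits/n)
	fmt.Printf("best-case false-positive rate: ~%.0f%%\n", p*100)
	// Prints ~53%: more than half of the garbage pieces would survive each
	// garbage-collection pass, so unpaid trash piles up instead of shrinking.
}
```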

If the blooms are getting bigger, that’s a step in the right direction. The GitHub issue, though, is 6 months old. I can understand “not a priority right now”, but that raises the question of when exactly it will be a priority. When Storj starts posting that “we asked everyone to add space and nobody added a single byte!!!”?

Is this too offensive and/or hostile or am I justified in being a tiny bit agitated?

We don’t really split the BF; it should be one big byte array. For technical reasons, big bloom filters are sent in multiple requests, but the last request contains a hash/checksum; the chunks are concatenated on the SN side and the hash is checked.

Logs show the processing of the full, concatenated BFs.
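A minimal sketch of that scheme (chunked send, concatenate, verify hash), not the actual satellite/storagenode code; the ~14 MB chunk size is taken from the earlier post and assumed to be a transport message-size limit, and `splitFilter`/`reassemble` are hypothetical names:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

const chunkSize = 14 << 20 // assumed per-request limit (~14 MB)

// splitFilter cuts one big filter into chunks and returns the checksum of
// the whole filter, to be delivered with the last chunk.
func splitFilter(filter []byte) (chunks [][]byte, sum [32]byte) {
	sum = sha256.Sum256(filter)
	for len(filter) > 0 {
		n := min(chunkSize, len(filter))
		chunks = append(chunks, filter[:n])
		filter = filter[n:]
	}
	return chunks, sum
}

// reassemble concatenates the received chunks and verifies the checksum
// before the node would use the filter for garbage collection.
func reassemble(chunks [][]byte, want [32]byte) ([]byte, error) {
	full := bytes.Join(chunks, nil)
	if sha256.Sum256(full) != want {
		return nil, fmt.Errorf("bloom filter checksum mismatch")
	}
	return full, nil
}

func main() {
	filter := make([]byte, 82<<20) // e.g. an ~82 MB filter
	chunks, sum := splitFilter(filter)
	full, err := reassemble(chunks, sum)
	fmt.Println(len(chunks), len(full), err) // 6 chunks, one full filter, <nil>
}
```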

You may see more frequent BF generation as we try to send them out more frequently, but there are new problems related to Saltlake: it has a lot of new segments, and the scheduling should be adjusted (we need enough time to delete the US1 data from the BF generation machine and restore SLC). That’s a technically easy adjustment, but it makes BF generation hard to predict.

6 Likes

Is the file under the retain directory that triggers the actual GC run the “complete” bloom filter? What’s the limitation on the node’s side to have those huge (100MB) bloom filters, RAM?

Judging by how many times you repeat that, it looks like you sincerely believe it. For some reason it reminds me of one of the characters from “Django Unchained”. :sweat_smile:

2 Likes

Yeah, I get it.
All I mean is: just make the customer pay for the protocol. In the end, it’s the customer who uses it, with all the good… and some bad, like this one.
Because there is no free lunch; someone always has to pay for it.

Well, if true, then wow. If a trash file’s time in the storagenode were, say, 11 days,
then imagine the havoc for SNOs if the majority of data becomes TTL data with mostly 30-day expiry: the constant rotation would keep a good chunk of the node permanently unpaid because of trash sitting there beyond 7 days. NOW that’s a little more serious, if 30% of a full disk happens to be unpaid, ya know? Hah… oh boy.

Besides, my full used-space filewalker now takes about 9.5h for 14TB of space (upgrades).
If the retain run for the bloom filter is similar in scanning all files (is it?),
then how often do you want to run it? If it runs often, it will impact the node’s disk access time for customers getting egress (read time from disk).

I am sure I have given the relevant numbers to all of you. Current RS numbers are 16/20/30/38. We might still change them a bit but for now that is what you get. (SLC only)
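As a rough illustration only, assuming those four values are the usual minimum/repair/success/total share counts and reusing the 4.29 MB average segment size quoted earlier in the thread:

```go
package main

import "fmt"

func main() {
	// Assumption: 16/20/30/38 = minimum/repair/success/total shares; each
	// piece is 1/minimum of the segment, and the worst-case expansion
	// factor is total/minimum. SLC only, per the post above.
	const minShares, total = 16, 38
	avgSegment := 4.29e6 // average segment size quoted earlier (bytes)
	pieceSize := avgSegment / minShares
	fmt.Printf("avg piece size: ~%.0f kB\n", pieceSize/1e3)           // ~268 kB
	fmt.Printf("expansion factor: %.2fx\n", float64(total)/minShares) // 2.38x
	fmt.Printf("pieces per TB: ~%.1fM\n", 1e12/pieceSize/1e6)         // ~3.7M
}
```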

3 Likes

Just to be clear for everyone, each satellite has its own bloom filters. So if you have a 24TB node and you run all 4 active sats, then you get 4 bloom filters, each covering one satellite’s data, that satellite’s slice of the 24TB pie. You could have 10TB stored for US1, 10TB for EU1, 1TB for AP1 and 3TB for SL.
You just need a BF big enough to cover those 10TB for the big sats, not the entire 24TB (see the sizing sketch below).
Stop making wrong calculations.
My question is: does the TTL data from Saltlake need a BF as well?
Because on my nodes it is starting to approach 20TB of test data. If that needs a BF, then that’s a problem.
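A per-satellite sizing sketch for that example split, assuming ~6M pieces/TB and a 10% false-positive target (the same formula as earlier in the thread):

```go
package main

import (
	"fmt"
	"math"
)

// bloomMB estimates the filter size in MB for the data one satellite stores
// on the node, assuming ~6M pieces/TB and a 10% false-positive rate.
func bloomMB(tb float64) float64 {
	n := tb * 6e6
	return -n * math.Log(0.10) / (math.Ln2 * math.Ln2) / 8 / 1e6
}

func main() {
	// The hypothetical 10/10/1/3 TB split from the post above.
	for sat, tb := range map[string]float64{"US1": 10, "EU1": 10, "AP1": 1, "SL": 3} {
		fmt.Printf("%s: %.0f TB -> ~%.0f MB filter\n", sat, tb, bloomMB(tb))
	}
	// US1 and EU1 each need a ~36 MB filter for their 10 TB; no single
	// filter ever has to cover the node's full 24 TB.
}
```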

1 Like