Bloom Filter Processing Order Issue: Held Back Filters Not Processed

jammerdan · August 10, 2024, 4:06am

Here is something that I just noted:

Due to the known issue

github.com/storj/storj

High CPU usage by node caused by receiving a new bloom filter while still processing the old one. Also new gc won't start after processing old bloom filter

opened 06:01PM - 05 May 24 UTC

andrewadp

Bug

## Description Receiving a new bloom filter while still processing the old one …cause high CPU usage by "storagenode" process. Also new gc won't start after processing old bloom filter. ## Steps to reproduce the issue: 1. The node receives a bloom filter from the US satellite and starts the gc-filewalker: ```2024-04-22T03:18:05Z INFO retain Prepared to run a Retain request. {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-04-10T17:59:59Z", "Filter Size": 1351678, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYLZHeLUtdps3S"} 2024-04-22T03:18:05Z INFO lazyfilewalker.gc-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"} 2024-04-22T03:18:05Z INFO lazyfilewalker.gc-filewalker subprocess started {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"} 2024-04-22T03:18:05Z INFO lazyfilewalker.gc-filewalker.subprocess Database started {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode"} 2024-04-22T03:18:05Z INFO lazyfilewalker.gc-filewalker.subprocess gc-filewalker started {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode", "createdBefore": "2024-04-10T17:59:59Z", "bloomFilterSize": 1351678} ``` 2. The node is large and using lazy mode. Garbage collection takes few days but hasn’t finished yet when a new bloom filter is received from the US satellite. After that, the node starts to create high CPU load. Additionally, bloom filters from other satellites were received and processed normally during gc caused by US bloom from step 1 (not sure if it affects the problem but it should be noticed) 3. The GC, which started in step 1, finally finished, but for some reason, the node doesn’t remove the old bloom filter and doesn’t start a new GC-filewalker for the bloom filter received in step 2. High CPU usage doesnt go away. No new gc for US satellite after that records in log: ```2024-04-28T13:45:38Z INFO lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "piecesSkippedCount": 0, "Process": "storagenode", "piecesCount": 11148904, "trashPiecesCount": 5425545, "piecesTrashed": 5425545} 2024-04-28T13:58:39Z INFO lazyfilewalker.gc-filewalker subprocess finished successfully {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"} 2024-04-28T13:59:18Z INFO retain Moved pieces to trash during retain {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 5425545, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 11148904, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "154h41m12.177192812s", "Retain Status": "enabled"} ``` Every one of 10 nodes below except storagenode9 received new bloom while still processing the old one. As you can see all nodes except storagenode9 have high CPU load after that. ```CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS 5bd8f030ec34 storagenode1 37.09% 427.2MiB / 31.25GiB 1.34% 31.2GB / 70.5GB 3.65TB / 17.6GB 72 11a7ff36c845 storagenode2 45.01% 516.2MiB / 31.25GiB 1.61% 267GB / 95.3GB 2.78TB / 175GB 136 68ff2fe420e6 storagenode3 38.60% 686.8MiB / 31.25GiB 2.15% 2.53GB / 57.9GB 3.3TB / 618MB 140 8a387756a84e storagenode4 90.21% 650.1MiB / 31.25GiB 2.03% 146GB / 75.1GB 3.04TB / 97.7GB 121 dc3125d0d3b1 storagenode5 48.91% 441MiB / 31.25GiB 1.38% 126GB / 81.3GB 2.19TB / 76GB 186 e7f99bf82505 storagenode6 97.39% 471.2MiB / 31.25GiB 1.47% 398GB / 98.5GB 3.17TB / 232GB 193 d77731e0486a storagenode7 42.65% 614.5MiB / 31.25GiB 1.92% 513GB / 120GB 3.49TB / 315GB 178 dd657883d73b storagenode8 71.35% 575.2MiB / 31.25GiB 1.80% 113GB / 98.1GB 3.95TB / 66.5GB 132 b3ecab746d68 storagenode9 4.27% 767.6MiB / 31.25GiB 2.40% 313GB / 101GB 2.16TB / 202GB 154 f99a67a846c9 storagenode10 91.78% 443.6MiB / 31.25GiB 1.39% 160GB / 90.1GB 2.54TB / 94.6GB 186 ``` ## Expected results Receiving new bloom filter when old one is still processing should not cause high CPU load. After processing old bloom filter node should start new gc-filewalker caused by new bloom filter. ## Received results Receiving new bloom filter when old one is still processing cause 100% CPU load by storagenode process. After processing old bloom filter new gc-filewalker caused by new bloom filter wont start. ## Possible Fix Right now only possible workaround is restart. After manually restarting the node, it removes the old bloom filter and starts GC for the new one. However, restarting is not an ideal way to fix the problem because some other lazy filewalkers (used-space) will start from scratch. ## Additional information I've seen this problem on many of my nodes and have confirmations from other SNOs. Additional info can be found here: #https://forum.storj.io/t/high-cpu-usage-by-nodes-caused-by-receiving-a-new-bloom-filter-while-still-processing-the-old-one/26007 ## Environment Ubuntu 22.04. Docker. Storj 1.101.3 and 1.102.3 (both versions affected) Both problems are critical and easy to reproduce, i really hope it will be fixed ASAP.

, the retain concurrency is currently set to 1, which means that only one Bloom filter is processed at a time. This has resulted in a backlog of Bloom filters waiting to be processed. It looks like this:

ls config/retain
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1722448799995125000.pb  ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1722880799999530000.pb
pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa-1722802182043487000.pb  v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-1722967199982862000.pb
qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa-1723053598309773000.pb

The current processing order appears to be alphabetical, with the system processing Bloom filters from the same satellite in sequence. For example, after processing the first Bloom filter for the Saltlake satellite for a week, the system immediately started processing the second Bloom filter for the same satellite, leaving out Bloom filters from other satellites.
I don’t think this is how it should be. I believe that the system should process Bloom filters in a First-In-First-Out (FIFO) order, based on their arrival dates. According to the last modified = arrival dates the correct processing order should be:

1. pmw (03.08. 11.41)
2. v4w (08.08 04.33)
3. pmw (08.08. 09.46)
4. qst (08.08. 22.33)
5. ukf (10.08. 01.46)

Otherwise the system would always keep processing only Saltlake filters as long as new Saltlake filters are present. This means retaining and trash deletion for the other satellites are effectively blocked.

Furthermore, I think that with concurrency > 1, the system should process Bloom filters from different satellites concurrently, rather than processing multiple filters from the same satellite in parallel. This would also help to prevent the system from getting stuck processing filters from a single satellite, and ensure that filters from other satellites are processed as well in a timely manner.

Alexey · August 10, 2024, 7:33am

I do not think that it’s matter, because all BFs should be processed anyway. If the prepared BF takes too long time, that’s also mean that your node likely has here most of the garbage.

jammerdan · August 10, 2024, 7:43am

I believe it does matter. Satellites are independent and share the same available free space. SNO does not have any influence at what time and how many Bloom filters a satellite operator is going to send.
Also a SNO does not have influence how the satellite is named. I don’t think a satellites BF should be always processed first just because its name starts with a.
If it really does not matter in your opinion then there should be no reason not to switch it to a sort order by date first. You could also make this configurable.

Alexey · August 10, 2024, 7:54am

I do not think that we sort it at all. It’s a default list, returned by the OS.

jammerdan · August 11, 2024, 2:29am

I can imagine this is the case. As originally the Bloom filters didn’t even get stored I am sure nobody thought of a case that they are stacking up and multiple of them waiting to be processed.

The OS has no concept of concurrency, satellites or independency. So it needs to be told which is the correct order to return that list.
This should be fixed together with the concurrency issue to make Bloom filters processed by date and to implement concurrency per satellite rather than processing multiple Bloom filters from a single satellite at the same time.

Alexey · August 11, 2024, 5:05am

I still not sure that we need to spend time on this. All BF should be processed and order doesn’t matter with concurrency 1.
Or do you mean, that processing of two BF for the same satellite with concurrency 2 or higher may be a root cause of hanging the retain process, because they blocks each other?

jammerdan · August 11, 2024, 6:05am

No it’s 2 separate issues. But:
The reason we cannot use concurrency > 1 at the moment is that the node will happen to process 2 or more Bloom filters of the same satellite. The issue describes problems when processing Bloom filter from the same satellite not from different satellites. One reason this happens is also the order of processing when there are more than one Bloom filter for a satellite. That’s why my suggestion could also help a little with this issue. (Not in all cases, but in some)

I don’t know if the issue will be solved in a way that makes it possible to process 2 Bloom filters at the same time. But even if, I would still prefer that with concurrency > 1 different satellites get processed rather than multiple Bloom filters from the same satellite.
But also in that case ordering by modified date has better outcome than alphabetical ordering.

Toyoo · August 11, 2024, 10:53am

Seems like this approach would be helpful for you, so bumping it into this thread. Just wondering how many other nodes are affected—if it’s only yours, then there’s probably not much of value.

Storj does seem to have access to metrics from nodes, but I’m not sure there’s a metric “how many bloom filters are waiting to be processed”. If there was one, they would be able to learn about other nodes.

jammerdan · August 11, 2024, 2:15pm

Here is something that I have just found:

https://review.dev.storj.io/c/storj/storj/+/13081

With concurrency flag set to 1 retain queue will process entries one by one ordering filters with CreatedBefore fields (from oldest to newest value).

Is this relevant? I dont’t know anything about a CreatedBefore field. Maybe I missed that.

There are also some interesting comments. It seems there is some thinking about filtering. But I don’t understand what the result is they came up with.

Alexey · August 12, 2024, 3:50am

It’s ordered by the CreatedBefore, not by a filename.

CreatedBefore it’s a date range (shifted to the past) where BF should operate, without that date the filter may delete newly uploaded pieces (because on the moment of the BF generation these pieces didn’t exist yet).

You may see it in the logs filtered by retain and Prepar.

grep retain /mnt/x/storagenode2/storagenode.log | grep -E "Prepar|Move"

jammerdan · August 12, 2024, 4:00am

If that’s the case then it is even better.
I need to check if I can observe that on my nodes. Until now it appeared to be in alphabetical order.

Obviously the CreatedBefore information is not clearly visible. Where is it written and this would mean the node scans all Bloom filters for this information and based on this decides which one to retain?

Alexey · August 12, 2024, 4:14am

In the BF file, however, you may use logs to see it in the decoded form.

jammerdan · August 12, 2024, 4:56am

But something does not add up. These are the Bloom filters with the modified dates on one of my nodes:

pmw 12.08. 01.24
qst 09.08. 03.53
ukf 11.08. 23.48
v4w 07.08. 23.06

On the file system I can see from the timestamps of the subdirectories that it is currently retaining the v4w BF. It started yesterday at around 23.57. with subdir letter aa as expected.

But I can also see that before that it was processing the ukf BF. From the logs it processed it until yesterday at about 23.55.
But it seems that process got interrupted as it stopped with letters ul while it would be expected that it should be at zz when it has finished. Also the BF for US1 is still there.
Before ul it was processing subdir uk which has a timestamp 11.08 23.33. Then it proceeded to ul which has a timestamp 11.08. 23.55.
The BF for the US satellite has timestamp 11.08.23.48.

And at the moment it is processing EU1 BF with timestamp 07.08. 23.06.

I don’t understand that pattern.

Alexey · August 12, 2024, 5:08am

it’s a newest one for US1 as far as I can see.

it’s a US1 BF. Do you know why it was interrupted? Did you have an error or the node is restarted?

It doesn’t matter. The only matter is a creation date of the piece and what pieces the BF is processing (see the “Created Before” date in the “Prepared” for retain). The date of the folder doesn’t matter.
And since you seems referring to the access date, I would assume that you mounted this drive to use atime, which slow down the filewalkers a little bit on every file. When you have millions of files it become noticeable.

And it’s logical, because this BF has an oldest creation date (and likely the oldest “Created Before” inside).
So the only question, why US1 BF was interrupted. Did you check, maybe it is finished, you need to check for “retain” and “Moved”. And this one for ukf (US1) - is a fresh one, because the oldest one has been processed.

By the way, it’s better to track folders differently:

jammerdan · August 12, 2024, 5:33am

I had to edit my previous post as I had the subdirs in the wrong order.

To clarify:
I see timestamp processing for uk subdir: 11.08. 23.33 and ul subdir: 11.08. 23.55

Yes, the US1 bloom filter is the newest.
The timestamp I am referring to is the modified time. The partition is mounted with noatime.
So when I look at the timestamps of the subdirectories it seems to me that subdir uk must have been processed by a different Bloom filter than the current one with timestamp 11.08. 23.48.
So I am wondering where did it go as it seems it did not finish regularly. Maybe I am wrong but I would expect it to finish with subfolder zz and not stop with uk.

Probably due to some timeout.

Alexey · August 12, 2024, 5:46am

You may be sure that it’s not finished, if you have Prepared, but not Moved for the retain process and the particular satellite.
It’s possible that there simply weren’t any pieces found to move from those folders, that’s all.

then you would have this error, it will be retain and context canceled or any other error during the way.

jammerdan · August 12, 2024, 5:51am

Logs got wiped due to restart so I cannot verify the exact reason unfortunately.

Alexey · August 12, 2024, 5:53am

Then it’s likely the reason of the interruption, I can guess.

jammerdan · August 12, 2024, 6:10am

In the meantime 2 BF have finished processing.

I am seeing now following BFs left:

pmw 12.08. 01.24
ukf 11.08. 23.48

Dates are modified time.
It is currently processing the pmw Bloom filter:

2024-08-12T05:48:59Z    INFO    retain  Prepared to run a Retain request.     {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-08-06T17:59:59Z", "Filter Size": 765186, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}

So is "Created Before": "2024-08-06T17:59:59Z" the date the Bloom filter has been created? If so, why does it have a modified date 12.08. assuming that this is the date it arrived. That would mean it took almost a week after creation until it finally was received by the node. This is long.
Also this would mean, that the ukf Bloom filter must have a later creation date, correct? Because the pmw Bloom filter is getting processed first. So the ukf filter was created later and arrived earlier on the node?

Alexey · August 12, 2024, 6:59am

Only this date is matter:

You may see when the BF has been generated by the Linux epoch in the filename, but remember, you need to use only first 10 numbers.