High CPU load caused by Storj containers

Mad_Max · July 18, 2024, 10:30pm

This happens because one of the GC threads actually hangs (enters an infinite, non-productive loop).
And since it’s only one thread, it can take up at most one core. But it keeps that core at 100% load non-stop.

In my case, the node did not even respond to restart requests - all other threads shut down correctly, but one did not respond to commands. /mon/ps showed that this thread belongs to the garbage collector.
So I had to kill the node process to restart it: I waited for more than an hour and this last thread never finished. It did NOT perform any disk operations; it just kept loading one CPU core at 100% non-stop.

I have also seen this situation several times on my nodes, and I can confirm that it has always been associated with attempts to process several Bloom Filters for the same satellite, i.e. situations where the next one was received before processing of the previous one had completed.

Changing the config to retain.concurrency: 1 seems to have fixed it for me too, without a software update (my larger nodes are still on v1.105.4).
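
For reference, a minimal sketch of that change, assuming the setting is placed in the node's config.yaml (where that file lives depends on your setup, and the comments are my own wording, not taken from the shipped config):

```yaml
# config.yaml of the storage node (path depends on your installation)
# Assumption: this limits the node to one retain (GC / Bloom Filter) job at a time,
# so a second Bloom Filter for the same satellite is not processed in parallel.
retain.concurrency: 1
```

As far as I know, the node has to be restarted for the new value to take effect.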

The most likely answer follows from the previous paragraph: it didn’t show up before because the GC usually managed to complete its work before it received a new BF for the same satellite.
The growth of node sizes (in terms of the number of stored files) and the high network load of the last two months served as a trigger for a previously unnoticed bug.