Before the 4th of July break my nodes ran as usual; Storj paused the test data for the holiday, and after they resumed the tests my node ran as usual for the first few days. But now the CPU almost hits 100% 24/7 (real 100%, not 100% I/O wait). Does anybody have the same “issue”? At my parents’ house my second node is running as usual. Is there something wrong with my system? Almost all of the demanding processes seem to be Storj.
The funny thing is that it maxes out only one core.
I just saw it in htop. One of the nodes is using one of the 8 cores at 100%, but the overall load is 12% (there are three nodes).
It doesn’t really describe how the race condition ends up burning a full core when two retain runs compete, but this seems to be the intended fix in 1.108:
I’ll update my config with retain.concurrency: 1 - that should fix it for nodes not yet on 1.108.
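In config.yaml that is a single line; a minimal sketch (the key and its default of 5 come from this thread, the comments are mine):

```yaml
# config.yaml of the storagenode
# limit garbage-collection retain to one worker (default is 5), then restart the node
retain.concurrency: 1
```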
How can I upgrade all of my nodes to the latest version? My image tag in docker compose is ":latest", but when redeploying I’m still at 1.105. Unfortunately the link for the /25 subnet filter explanation doesn’t work (Error 404). What does that mean? Is traffic now being shared by /25 instead of /24 subnets?
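For reference, the relevant compose excerpt looks roughly like this (a sketch; the service name is an assumption):

```yaml
# docker-compose.yml (excerpt) - the tag reads ":latest"
services:
  storagenode:
    image: storjlabs/storagenode:latest
```

As far as I understand, the image’s built-in updater follows Storj’s staged rollout even with :latest, which might explain why redeploying still shows 1.105.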
Did you set this in the config file, or in the docker command? If this “issue” existed before, why does it only cause problems now and not earlier? The high CPU load came out of the blue.
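For context, I assume the two variants would look roughly like this (container name and the other run arguments are placeholders):

```bash
# variant 1: edit config.yaml in the storage location, then restart the container
#   retain.concurrency: 1
# variant 2: append the flag after the image name so it is passed to the storagenode process
docker run -d --name storagenode ... storjlabs/storagenode:latest --retain.concurrency=1
```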
This happens because one of the GC threads actually hangs (it enters an infinite, non-productive loop).
And since it’s only one thread, it can take up at most one core - but it does so at 100% load non-stop.
In my case it did not even respond to a request to restart the node - all other threads shut down correctly, but there was one that did not respond to commands. /mon/ps showed that this thread is related to the garbage collector.
So I even had to kill the node process to restart it - I waited for more than an hour and this last thread never finished. It did NOT perform any disk operations, it just kept loading one CPU core at 100% non-stop.
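If anyone wants to check this on their own node, the running spans can be listed via the debug endpoint, roughly like this (a sketch; the address depends on your debug.addr setting, and the port here is a placeholder):

```bash
# list the currently running monkit spans and look for the retain/GC worker
curl -s http://127.0.0.1:5999/mon/ps | grep -i retain
```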
I have also seen this situation several times on my nodes, and I can confirm that it has always been associated with attempts to process several Bloom filters for the same satellite - situations where the next one was received before processing of the previous one had completed.
The config change to retain.concurrency: 1 seems to have fixed it for me too, without a software update (my larger nodes are still on v1.105.4).
The most likely answer is in my previous paragraph: it didn’t show up before because usually the GC managed to complete its work before it received a new BF for the same satellite.
The growth of node sizes (in terms of the number of stored files) and the high network load of the last two months triggered this previously unnoticed bug.
Super happy fun time! … Imagine the trampling of feet in the mosh pit now - didn’t I just notice three bi-daily SLC Bloom filters in a row lately? Good times ahead! Hoping the potential avalanche of TTL fall-off data doesn’t scare too many peeps, or give Alexey a heart attack from spoon-feeding too many newbies.
Looks like another nice catch, M_M! Keep up the excellent analysis.
Just do not touch it; it will eventually be updated to the proper version. Or do you want to be an alpha tester? If so, there are methods, but I won’t teach you how to bypass the version control, sorry.
I have no idea what you are talking about. Could you please elaborate - what third-party tool have you used?
No problem at all. There are always newbies; I will point them either to the documentation or to a post on the forum. If we do not have one, I will explain it one more time.
406c2c3 nodeselection: support subnet filter with any bit size (/25)
I wanted to know what that means; the link to GitHub doesn’t work, and I don’t know what you mean by a third-party tool.
Just do not touch it; it will eventually be updated to the proper version. Or do you want to be an alpha tester? If so, there are methods, but I won’t teach you how to bypass the version control, sorry.
So you mean I should “fix it” by editing the config.yaml and setting the value from 5 to 1 myself?