Before the 4th of July break my nodes ran as usual; Storj paused the test data for the holiday, and after they resumed the tests my node ran as usual for the first few days. But now the CPU almost hits 100% 24/7 (real 100%, not 100% I/O wait). Does anybody have the same “issue”? At my parents’ house my second node is running as usual. Is there something wrong with my system? Almost all of the demanding processes seem to be Storj.
The funny thing is that it maxes out only one core.
I just saw it in htop. One of the nodes is using one of the 8 cores at 100%, but the overall load is 12% (there are three nodes).
It doesn’t really describe how the race condition ends up burning a full core when two retain runs compete, but this seems to be the intended fix in 1.108:
I’ll update my config with retain.concurrency: 1 - that should fix it for nodes not yet on 1.108.
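In config.yaml that is a single line; a minimal sketch (the key and its default of 5 come from this thread, the comments are mine):

```yaml
# config.yaml of the storagenode
# limit garbage-collection retain to one worker (default is 5), then restart the node
retain.concurrency: 1
```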
How can I upgrade all of my nodes to the latest version? My image tag in docker compose is ":latest", but when redeploying I’m still at 1.105. Unfortunately the link for the /25 subnet filter explanation doesn’t work (Error 404). What does that mean? Is traffic now being shared by /25 instead of /24 subnets?
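For reference, the relevant compose excerpt looks roughly like this (a sketch; the service name is an assumption):

```yaml
# docker-compose.yml (excerpt) - the tag reads ":latest"
services:
  storagenode:
    image: storjlabs/storagenode:latest
```

As far as I understand, the image’s built-in updater follows Storj’s staged rollout even with :latest, which might explain why redeploying still shows 1.105.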
Did you set this in the config file, or in the docker command? If this “issue” existed before, why does it only cause problems now and not earlier? The high CPU load came out of the blue.
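For context, I assume the two variants would look roughly like this (container name and the other run arguments are placeholders):

```bash
# variant 1: edit config.yaml in the storage location, then restart the container
#   retain.concurrency: 1
# variant 2: append the flag after the image name so it is passed to the storagenode process
docker run -d --name storagenode ... storjlabs/storagenode:latest --retain.concurrency=1
```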
This happens because one of the GC threads actually hangs (it enters an infinite, non-productive loop).
And since it’s only one thread, it can take up at most one core - but it does so at 100% load non-stop.
In my case it did not even respond to a request to restart the node - all other threads shut down correctly, but there was one that did not respond to commands. /mon/ps showed that this thread is related to the garbage collector.
So I even had to kill the node process to restart it - I waited for more than an hour and this last thread never finished. It did NOT perform any disk operations, it just kept loading one CPU core at 100% non-stop.
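If anyone wants to check this on their own node, the running spans can be listed via the debug endpoint, roughly like this (a sketch; the address depends on your debug.addr setting, and the port here is a placeholder):

```bash
# list the currently running monkit spans and look for the retain/GC worker
curl -s http://127.0.0.1:5999/mon/ps | grep -i retain
```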
I have also seen this situation several times on my nodes, and I can confirm that it has always been associated with attempts to process several Bloom filters for the same satellite - situations where the next one was received before processing of the previous one had completed.
The config change to retain.concurrency: 1 seems to have fixed it for me too, without a software update (my larger nodes are still on v1.105.4).
The most likely answer is in my previous paragraph: it didn’t show up before because usually the GC managed to complete its work before it received a new BF for the same satellite.
The growth of node sizes (in terms of the number of stored files) and the high network load of the last two months triggered this previously unnoticed bug.
Super happy fun time! … Imagine the trampling of feet in the mosh pit now - didn’t I just notice three bi-daily SLC Bloom filters in a row lately? Good times ahead! Hoping the potential avalanche of TTL fall-off data doesn’t scare too many peeps, or give Alexey a heart attack from spoon-feeding too many newbies.
Looks like another nice catch, M_M! Keep up the excellent analysis.
Just do not touch it; it will eventually be updated to the proper version. Or do you want to be an alpha tester? If so, there are methods, but I won’t teach you how to bypass the version control, sorry.
I have no idea what you are talking about. Could you please elaborate - what third-party tool have you used?
No problem at all. There are always newbies; I will point them either to the documentation or to a post on the forum. If we do not have one, I will explain it one more time.
406c2c3 nodeselection: support subnet filter with any bit size (/25)
I wanted to know what that means; the link to GitHub doesn’t work, and I don’t know what you mean by a third-party tool.
Just do not touch it; it will eventually be updated to the proper version. Or do you want to be an alpha tester? If so, there are methods, but I won’t teach you how to bypass the version control, sorry.
So you mean I should “fix it” by editing the config.yaml and setting the value from 5 to 1 myself?