The storagenode gets data from the network and stores it… if the disk access time is to high, the excess data will have to be stored somewhere… and thus the ingress data will be stored in memory until the storagenode can get time to write it to disk.
these are some examples of what that will look like…
keep in mind, depending on the graph model not all methods will show the cpu utilization as 100%, but this isn’t real cpu utilization either, it’s IOwait, which is when the cpu is waiting for data from a drive before it continues, which can slow a system to a crawl.
it’s clear that my storagenode memory utilization is running away and it will keep consuming more and more memory until something either goes wrong or until the available hdd bandwidth and or IO exceeds the storagenode requirements.
How to solve the IOwait…
There are a few options… the easy option is shutdown stuff you don’t need using bandwidth or IO on the storagenode hdd…
another option could be to move your database to another drive… not sure how much performance that actually gives tho…
my preferred solution when a software fix isn’t an option because of hardware limitation is the addition of more storagenodes… since data is evenly distributed across the nodes on a network.
then getting an extra hdd and adding an additional node will decrease the load on one node by 50%
There should never be more than one node on one hdd!
After the issue have been dealt with … in my case a local VM i’ve had some trouble with… seems it may interfere with netdata as i think it started just as i restarted my netdata…
both are running some advanced semi diy scripts, so it’s possible some of their coding is similar or whatever…
I digress…
The VM’s was all killed also seemed to have lost access to most of them, maybe because of the IOwait thing or because i set a significant swap allowance on them…
cpu utilization is back in 5-10% or so…
the ram usage is now erratic but the curve has been flattened and ram usage is slowly dropping… will give it a day and see if it’s gone back to normal…
a node restart might be required to purge the excess cache created…
and 5 hours later the memory usage is now starting to be closer to normal,
even if still slightly elevated…