It’s written in Go, so no dynamic linking. It’s always statically linked at build time.
It’s been on the filesystem for a long time, too.
Hmm, what is your point? I said nothing about dynamic linking.
There could be a bug in the libraries used by storagenode; whether the linking is static or dynamic is irrelevant. I am just saying the bug could be in code that is not part of the storagenode codebase.
Perhaps. But since only your setup has this issue, I’m not convinced yet that there is a bug.
Did you try this command?
Did it produce the same behavior with a high CPU usage?
Yes. Originally I identified the security scripts that walk through the whole filesystem as the cause of the high CPU usage. This command basically does the same thing.
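For reference, the kind of whole-filesystem walk I mean can be reproduced with something like this (a sketch: the path and options here are illustrative, the actual security scripts differ, and they start from `/`):

```shell
#!/bin/sh
# Walk a tree and stat every file, mimicking what the security scans
# (and the filewalker) do. Using /usr/share here so it finishes quickly;
# the real scans start from / and touch every inode on the pool.
find /usr/share -xdev -type f -print0 2>/dev/null \
  | xargs -0 ls -ln >/dev/null 2>&1
echo "traversal finished"
```

Running something like this alongside storagenode is enough to trigger the CPU spike on my setup.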
And the filewalker does the same as well.
I don’t know where the bug is, but to me it’s clearly a bug. Storagenode should not be so sensitive to other processes accessing files, especially not by increasing its own CPU use. It makes no sense at all. I won’t rehash my points from my posts above, but I really do not see how a simple filesystem traversal from a different process causing storagenode to spike CPU is defensible.
How is it related to storagenode, if the same thing on the OS level produces the same behavior?
We would not patch your OS to perform better, sorry.
However, you likely can limit the CPU available to the jail (I’m not sure about this; I do not have a FreeBSD setup to check).
I also wonder why your setup behaves differently from @arrogantrabbit’s setup, though.
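If FreeBSD resource accounting is enabled, something like `rctl` might do it. A sketch only, since I cannot verify it; the jail name `storj` is an assumption, so replace it with yours:

```shell
# Cap the jail's total CPU at ~200% of one core (FreeBSD rctl).
# Requires kern.racct.enable=1 in /boot/loader.conf and a reboot.
# "storj" is an assumed jail name -- replace with yours.
rctl -a 'jail:storj:pcpu:deny=200'
# List active rules to verify:
rctl | grep storj
```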
You misunderstood me.
Running a file traversal of the whole filesystem does not adversely affect anything else, not even other processes which, BTW, have purposes and behavior similar to storagenode’s: heavy bandwidth, heavy IO, disk and network.
If I run those file-traversal jobs, or the one-liner above, while storagenode is not running, nothing is adversely affected. It is only storagenode, when running at the same time, that suddenly spikes to 1000% CPU. And it is the storagenode process that takes that CPU, not any other process.
Storagenode uniquely bogs down the CPU when some other, unrelated process does a whole-filesystem traversal. The pattern observed so far is that of busy waiting somewhere on system calls. Only storagenode is affected, not other processes.
That is clearly an issue with storagenode, not the OS.
But I am open to explanations of how that is actually an OS issue, or how it’s actually perfectly OK for a program to behave the way storagenode does. So far I see none.
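One way to check the busy-waiting theory directly (a sketch: `truss` is FreeBSD’s syscall tracer, and the `pgrep` invocation assumes the process is named `storagenode`):

```shell
# Attach to the running storagenode and count syscalls until Ctrl-C.
# A busy wait would show up as an enormous count of the same syscall
# (e.g. nanosleep, kevent, or sched_yield) repeated in a tight loop.
truss -c -p "$(pgrep -n storagenode)"
```

If the per-syscall counts climb into the millions while the node is "doing nothing", that would back up the busy-wait pattern.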
It does not. He actually sees the same behavior: he mentioned that on his setup he sees very high CPU use for 10 minutes at node startup. The reason it only lasts 10 minutes is that he has an SSD cache for metadata, so that cache already holds the metadata being accessed and the whole initial scan lasts much less time. For me, without an SSD cache, all those accesses go to the disk first, which has far lower IOPS, so it lasts forever. He also mentioned that some nodes have heavy IO for days; I bet CPU use is high during that time as well.
Oh, this one sounds like a bug. But I honestly do not understand why it is happening under FreeBSD. I have done the same thing many times on Windows/Linux and never once noticed such weird behavior.
And also, why is it not happening on @arrogantrabbit’s setup?
I don’t know; that is why I mentioned it might not be a bug in storagenode itself but in some library it uses. I can easily imagine that some base Go code has to be ported differently to different platforms, and that some of that code ends up busy waiting or similar. Maybe the conditions are just right for storagenode’s usage pattern to reveal it. Who knows.
No idea, just speculation at this point.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We use this build process for FreeBSD:
BTW, I restarted storagenode yesterday with these settings:
storage2.piece-scan-on-startup: true
pieces.enable-lazy-filewalker: false
My idea was to force it to run the filewalker process at startup, as fast as possible and without throttling IO, and see whether CPU use would fall off a cliff afterwards.
It seems to have just finished the filewalk. CPU use for storagenode went back to a normal, very light load.
I’ll run more experiments to confirm, but so far I think my theory is right. The reason why? No idea. I am going to set up another node to keep investigating this.
Why I think this is also quite important from an economic point of view is that as soon as the CPU load went back to normal:
Don’t have much to contribute, but I just wanted to mention that I’m seeing the same issue with docker + zfs: it looks like high CPU usage (though my disks are absolutely IOPS-constrained; I’m running 2 nodes on a raidz1, so write IOPS are struggling).
Out of curiosity, @brainstorm, are you also IOPS-constrained? Wondering if it has to do with that. It shouldn’t consume CPU cycles, though; the scheduler should deschedule a process that is waiting on IO, so I’m not totally sure. Might be worth profiling.
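If anyone wants to profile it, something along these lines should work (a sketch: it assumes the node exposes Go’s standard pprof handlers via a debug endpoint, and `127.0.0.1:6060` is an assumed address, so adjust to however your node is configured):

```shell
# Grab a 30-second CPU profile from the node's pprof endpoint.
# Assumes the node exposes /debug/pprof on 127.0.0.1:6060 -- an
# assumption; check how your storagenode exposes its debug address.
curl -s -o cpu.pprof 'http://127.0.0.1:6060/debug/pprof/profile?seconds=30'
# Show the hottest functions from the captured profile:
go tool pprof -top cpu.pprof
```

A profile taken during one of these spikes would show whether the time is going into syscalls, the Go runtime, or storagenode’s own code.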
I believe that it’s related to the 1.105.x release:
Possible. I am currently on v1.108.3 and not seeing this issue anymore.
I do not have an SSD for the log/cache etc., but these are fast SAS drives: 300-500 IOPS.
I do not care if that part is slow; it just should not spike the CPU while effectively doing nothing.