It’s written in Go, so no dynamic linking. It’s always statically linked at build time.
It’s been on the filesystem for a long time, too.
Hmm, what is your point? I said nothing about dynamic linking.
There could be a bug in the libraries used by storagenode; whether the linking is static or dynamic is irrelevant. I am just saying the bug could be in code that is not part of the storagenode codebase.
Perhaps. But since only your setup has this issue, I’m not convinced yet that there is a bug.
Did you try this command?
Did it produce the same behavior with a high CPU usage?
Yes. Originally I identified the security scripts that walk through the whole filesystem as the cause of the high CPU usage. This command basically does the same thing.
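For reference, the kind of whole-filesystem walk I mean can be reproduced with something like this (a sketch: the path and options here are illustrative, the actual security scripts differ, and they start from `/`):

```shell
#!/bin/sh
# Walk a tree and stat every file, mimicking what the security scans
# (and the filewalker) do. Using /usr/share here so it finishes quickly;
# the real scans start from / and touch every inode on the pool.
find /usr/share -xdev -type f -print0 2>/dev/null \
  | xargs -0 ls -ln >/dev/null 2>&1
echo "traversal finished"
```

Running something like this alongside storagenode is enough to trigger the CPU spike on my setup.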
And the filewalker does the same as well.
I don’t know where the bug is, but to me it’s clearly a bug. Storagenode should not be so sensitive to other processes accessing files, especially not by increasing its own CPU use. It makes no sense at all. I won’t rehash my points from my posts above, but I really do not see how a simple filesystem traversal from a different process causing storagenode to spike CPU is defensible.
How is it related to storagenode, if the same thing on the OS level produces the same behavior?
We would not patch your OS to perform better, sorry.
However, you likely can limit the CPU available to the jail (I’m not sure about this; I do not have a FreeBSD setup to check).
I also wonder why your setup behaves differently from @arrogantrabbit’s setup, though.
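If FreeBSD resource accounting is enabled, something like `rctl` might do it. A sketch only, since I cannot verify it; the jail name `storj` is an assumption, so replace it with yours:

```shell
# Cap the jail's total CPU at ~200% of one core (FreeBSD rctl).
# Requires kern.racct.enable=1 in /boot/loader.conf and a reboot.
# "storj" is an assumed jail name -- replace with yours.
rctl -a 'jail:storj:pcpu:deny=200'
# List active rules to verify:
rctl | grep storj
```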
You misunderstood me.
Running a file traversal of the whole filesystem does not adversely affect anything else, not even other processes which, BTW, have purposes and behavior similar to storagenode’s: heavy bandwidth, heavy IO, disk and network.
If I run those file-traversal jobs, or the one-liner above, while storagenode is not running, nothing is adversely affected. It is only storagenode, when running at the same time, that suddenly spikes to 1000% CPU. And it is the storagenode process that takes that CPU, not any other process.
Storagenode uniquely bogs down the CPU when some other, unrelated process does a whole-filesystem traversal. The pattern observed so far is that of busy waiting somewhere on system calls. Only storagenode is affected, not other processes.
That is clearly an issue with storagenode, not the OS.
But I am open to explanations of how that is actually an OS issue, or how it’s actually perfectly OK for a program to behave the way storagenode does. So far I see none.
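One way to check the busy-waiting theory directly (a sketch: `truss` is FreeBSD’s syscall tracer, and the `pgrep` invocation assumes the process is named `storagenode`):

```shell
# Attach to the running storagenode and count syscalls until Ctrl-C.
# A busy wait would show up as an enormous count of the same syscall
# (e.g. nanosleep, kevent, or sched_yield) repeated in a tight loop.
truss -c -p "$(pgrep -n storagenode)"
```

If the per-syscall counts climb into the millions while the node is "doing nothing", that would back up the busy-wait pattern.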
It does not. He actually sees the same behavior: he mentioned that on his setup he sees very high CPU use for 10 minutes at node startup. The reason it only lasts 10 minutes is that he has an SSD cache for metadata, so that cache already holds the metadata being accessed and the whole initial scan lasts much less time. For me, without an SSD cache, all those accesses go to the disk first, which has far lower IOPS, so it lasts forever. He also mentioned that some nodes have heavy IO for days; I bet CPU use is high during that time as well.
Oh, this one sounds like a bug. But I honestly do not understand why it is happening under FreeBSD. I have done the same thing many times on Windows/Linux and never once noticed such weird behavior.
And also, why is it not happening on @arrogantrabbit’s setup?
I don’t know; that is why I mentioned it might not be a bug in storagenode itself but in some library it uses. I can easily imagine that some base Go code has to be ported differently to different platforms, and that some of that code ends up busy waiting or similar. Maybe the conditions are just right for storagenode’s usage pattern to reveal it. Who knows.
No idea, just speculation at this point.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We use this build process for FreeBSD:
BTW, I restarted storagenode yesterday with these settings:
storage2.piece-scan-on-startup: true
pieces.enable-lazy-filewalker: false
My idea was to force it to run the filewalker process at startup, as fast as possible and without throttling IO, and see whether CPU use would fall off a cliff afterwards.
It seems to have just finished the filewalk. CPU use for storagenode went back to a normal, very light load.
I’ll run more experiments to confirm, but so far I think my theory is right. The reason why? No idea. I am going to set up another node to keep investigating this.
Why I think this is also quite important from an economic point of view is that as soon as the CPU load went back to normal:
Don’t have much to contribute, but I just wanted to mention that I’m seeing the same issue with docker + zfs: it looks like high CPU usage (though my disks are absolutely IOPS-constrained; I’m running 2 nodes on a raidz1, so write IOPS are struggling).
Out of curiosity, @brainstorm, are you also IOPS-constrained? Wondering if it has to do with that. It shouldn’t consume CPU cycles, though; the scheduler should deschedule a process that is waiting on IO, so I’m not totally sure. Might be worth profiling.
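If anyone wants to profile it, something along these lines should work (a sketch: it assumes the node exposes Go’s standard pprof handlers via a debug endpoint, and `127.0.0.1:6060` is an assumed address, so adjust to however your node is configured):

```shell
# Grab a 30-second CPU profile from the node's pprof endpoint.
# Assumes the node exposes /debug/pprof on 127.0.0.1:6060 -- an
# assumption; check how your storagenode exposes its debug address.
curl -s -o cpu.pprof 'http://127.0.0.1:6060/debug/pprof/profile?seconds=30'
# Show the hottest functions from the captured profile:
go tool pprof -top cpu.pprof
```

A profile taken during one of these spikes would show whether the time is going into syscalls, the Go runtime, or storagenode’s own code.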
I believe that it’s related to the 1.105.x release:
Possible. I am currently on v1.108.3 and not seeing this issue anymore.
I do not have an SSD for the log/cache etc., but these are fast SAS drives: 300-500 IOPS.
I do not care if that part is slow; it just should not spike the CPU while effectively doing nothing.