I found what is triggering the issue and have a workaround.
However, I believe the root cause may be a bug in storagenode. Here is why.
What was happening
When the daily security check runs, it goes over all files on all filesystems, checking for setuid bits that have changed. This executes find, which churns through a very large number of files (or perhaps just their metadata) very quickly. While this is going on, the CPU use of the storagenode process increases dramatically.
Workaround
Disable the daily security check and kill any related processes that are currently running.
Why this may be a bug in storagenode
The overall IO rate of the security check was quite low, and while it was running the system was not IO starved. I could still run large file IO and network transfers, with lower performance of course, but at rates 3-8 times what storagenode would be doing.
If a process is starved for IO, one would expect it to spend more time waiting for read/write calls to return, maybe sleeping in between, or using select/poll to be notified when it can perform further IO. In that case the process would simply have much lower performance, and its CPU load would not go up (if anything it would go down). For storagenode that would translate to a lower disk and network IO rate and a lot more idle time (sleeping, or blocked in system calls).
However, this is not what was observed. What was observed is consistent with busy waiting. The storagenode process was constantly in the “uwait” process state. It looks as if, instead of pulling back, moderating bandwidth, and sleeping more, storagenode loops at a very fast rate on system calls that fail or return a retry error, or just polls on some event or mutex, thereby consuming a lot of CPU while effectively doing nothing (which is also what was observed: lower performance, lower overall IO, more missed uploads and downloads).
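To illustrate the difference between the two behaviours described above, here is a minimal Python sketch. It is not storagenode code and says nothing about how storagenode is actually implemented; the function names (cpu_used_while_waiting, blocking_wait, busy_wait) and the 0.5 second delay are made up for the example. A timer stands in for a slow IO completion, and the script measures how much CPU time the process burns while waiting for it, once with a blocking wait and once with a tight polling loop.

```python
import threading
import time


def cpu_used_while_waiting(wait_fn, delay=0.5):
    """Return the CPU seconds this process consumed while wait_fn waited."""
    ready = threading.Event()
    # Stand-in for a slow IO completion: the event is set after `delay` seconds.
    timer = threading.Timer(delay, ready.set)
    start_cpu = time.process_time()
    timer.start()
    wait_fn(ready)
    timer.join()
    return time.process_time() - start_cpu


def blocking_wait(ready):
    # The thread sleeps until it is notified, so it uses almost no CPU.
    ready.wait()


def busy_wait(ready):
    # The thread polls in a tight loop, burning CPU while making no progress.
    while not ready.is_set():
        pass


if __name__ == "__main__":
    print(f"blocking wait: {cpu_used_while_waiting(blocking_wait):.3f} s CPU")
    print(f"busy wait:     {cpu_used_while_waiting(busy_wait):.3f} s CPU")
```

Both variants wait for the same amount of wall-clock time and get the same amount of work done (none), but the polling variant should show a CPU time close to the full delay while the blocking one stays near zero. The first pattern is what I would expect from a process that is merely slowed down by IO contention; the second matches what I saw.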
I think this warrants further investigation. It might not be visible in all or most setups, or on every operating system, but that may just be because the right conditions to trigger it have not been met.
Of course, it might be an issue with the libraries used by storagenode, or a porting issue specific to FreeBSD. I have no idea at this point. What I am sure of is this: it is not normal behavior, and it is clearly triggered by something that should, at the very least, only lead to some relative performance degradation, not peg the CPU as observed.