atop can show disk utilization and bandwidth btw. I like it.
it doesn't show IO wait though… Hm… I'm not an expert on FreeBSD, but it looks like the uwait state on the storagenode process does mean exactly IO wait, so all that WCPU is being used to perform some IO operations. Perhaps zfs?
I can only invite @arrogantrabbit, maybe they know better.
Well, it does not make sense that it would be IO operations: I can see that at those times read/write on the zfs pool is rather low, 3-5 MB/s, and network IO is low as well (5-20 Mb/s). (sysutils/ioztat shows it well.)
During last week's tests, I was getting continuous 15-20 MB/s disk IO and 150 Mb/s network, all with a 0.02 CPU load.
I agree that since all the extra CPU load is "system", that must mean the storagenode process is causing some kind of churn in system processes.
Yeah, but my nodes don't have high CPU usage under any load…
However, they are on a worse OS: Windows. One is a Windows service and the other two run in Docker Desktop for Windows (on WSL2, so still a Linux VM)…
But no RAID at all. So, for me, that's the only explanation.
Right, I think it's totally not normal. It was going fine with no load at all during that whole week of tests. But nothing changed in that system, so I wonder what it could be. Still digging…
top in io mode does not even look bad
CPU: 2.8% user, 0.0% nice, 61.9% system, 0.0% interrupt, 35.3% idle
Mem: 1378M Active, 304G Inact, 1423M Laundry, 59G Wired, 229M Buf, 8304M Free
ARC: 33G Total, 13G MFU, 6385M MRU, 26M Anon, 396M Header, 13G Other
5425M Compressed, 23G Uncompressed, 4.26:1 Ratio
Swap: 4096M Total, 206M Used, 3890M Free, 5% Inuse
PID USERNAME VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND
7390 storagenod 11479 1648 76 364 0 440 42.07% storagenode
63355 88 3790138 142 4 98 0 102 9.75% mysqld
73892 root 198 333 103 0 0 103 9.85% find
92235 storagenod 393 175 5 0 0 5 0.48% storagenode
25940 hostd 1255 73 2 243 0 245 23.42% hostd
12952 zabbix 62 12 0 0 0 0 0.00% zabbix_server
12965 zabbix 21 0 0 0 0 0 0.00% zabbix_server
7389 root 145 0 0 150 0 150 14.34% daemon
88576 bruno 2 1 0 0 0 0 0.00% top
11729 zabbix 6 0 0 0 0 0 0.00% zabbix_agentd
12992 zabbix 16 0 0 0 0 0 0.00% zabbix_server
12993 zabbix 12 0 0 0 0 0 0.00% zabbix_server
12962 zabbix 12 0 0 0 0 0 0.00% zabbix_server
12954 zabbix 31 0 0 0 0 0 0.00% zabbix_server
12944 zabbix 19 0 0 0 0 0 0.00% zabbix_server
12963 zabbix 22 0 0 0 0 0 0.00% zabbix_server
12983 zabbix 12 0 0 0 0 0 0.00% zabbix_server
12996 zabbix 4 0 0 0 0 0 0.00% zabbix_server
11720 zabbix 2 0 0 0 0 0 0.00% zabbix_agentd
12966 zabbix 6 0 0 0 0 0 0.00% zabbix_server
26509 zabbix 2 0 0 0 0 0 0.00% zabbix_agentd
A few avenues:
- top in io mode (run top, then press m), or run top -m io; it shows IO per process
- iostat -w 1 -n 10
- There are all sorts of pre-made dtrace scripts in /usr/share/dtrace/; for example, /usr/share/dtrace/disklatency may be what you need. The whole script is literally a paragraph's worth of text, so you can modify it to your heart's content. For example, you might want to adjust the quantization if you have very fast disks. Here is disklatency:
#pragma D option quiet
#pragma D option dynvarsize=16m

dtrace:::BEGIN
{
    printf("Tracing... Hit Ctrl-C to end.\n");
}

io:::start
{
    start_time[arg0] = timestamp;
}

io:::done
/this->start = start_time[arg0]/
{
    this->delta = (timestamp - this->start) / 1000;
    @q[args[1]->device_name, args[1]->unit_number] = lquantize(this->delta, 4000, 80000, 4000);
    @max[args[1]->device_name, args[1]->unit_number] = max(this->delta);
    @avg[args[1]->device_name, args[1]->unit_number] = avg(this->delta);
    @stddev[args[1]->device_name, args[1]->unit_number] = stddev(this->delta);
    start_time[arg0] = 0;
}

tick-10s
{
    printa(" %s (%d), us:\n%@d\n", @q);
    printa("max%s (%d), us:\n%@d\n", @max);
    printa("avg%s (%d), us:\n%@d\n", @avg);
    printa("stddev%s (%d), us:\n%@d\n", @stddev);
    clear(@q);
    clear(@max);
    clear(@avg);
    clear(@stddev);
}
I can't recommend this enough: DTrace Tools. If you want to learn about the system and what's going on there, dtrace is your friend.
Thanks, I forgot about DTrace… time to learn a bit more about it.
top in io mode does not look bad to me, I mean there is relatively low activity.
I wonder if storagenode is busy waiting on something
I found I had messed up the zfs dataset mountpoints in the jail, and both the databases and the data were on the same dataset. Still, I do not think that would account for the huge CPU load.
That said, after I restarted the jail with the proper config, CPU load was back to 0.06… with good disk IO and network bandwidth (100 Mb/s). That stayed almost the same for about 22 hours, slowly climbing to about 0.09. Then, over 3 hours, it climbed back to 0.8-1.0 CPU load. It's like there is a slow "CPU leak" that suddenly goes boom.
I found what is triggering the issue and have a workaround.
However, I believe the root cause may be a bug in storagenode; here is why.
What was happening
When the daily security check runs, it goes over all files and filesystems checking for setuid bits that have changed. This runs find, which churns through a very high number of files (or just their metadata?) very fast. This makes the storagenode process increase its CPU use dramatically.
Workaround
Disable the daily security check and kill the currently running related processes.
Why this may be a bug in storagenode
The overall IO rate of the security check was quite low. While it was running, the system was not IO starved: I could run big file IO and network transfers, with lower performance of course, but still at rates 3-8 times what storagenode would be doing.
If starved for IO, one would expect any process to spend more time waiting for read/write calls to return, maybe sleeping in between, or using select/poll to be notified when to perform further IO. In that case, the process would simply have much lower performance and its CPU load would not rise (if anything, it would drop). For storagenode that would translate to lower disk and network IO rates, and a lot more idle time (sleeping, or blocked on system calls).
However, this is not what is observed. What was observed is consistent with busy waiting: the storagenode process was constantly in the "uwait" process state. It is as if, instead of pulling back, moderating bandwidth, and sleeping more, storagenode loops at a very fast rate on system calls that fail or return a retry error, or just polls on some event or mutex, thereby consuming a lot of CPU while effectively doing nothing (which is also observed: lower performance, lower overall IO, more missed uploads and downloads).
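To make that distinction concrete, here is a minimal Go sketch (purely illustrative, not taken from storagenode) of the two behaviors: blocking on IO costs essentially no CPU while waiting, whereas a tight retry/poll loop pegs a core while making no progress.

package main

import (
	"io"
	"os"
	"time"
)

// blockingCopy waits inside the read/write syscalls; if the disk is slow the
// process just sleeps longer, so throughput drops but CPU load does not rise.
func blockingCopy(dst io.Writer, src io.Reader) (int64, error) {
	return io.Copy(dst, src)
}

// busyPoll is the pathological pattern: re-checking a condition in a tight
// loop instead of blocking. This loop alone shows up as ~100% of a core even
// though no useful work is being done.
func busyPoll(ready func() bool) {
	for !ready() {
		// no sleep and no blocking call here: pure CPU burn
	}
}

// politePoll does the same check but yields the CPU between attempts, which
// is the difference between "slow but idle" and "slow and pegging a CPU".
func politePoll(ready func() bool) {
	for !ready() {
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	_, _ = blockingCopy(io.Discard, os.Stdin)
	_ = busyPoll
	_ = politePoll
}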
I think this warrants further investigation: it might not be visible in all or most setups, or maybe not on every operating system, but that might just be because the right conditions to trigger it have not been met.
Of course, it might be an issue with the libraries used by storagenode, or a FreeBSD-specific porting issue. I have no idea at this point. What I am sure of is: this is not normal behavior, and it's clearly triggered by something that should, at the very most, only lead to some relative performance degradation, not peg the CPU as observed.
Well, I declared (half) victory too soon. The issue is back.
It ran fine for a while after moderating the filesystem traversal of the security checks, then a new continuous CPU spike appeared, even though none of the culprit scripts were running (and killing them had had an immediate positive effect).
I believe the cause is very simple: the storagenode filewalkers are doing exactly the same thing the security scripts were doing, a whole-filesystem traversal, so they trigger the same issue. If I restart storagenode, it runs fine with low CPU use; then after a while the filewalker really kicks in (as seen in its CPU and file IO usage increasing), and that is when storagenode's CPU spikes again.
There is really a bug in there. I hope the developers look into this because it will certainly cause issues sooner or later.
@arrogantrabbit since you run FreeBSD too, do you not notice this behavior at all?
With all metadata on SSD and the used-space filewalker enabled on startup, on restart the node will run at full speed, reading at 5-10k IOPS peak and exponentially decreasing to normal as metadata gets cached, using about 80-100% of a CPU for about 10 min. This is the only time I saw storagenode exceed 5% CPU utilization.
Here is my old comment about it, from a different server: Massive reads on startup - #30 by arrogantrabbit
Do you use the lazy filewalker setting?
So, with a normal disk doing, say, 200 IOPS, we could imagine seeing that go on for a few hours.
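(As a rough back-of-the-envelope illustration, with the piece count being a pure assumption: if such a node held about 5 million pieces, a cold metadata scan at 200 IOPS would take on the order of 5,000,000 / 200 = 25,000 seconds, i.e. roughly 7 hours.)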
I am going to try that.
pieces.enable-lazy-filewalker: false
storage2.piece-scan-on-startup: true
and see if the cpu load subsides after a couple hours
I also set sync=disabled on the zfs dataset with the storj data
I leave everything at the default. Today the default is to not run the filewalker at startup. Those plots were from when the default was the opposite, and the lazy filewalker didn't exist.
Some people reported multiple days of sustained high disk IO. The node still needs to serve data, and there is not much left over from the 200 IOPS those disks can provide.
seems like bad design and deficient implementation.
What I highlighted above ( Very hich cpu utilization since 24 june, although BW 3 times less - #30 by brainstorm ) is already indicative of bad behavior in the storagenode process: some kind of busy polling.
In addition, that filewalker business seems redundant to me, but I do not know the internals, so who knows really.
Effectively, disabling the separate lazy filewalker process sent CPU use through the roof here. We'll see if it actually finishes in a reasonable time. I only have 12 TB on that node, BTW.
(Disabling sync quadrupled IOPS on that pool, up to ~984.)
I don't think there is any polling; do you have evidence, e.g. call stacks? If we speculate for a second that they indeed use spinlocks (and a lot of lock implementations attempt a spinlock first, to avoid a roundtrip to the kernel), and if you see it spinning all the time (which needs to be confirmed), then something else is wrong in the system: something else is using a lot of time and causing unnecessary contention.
A spinlock is almost always preferable to a mutex in a well-optimized system. So the solution is not replacing them with mutexes and sleeping the process; the solution is optimizing the system to remove contention.
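For what it's worth, here is a minimal Go sketch (illustrative only, not the locks storagenode actually uses) of the difference: a pure spinlock stays runnable and keeps burning CPU while it waits, whereas sync.Mutex only spins briefly on contention before parking the goroutine.

package main

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// spinLock busy-waits until it can flip the flag from 0 to 1. While waiting,
// the goroutine stays runnable and keeps consuming CPU.
type spinLock struct{ state int32 }

func (s *spinLock) Lock() {
	for !atomic.CompareAndSwapInt32(&s.state, 0, 1) {
		runtime.Gosched() // yield to other goroutines, but never sleep
	}
}

func (s *spinLock) Unlock() { atomic.StoreInt32(&s.state, 0) }

func main() {
	var spin spinLock
	var mu sync.Mutex

	// Uncontended: both are cheap.
	spin.Lock()
	spin.Unlock()

	// Under long contention the spinlock burns a core; sync.Mutex spins only
	// a few iterations and then parks the goroutine, costing (almost) no CPU.
	mu.Lock()
	mu.Unlock()
}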
Enumerating files takes time. Disabling the lazy filewalker does not increase CPU usage; enabling the used-space filewalker does, if you have enough RAM and/or a fast SSD.
You can approximately emulate what the filewalker is doing with time find . -type f -print0 | xargs -0 -P 24 -n 100 stat -f "%z".
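A rough Go equivalent of that emulation (an illustrative sketch only, not the actual filewalker code) is a full directory walk that stats every file, which is exactly the metadata load in question:

package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

func main() {
	root := "." // assumption: run from the node's storage directory
	start := time.Now()
	var files int
	var bytes int64

	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info() // one stat per file: the metadata load in question
		if err != nil {
			return err
		}
		files++
		bytes += info.Size()
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}
	fmt.Printf("%d files, %d bytes, scanned in %s\n", files, bytes, time.Since(start))
}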
If this is slow, storagenode will waste a lot of time waiting; there is nothing it can do. CPU usage is the least of the problems here.
There is a variety of profiling tools for Go programs; I would start there to collect evidence. I have not seen any inefficiency with storagenode yet; I don't even notice it's running.
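As a starting point, this is the kind of minimal setup one would use to profile a Go service with the standard library's net/http/pprof (a generic sketch; I am not claiming this is how storagenode exposes its debug endpoints):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiling endpoints on a local port.
	// A 30-second CPU profile can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// and goroutine stacks (useful to see what the process is waiting on) with:
	//   go tool pprof http://localhost:6060/debug/pprof/goroutine
	log.Println(http.ListenAndServe("localhost:6060", nil))
}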
I don't have evidence like lines of code looping and polling on something. It's just what makes sense given the behavior I described here ( Very hich cpu utilization since 24 june, although BW 3 times less - #30 by brainstorm ).
It seems logical that if the process is basically waiting for IO, it should not be using up CPU. Besides, we see it is always in the uwait state. So, if it were waiting on system calls, it would not be using up CPU. But it is. So it probably means it is repeatedly calling and returning immediately, wasting CPU cycles. But maybe I am not understanding something here…
I disagree that CPU usage is the least of the problems. It's a big part of what makes or breaks the whole system, as it increases energy use (by a lot in this case). If you spend more energy waiting than doing actual work (i.e. downloads/uploads), then a node is that much less profitable. And the node is supposed to run on what we already have, so it should just be another innocuous, light process.
And if, as you say, enumerating files is slow, storagenode should waste a lot of time waiting, not using up CPU time (and energy). But that is not what we see here; we see huge CPU use.
To me that is clearly at least an implementation issue.
I mean, the bottom line is: any well-written process that does IO (disk and network), if it becomes IO starved, would just decrease its throughput, maybe with a small overhead hit for scaling the throughput or repeating calls. Instead, here throughput decreases and CPU increases dramatically. That is a sign of busy polling/waiting.
Also, this was revealed by another process iterating over the filesystem, completely independent of the storagenode process.
I do not know what you mean by "slow" in your example of simulating the filewalker, so it's hard to say. Do you mean that command would usually complete in the same time a filewalker completes its run?
Regarding spinlocks, there are specific cases where they can be more efficient than a mutex, the biggest one I know being very short waits, where they avoid the overhead of a thread being rescheduled. Otherwise, a spinlock is exactly busy waiting, which just wastes CPU cycles (just what we see here). I am not sure what the IO pattern of storagenode is, but again, the fact that its CPU use goes up instead of down when it effectively becomes less busy (because IO is slow) tells me that a busy wait is not the right thing to do here.
I have used gprof and Intel VTune in the past, but I have not gone down the source-profiling road on this yet.
I think storagenode is doing something clearly wrong, as it should not increase its CPU use when IO starved. You seem to say that it just means the system overall is unoptimized, which I agree with; it certainly is. I just think that is throwing resources at a problem rather than solving the root issue, namely that storagenode appears to behave badly.
And BTW, this is on a machine where the storagenode load was completely unnoticeable when storagenode was properly doing its thing, i.e. no filewalker running and no other disk churning. So the hardware is perfectly able to sustain whatever IO storagenode needs.
It should just be a lot more graceful.
Why is there such a thing as the filewalker anyway? We store stuff. We delete stuff. We track everything in databases. Something seems very off to me if it needs to traverse the entire filesystem from time to time. But that is a different problem.
I'll address your other points later tonight, but on this one: the original version of Storj used databases for everything, including tracking stored pieces, and it was a disaster. People would run it on crap consumer hardware with no UPS, and it was getting corrupted all the time. Since then, Storj has moved to relying exclusively on filesystem features for what's important, i.e. both storing and tracking what's stored. The databases, with very few exceptions (insignificant from a data-integrity perspective), are used to draw pretty plots in the UI.
People continue running nodes on shit systems (see how many "my database corrupted" threads are here; and yet, in the past 20 years I have yet to see a single sqlite database corruption in my daily use), so the filewalker is needed to update the statistics. The orders database and the expiration database are the ones actually useful, but still not critical: if you nuke them, data won't be lost; you will just end up storing more garbage for longer.
In my opinion it's a 100% waste of time. All these stats are useless. If there were an option to avoid wasting resources maintaining data in the databases and rendering these silly plots, I would turn it on immediately. The storagenode UI needs a green rectangle when all is OK, and a red rectangle with logs beneath it when it's not. That's it. The amount of IO saved would be tremendous, allowing potato nodes to keep up with the load for longer.
The filesystem scans will still be needed to delete garbage (see bloom filters), but the filesystem is also a database, and if metadata is fast, it's a very quick process.
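For context on the bloom-filter part, here is a minimal Go sketch (illustrative only, not Storj's actual filter format or implementation) of why garbage collection has to visit every piece: each on-disk piece ID is tested against a filter, and only pieces definitely not in the filter are safe to trash.

package main

import (
	"fmt"
	"hash/fnv"
)

type bloom struct {
	bits []byte
	k    uint32 // number of hash probes per ID
}

func newBloom(sizeBytes int, k uint32) *bloom {
	return &bloom{bits: make([]byte, sizeBytes), k: k}
}

// positions derives k bit positions for an ID via double hashing.
func (b *bloom) positions(id string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(id))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	n := uint32(len(b.bits) * 8)
	pos := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		pos[i] = (h1 + i*h2) % n
	}
	return pos
}

func (b *bloom) Add(id string) {
	for _, p := range b.positions(id) {
		b.bits[p/8] |= 1 << (p % 8)
	}
}

// MayContain returns false only if the ID was definitely never added.
func (b *bloom) MayContain(id string) bool {
	for _, p := range b.positions(id) {
		if b.bits[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 4)
	f.Add("piece-aaa")
	// During GC the node walks its piece files and keeps only the IDs the
	// filter may contain; "piece-zzz" would (almost certainly) be trashed.
	fmt.Println(f.MayContain("piece-aaa"), f.MayContain("piece-zzz"))
}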
There are systems that can be interrupted at any moment and not be irremediably corrupted. Consumer hardware without UPS is not the root problem.
I'd use databases to track everything, with the ability to regenerate the database from the stored data. And a few other things.
Between this, the "trash" issue, the bandwidth accounting, the not-actually-decentralized storage system (there are only 4 clients of the storage nodes: the 4 satellites), the long vetting process, and more…
It seems there are lots of weird and hokey things going on. Growing pains, I suppose.
It seems there is not much I can do to solve my problem besides breaking out the source, debugger, and profiler, because I do not think the devs view this as a bug, a problem, or a priority.