atop can show disk utilization and bandwidth btw. I like it.
it doesn't show IO wait though… Hm… I'm not an expert on FreeBSD, but it looks like the uwait state on the storagenode process does mean exactly IO wait, so all that WCPU is being used to perform some IO operations. Perhaps zfs?
I can only invite @arrogantrabbit, maybe they know better.
Well, it does not make sense that it would be IO operations: I can see that at those times read/write on the zfs pool is rather low, 3-5 MB/s, and network IO is low as well (5-20 Mb/s). (sysutils/ioztat shows it well.)
During last week's tests, I was getting continuous 15-20 MB/s disk IO and 150 Mb/s network, all with a 0.02 CPU load.
I agree that since all the extra CPU load is "system", that must mean the storagenode process is causing some kind of churn in system processes.
Yeah, but my nodes don't have high CPU usage under any load…
However, they are on a worse OS: Windows. One is a Windows service and the other two run in Docker Desktop for Windows (on WSL2, so still a Linux VM)…
But no RAID at all. So, for me, that's the only explanation.
Right, I think it's totally not normal. It was going fine with no load at all during that whole week of tests. But nothing changed in that system, so I wonder what it could be. Still digging…
top in io mode does not even look bad
CPU: 2.8% user, 0.0% nice, 61.9% system, 0.0% interrupt, 35.3% idle
Mem: 1378M Active, 304G Inact, 1423M Laundry, 59G Wired, 229M Buf, 8304M Free
ARC: 33G Total, 13G MFU, 6385M MRU, 26M Anon, 396M Header, 13G Other
5425M Compressed, 23G Uncompressed, 4.26:1 Ratio
Swap: 4096M Total, 206M Used, 3890M Free, 5% Inuse
PID USERNAME VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND
7390 storagenod 11479 1648 76 364 0 440 42.07% storagenode
63355 88 3790138 142 4 98 0 102 9.75% mysqld
73892 root 198 333 103 0 0 103 9.85% find
92235 storagenod 393 175 5 0 0 5 0.48% storagenode
25940 hostd 1255 73 2 243 0 245 23.42% hostd
12952 zabbix 62 12 0 0 0 0 0.00% zabbix_server
12965 zabbix 21 0 0 0 0 0 0.00% zabbix_server
7389 root 145 0 0 150 0 150 14.34% daemon
88576 bruno 2 1 0 0 0 0 0.00% top
11729 zabbix 6 0 0 0 0 0 0.00% zabbix_agentd
12992 zabbix 16 0 0 0 0 0 0.00% zabbix_server
12993 zabbix 12 0 0 0 0 0 0.00% zabbix_server
12962 zabbix 12 0 0 0 0 0 0.00% zabbix_server
12954 zabbix 31 0 0 0 0 0 0.00% zabbix_server
12944 zabbix 19 0 0 0 0 0 0.00% zabbix_server
12963 zabbix 22 0 0 0 0 0 0.00% zabbix_server
12983 zabbix 12 0 0 0 0 0 0.00% zabbix_server
12996 zabbix 4 0 0 0 0 0 0.00% zabbix_server
11720 zabbix 2 0 0 0 0 0 0.00% zabbix_agentd
12966 zabbix 6 0 0 0 0 0 0.00% zabbix_server
26509 zabbix 2 0 0 0 0 0 0.00% zabbix_agentd
A few avenues:
- top in io mode (run top, then press m), or run top -m io; it shows IO per process
- iostat -w 1 -n 10
- There are all sorts of pre-made dtrace scripts in /usr/share/dtrace/; for example, /usr/share/dtrace/disklatency may be what you need. The whole script is literally a paragraph's worth of text, so you can modify it to your heart's content. For example, you might want to adjust the quantization if you have very fast disks. Here is disklatency:
#pragma D option quiet
#pragma D option dynvarsize=16m

dtrace:::BEGIN
{
    printf("Tracing... Hit Ctrl-C to end.\n");
}

io:::start
{
    start_time[arg0] = timestamp;
}

io:::done
/this->start = start_time[arg0]/
{
    this->delta = (timestamp - this->start) / 1000;
    @q[args[1]->device_name, args[1]->unit_number] = lquantize(this->delta, 4000, 80000, 4000);
    @max[args[1]->device_name, args[1]->unit_number] = max(this->delta);
    @avg[args[1]->device_name, args[1]->unit_number] = avg(this->delta);
    @stddev[args[1]->device_name, args[1]->unit_number] = stddev(this->delta);
    start_time[arg0] = 0;
}

tick-10s
{
    printa(" %s (%d), us:\n%@d\n", @q);
    printa("max%s (%d), us:\n%@d\n", @max);
    printa("avg%s (%d), us:\n%@d\n", @avg);
    printa("stddev%s (%d), us:\n%@d\n", @stddev);
    clear(@q);
    clear(@max);
    clear(@avg);
    clear(@stddev);
}
I can't recommend this enough: DTrace Tools. If you want to learn about the system and what's going on there, dtrace is your friend.
Thanks, I forgot about DTrace… time to learn a bit more about it.
top in io mode does not look bad to me, I mean there is relatively low activity.
I wonder if storagenode is busy waiting on something
I found I had messed up the zfs dataset mountpoints in the jail, and both the databases and the data were on the same dataset. Still, I do not think that would account for the huge CPU load.
That said, after I restarted the jail with the proper config, CPU load was back to 0.06… with good disk IO and network bandwidth (100 Mb/s). That stayed almost the same for about 22 hours, slowly climbing to about 0.09. Then, over 3 hours, it climbed back to 0.8-1.0 CPU load. It's like there is a slow "CPU leak" that suddenly goes boom.
I found what is triggering the issue and have a workaround.
However, I believe the root cause may be a bug in storagenode; here is why.
What was happening
When the daily security check runs, it goes over all files and filesystems checking for setuid bits that have changed. This runs find, which churns through a very high number of files (or just their metadata?) very fast. This makes the storagenode process increase its CPU use dramatically.
Workaround
Disable the daily security check and kill the currently running related processes.
Why this may be a bug in storagenode
The overall IO rate of the security check was quite low. While it was running, the system was not IO starved: I could run big file IO and network transfers, with lower performance of course, but still at rates 3-8 times what storagenode would be doing.
If starved for IO, one would expect any process to spend more time waiting for read/write calls to return, maybe sleeping in between, or using select/poll to be notified when to perform further IO. In that case, the process would simply have much lower performance and its CPU load would not rise (if anything, it would drop). For storagenode that would translate to lower disk and network IO rates, and a lot more idle time (sleeping, or blocked on system calls).
However, this is not what is observed. What was observed is consistent with busy waiting: the storagenode process was constantly in the "uwait" process state. It is as if, instead of pulling back, moderating bandwidth, and sleeping more, storagenode loops at a very fast rate on system calls that fail or return a retry error, or just polls on some event or mutex, thereby consuming a lot of CPU while effectively doing nothing (which is also observed: lower performance, lower overall IO, more missed uploads and downloads).
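To make that distinction concrete, here is a minimal Go sketch (purely illustrative, not taken from storagenode) of the two behaviors: blocking on IO costs essentially no CPU while waiting, whereas a tight retry/poll loop pegs a core while making no progress.

package main

import (
	"io"
	"os"
	"time"
)

// blockingCopy waits inside the read/write syscalls; if the disk is slow the
// process just sleeps longer, so throughput drops but CPU load does not rise.
func blockingCopy(dst io.Writer, src io.Reader) (int64, error) {
	return io.Copy(dst, src)
}

// busyPoll is the pathological pattern: re-checking a condition in a tight
// loop instead of blocking. This loop alone shows up as ~100% of a core even
// though no useful work is being done.
func busyPoll(ready func() bool) {
	for !ready() {
		// no sleep and no blocking call here: pure CPU burn
	}
}

// politePoll does the same check but yields the CPU between attempts, which
// is the difference between "slow but idle" and "slow and pegging a CPU".
func politePoll(ready func() bool) {
	for !ready() {
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	_, _ = blockingCopy(io.Discard, os.Stdin)
	_ = busyPoll
	_ = politePoll
}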
I think this warrants further investigation: it might not be visible in all or most setups, or maybe not on every operating system, but that might just be because the right conditions to trigger it have not been met.
Of course, it might be an issue with the libraries used by storagenode, or a FreeBSD-specific porting issue. I have no idea at this point. What I am sure of is: this is not normal behavior, and it's clearly triggered by something that should, at the very most, only lead to some relative performance degradation, not peg the CPU as observed.
Well, I declared (half) victory too soon. The issue is back.
It ran fine for a while after moderating the filesystem traversal of the security checks, then a new continuous CPU spike appeared, even though none of the culprit scripts were running (and killing them had had an immediate positive effect).
I believe the cause is very simple: the storagenode filewalkers are doing exactly the same thing the security scripts were doing, a whole-filesystem traversal, so they trigger the same issue. If I restart storagenode, it runs fine with low CPU use; then after a while the filewalker really kicks in (as seen in its CPU and file IO usage increasing), and that is when storagenode's CPU spikes again.
There is really a bug in there. I hope the developers look into this because it will certainly cause issues sooner or later.
@arrogantrabbit since you run FreeBSD too, do you not notice this behavior at all?
With all metadata on SSD and the used-space filewalker enabled on startup, on restart the node will run at full speed, reading at 5-10k IOPS peak and exponentially decreasing to normal as metadata gets cached, using about 80-100% of a CPU for about 10 min. This is the only time I saw storagenode exceed 5% CPU utilization.
Here is my old comment about it, from a different server: Massive reads on startup - #30 by arrogantrabbit
Do you use the lazy filewalker setting?
So, with a normal disk doing, say, 200 IOPS, we could imagine seeing that go on for a few hours.
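(As a rough back-of-the-envelope illustration, with the piece count being a pure assumption: if such a node held about 5 million pieces, a cold metadata scan at 200 IOPS would take on the order of 5,000,000 / 200 = 25,000 seconds, i.e. roughly 7 hours.)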
I am going to try that.
pieces.enable-lazy-filewalker: false
storage2.piece-scan-on-startup: true
and see if the cpu load subsides after a couple hours
I also set sync=disabled on the zfs dataset with the storj data
I leave everything at the default. Today the default is to not run the filewalker at startup. Those plots were from when the default was the opposite, and the lazy filewalker didn't exist.
Some people reported multiple days of sustained high disk IO. The node still needs to serve data, and there is not much left over from the 200 IOPS those disks can provide.
seems like bad design and deficient implementation.
What I highlighted above ( Very hich cpu utilization since 24 june, although BW 3 times less - #30 by brainstorm ) is already indicative of bad behavior in the storagenode process: some kind of busy polling.
In addition, that filewalker business seems redundant to me, but I do not know the internals, so who knows really.
Effectively, disabling the separate lazy filewalker process sent CPU use through the roof here. We'll see if it actually finishes in a reasonable time. I only have 12 TB on that node, BTW.
(Disabling sync quadrupled IOPS on that pool, up to ~984.)
I don't think there is any polling; do you have evidence, e.g. call stacks? If we speculate for a second that they indeed use spinlocks (and a lot of lock implementations attempt a spinlock first, to avoid a roundtrip to the kernel), and if you see it spinning all the time (which needs to be confirmed), then something else is wrong in the system: something else is using a lot of time and causing unnecessary contention.
A spinlock is almost always preferable to a mutex in a well-optimized system. So the solution is not replacing them with mutexes and sleeping the process; the solution is optimizing the system to remove contention.
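For what it's worth, here is a minimal Go sketch (illustrative only, not the locks storagenode actually uses) of the difference: a pure spinlock stays runnable and keeps burning CPU while it waits, whereas sync.Mutex only spins briefly on contention before parking the goroutine.

package main

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// spinLock busy-waits until it can flip the flag from 0 to 1. While waiting,
// the goroutine stays runnable and keeps consuming CPU.
type spinLock struct{ state int32 }

func (s *spinLock) Lock() {
	for !atomic.CompareAndSwapInt32(&s.state, 0, 1) {
		runtime.Gosched() // yield to other goroutines, but never sleep
	}
}

func (s *spinLock) Unlock() { atomic.StoreInt32(&s.state, 0) }

func main() {
	var spin spinLock
	var mu sync.Mutex

	// Uncontended: both are cheap.
	spin.Lock()
	spin.Unlock()

	// Under long contention the spinlock burns a core; sync.Mutex spins only
	// a few iterations and then parks the goroutine, costing (almost) no CPU.
	mu.Lock()
	mu.Unlock()
}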
Enumerating files takes time. Disabling the lazy filewalker does not increase CPU usage; enabling the used-space filewalker does, if you have enough RAM and/or a fast SSD.
You can approximately emulate what the filewalker is doing with time find . -type f -print0 | xargs -0 -P 24 -n 100 stat -f "%z".
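A rough Go equivalent of that emulation (an illustrative sketch only, not the actual filewalker code) is a full directory walk that stats every file, which is exactly the metadata load in question:

package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

func main() {
	root := "." // assumption: run from the node's storage directory
	start := time.Now()
	var files int
	var bytes int64

	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info() // one stat per file: the metadata load in question
		if err != nil {
			return err
		}
		files++
		bytes += info.Size()
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}
	fmt.Printf("%d files, %d bytes, scanned in %s\n", files, bytes, time.Since(start))
}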
If this is slow, storagenode will waste a lot of time waiting; there is nothing it can do. CPU usage is the least of the problems here.
There is a variety of profiling tools for Go programs; I would start there to collect evidence. I have not seen any inefficiency with storagenode yet; I don't even notice it's running.
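As a starting point, this is the kind of minimal setup one would use to profile a Go service with the standard library's net/http/pprof (a generic sketch; I am not claiming this is how storagenode exposes its debug endpoints):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiling endpoints on a local port.
	// A 30-second CPU profile can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// and goroutine stacks (useful to see what the process is waiting on) with:
	//   go tool pprof http://localhost:6060/debug/pprof/goroutine
	log.Println(http.ListenAndServe("localhost:6060", nil))
}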
I don't have evidence like lines of code looping and polling on something. It's just what makes sense given the behavior I described here ( Very hich cpu utilization since 24 june, although BW 3 times less - #30 by brainstorm ).
It seems logical that if the process is basically waiting for IO, it should not be using up CPU. Besides, we see it is always in the uwait state. So, if it were waiting on system calls, it would not be using up CPU. But it is. So it probably means it is repeatedly calling and returning immediately, wasting CPU cycles. But maybe I am not understanding something here…
I disagree that CPU usage is the least of the problems. It's a big part of what makes or breaks the whole system, as it increases energy use (by a lot in this case). If you spend more energy waiting than doing actual work (i.e. downloads/uploads), then a node is that much less profitable. And the node is supposed to run on what we already have, so it should just be another innocuous, light process.
And if, as you say, enumerating files is slow, storagenode should waste a lot of time waiting, not using up CPU time (and energy). But that is not what we see here; we see huge CPU use.
To me that is clearly at least an implementation issue.
I mean, the bottom line is: any well-written process that does IO (disk and network), if it becomes IO starved, would just decrease its throughput, maybe with a small overhead hit for scaling the throughput or repeating calls. Instead, here throughput decreases and CPU increases dramatically. That is a sign of busy polling/waiting.
Also, this was revealed by another process iterating over the filesystem, completely independent of the storagenode process.
I do not know what you mean by "slow" in your example of simulating the filewalker, so it's hard to say. Do you mean that command would usually complete in the same time a filewalker completes its run?
Regarding spinlocks, there are specific cases where they can be more efficient than a mutex, the biggest one I know being very short waits, where they avoid the overhead of a thread being rescheduled. Otherwise, a spinlock is exactly busy waiting, which just wastes CPU cycles (just what we see here). I am not sure what the IO pattern of storagenode is, but again, the fact that its CPU use goes up instead of down when it effectively becomes less busy (because IO is slow) tells me that a busy wait is not the right thing to do here.
I have used gprof and Intel VTune in the past, but I have not gone down the source-profiling road on this yet.
I think storagenode is doing something clearly wrong, as it should not increase its CPU use when IO starved. You seem to say that it just means the system overall is unoptimized, which I agree with; it certainly is. I just think that is throwing resources at a problem rather than solving the root issue, namely that storagenode appears to behave badly.
And BTW, this is on a machine where the storagenode load was completely unnoticeable when storagenode was properly doing its thing, i.e. no filewalker running and no other disk churning. So the hardware is perfectly able to sustain whatever IO storagenode needs.
It should just be a lot more graceful.
Why is there such a thing as the filewalker anyway? We store stuff. We delete stuff. We track everything in databases. Something seems very off to me if it needs to traverse the entire filesystem from time to time. But that is a different problem.
I'll address your other points later tonight, but on this one: the original version of Storj used databases for everything, including tracking stored pieces, and it was a disaster. People would run it on crap consumer hardware with no UPS, and it was getting corrupted all the time. Since then, Storj has moved to relying exclusively on filesystem features for what's important, i.e. both storing and tracking what's stored. The databases, with very few exceptions (insignificant from a data-integrity perspective), are used to draw pretty plots in the UI.
People continue running nodes on shit systems (see how many "my database corrupted" threads are here; and yet, in the past 20 years I have yet to see a single sqlite database corruption in my daily use), so the filewalker is needed to update the statistics. The orders database and the expiration database are the ones actually useful, but still not critical: if you nuke them, data won't be lost; you will just end up storing more garbage for longer.
In my opinion it's a 100% waste of time. All these stats are useless. If there were an option to avoid wasting resources maintaining data in the databases and rendering these silly plots, I would turn it on immediately. The storagenode UI needs a green rectangle when all is OK, and a red rectangle with logs beneath it when it's not. That's it. The amount of IO saved would be tremendous, allowing potato nodes to keep up with the load for longer.
The filesystem scans will still be needed to delete garbage (see bloom filters), but the filesystem is also a database, and if metadata is fast, it's a very quick process.
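For context on the bloom-filter part, here is a minimal Go sketch (illustrative only, not Storj's actual filter format or implementation) of why garbage collection has to visit every piece: each on-disk piece ID is tested against a filter, and only pieces definitely not in the filter are safe to trash.

package main

import (
	"fmt"
	"hash/fnv"
)

type bloom struct {
	bits []byte
	k    uint32 // number of hash probes per ID
}

func newBloom(sizeBytes int, k uint32) *bloom {
	return &bloom{bits: make([]byte, sizeBytes), k: k}
}

// positions derives k bit positions for an ID via double hashing.
func (b *bloom) positions(id string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(id))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	n := uint32(len(b.bits) * 8)
	pos := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		pos[i] = (h1 + i*h2) % n
	}
	return pos
}

func (b *bloom) Add(id string) {
	for _, p := range b.positions(id) {
		b.bits[p/8] |= 1 << (p % 8)
	}
}

// MayContain returns false only if the ID was definitely never added.
func (b *bloom) MayContain(id string) bool {
	for _, p := range b.positions(id) {
		if b.bits[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 4)
	f.Add("piece-aaa")
	// During GC the node walks its piece files and keeps only the IDs the
	// filter may contain; "piece-zzz" would (almost certainly) be trashed.
	fmt.Println(f.MayContain("piece-aaa"), f.MayContain("piece-zzz"))
}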
There are systems that can be interrupted at any moment and not be irremediably corrupted. Consumer hardware without UPS is not the root problem.
I'd use databases to track everything, with the ability to regenerate the database from the stored data. And a few other things.
Between this, the "trash" issue, the bandwidth accounting, the not-actually-decentralized storage system (there are only 4 clients of the storage nodes: the 4 satellites), the long vetting process, and more…
It seems there are lots of weird and hokey things going on. Growing pains, I suppose.
It seems there is not much I can do to solve my problem besides breaking out the source, debugger, and profiler, because I do not think the devs view this as a bug, a problem, or a priority.