Side quest into processor interrupt affinity.
TLDR – unsuccessful.
Noting that the ethernet controller and two special device SSDs constantly bug the processors with interrupts, supposedly preventing them from entering or staying in deeper C-states, I thought: what will happen if we move those devices to PCIe slots connected to the same processor, set interrupt affinities to only bug that one, and also set processor affinities on the storj jails – thus hopefully offloading work from the first processor, allowing it to sleep longer?
So, I moved the four devices in question to PCIe slots connected to CPU1, according to the motherboard manual. The device names are nvme2, nvme1, mpr0, and igc0.
while IFS= read -r device; do
echo "$device" | sed -nE 's/^(.*): <(.*)>.*(numa-domain [0-9]+) .*$/\1: \3: \2/p'
done <<< "$(dmesg | egrep '(nvme2|nvme1|mpr0|igc0).*numa-domain')"
The output confirms they are all in the same NUMA domain:
nvme1: numa-domain 1: Intel DC PC3500
nvme2: numa-domain 1: Intel DC PC3500
igc0: numa-domain 1: Intel(R) Ethernet Controller I225-V
mpr0: numa-domain 1: Avago Technologies (LSI) SAS3008
Next, let's print the device affinities and interrupt rates:
# Printing their affinities
while IFS= read -r device; do
irq="$(echo "$device" | sed -nE 's/^irq([0-9]+):.*$/\1/p')"
name="$(echo "$device" | sed -nE 's/^.*: ([^ ]+).*$/\1/p')"
stats="$(cpuset -g -x "$irq")"
rate="$(echo "$device" | sed -nE 's/^.* ([0-9]+)$/\1/p')"
printf "%-11s %-11s | %s\n" "$name" "($rate tps)" "$stats"
done <<< "$(vmstat -i | egrep '(nvme2|nvme1|mpr0|igc0)')"
yields:
nvme1:admin (0 tps) | irq 99 mask: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
nvme1:io0 (141 tps) | irq 100 mask: 0
nvme1:io1 (122 tps) | irq 101 mask: 2
nvme1:io2 (123 tps) | irq 102 mask: 4
nvme1:io3 (123 tps) | irq 103 mask: 6
nvme1:io4 (105 tps) | irq 104 mask: 8
nvme1:io5 (109 tps) | irq 105 mask: 10
nvme1:io6 (136 tps) | irq 106 mask: 12
nvme1:io7 (125 tps) | irq 107 mask: 14
nvme2:admin (0 tps) | irq 108 mask: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
nvme2:io0 (141 tps) | irq 109 mask: 1
nvme2:io1 (122 tps) | irq 110 mask: 3
nvme2:io2 (123 tps) | irq 111 mask: 5
nvme2:io3 (123 tps) | irq 112 mask: 7
nvme2:io4 (105 tps) | irq 113 mask: 9
nvme2:io5 (109 tps) | irq 114 mask: 11
nvme2:io6 (137 tps) | irq 115 mask: 13
nvme2:io7 (125 tps) | irq 116 mask: 15
igc0:rxq0 (1963 tps) | irq 117 mask: 8
igc0:rxq1 (1540 tps) | irq 118 mask: 10
igc0:rxq2 (974 tps) | irq 119 mask: 12
igc0:rxq3 (1180 tps) | irq 120 mask: 14
igc0:aq (0 tps) | irq 121 mask: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
mpr0 (173 tps) | irq 122 mask: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
As you can see, each NVMe drive spreads its interrupts across both CPUs (even cores for nvme1, odd cores for nvme2), and the HBA (mpr0) is free to interrupt any core; only igc0 already happens to land on CPU1's cores.
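To make that split concrete, the single-core rates in the listing can be tallied per package. A throwaway sketch (`split_load` is my own name for it, and it assumes cores 0–7 belong to CPU0 and 8–15 to CPU1, per this board's numbering):

```shell
# Sketch: sum the per-IRQ rates by target core to see how the interrupt
# load splits across packages. IRQs whose mask spans multiple cores are
# skipped, since we cannot tell where those interrupts actually land.
split_load() {
    awk '
    /mask: [0-9]+$/ {                         # only IRQs pinned to one core
        rate = $2; gsub(/[^0-9]/, "", rate)   # "(141" -> 141
        core = $NF                            # last field is the core id
        sum[core + 0 >= 8 ? "CPU1" : "CPU0"] += rate
    }
    END { for (p in sum) print p, sum[p] " tps" }'
}

# A few lines from the listing above, for illustration:
sample='nvme1:io0 (141 tps) | irq 100 mask: 0
nvme1:io4 (105 tps) | irq 104 mask: 8
igc0:rxq0 (1963 tps) | irq 117 mask: 8'

printf '%s\n' "$sample" | split_load
```

On the real box you would pipe the full affinity listing through `split_load` instead of the sample.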
Unfortunately, I discovered that I cannot assign interrupt affinity to a list or range of cores – the choice is either a single core or all of them.
For example, this fails:
% sudo cpuset -l '8,15' -x 122
cpuset: setaffinity: Invalid argument
% sudo cpuset -l '8-15' -x 122
cpuset: setaffinity: Invalid argument
but this works:
% sudo cpuset -l '8' -x 122
% sudo cpuset -l 'all' -x 122
That’s fine; after carefully picking random numbers from the range 8–15 I ended up setting affinities as follows:
sudo cpuset -l '12' -x 100
sudo cpuset -l '13' -x 101
sudo cpuset -l '14' -x 102
sudo cpuset -l '15' -x 103
sudo cpuset -l '8' -x 109
sudo cpuset -l '9' -x 110
sudo cpuset -l '10' -x 111
sudo cpuset -l '11' -x 112
sudo cpuset -l '15' -x 122
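The same pinning could be driven from a small irq:core table instead of nine hand-typed commands. A dry-run sketch – it echoes the commands rather than executing them, since `cpuset -x` needs root, and `pin_irqs` is a hypothetical helper name:

```shell
# Sketch: apply the irq -> core pinning above from a compact table.
# Dry run: drop the `echo` to actually execute as root.
pin_irqs() {
    for pair in "$@"; do
        irq=${pair%%:*}     # text before the colon
        core=${pair##*:}    # text after the colon
        echo sudo cpuset -l "$core" -x "$irq"
    done
}

pin_irqs 100:12 101:13 102:14 103:15 \
         109:8 110:9 111:10 112:11 122:15
```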
Thus all the listed devices now bug only CPU1. I’ve then also set the storj jail’s affinity to the same CPU:
% jls | grep storj
4 storj /mnt/pool1/iocage/jails/storj/root
% sudo cpuset -l '8-15' -j 4
The load visibly shifted to CPU1, but no perceptible difference in power consumption occurred.
I’m suspecting the QPI link, part of the uncore, is to blame – I guess it has to stay awake while either processor is touching memory. That would prevent the package from entering a deeper C-state, even though the idleness of the processor cores themselves might have allowed sleep.
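For anyone reproducing this: per-core C-state residency is visible through the `dev.cpu.N.cx_usage` sysctls on FreeBSD. A sketch that extracts the deepest-state percentage – it parses a captured sample line here, since the exact output format (one percentage per supported Cx state, then `last <n>us`) may differ between releases:

```shell
# Sketch: pull the deepest C-state residency out of a cx_usage line.
# On a live box, replace `sample` with `$(sysctl dev.cpu.8.cx_usage)`.
deepest_residency() {
    # The last %-suffixed field is the deepest supported state.
    awk '{ for (i = NF; i > 0; i--) if ($i ~ /%$/) { print $i; break } }'
}

# Captured sample (assumed format, percentages are illustrative):
sample='dev.cpu.8.cx_usage: 0.00% 43.16% 56.83% last 1193us'
printf '%s\n' "$sample" | deepest_residency
```

If the deepest-state residency on CPU0's cores rises after the pinning while package power stays flat, that would point at the uncore rather than the cores.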
I’m out of my depth here, and out of ideas. Maybe CPU experts can chime in and tell me that I’m digging in the wrong direction, and that I should stop kicking the dead horse, defenestrate the 10-year-old platform, and buy a modern E-2400-based solution if I want to shave off a few more watts.