Side quest into processor interrupt affinity.
TLDR – unsuccessful.
Noting that the ethernet controller and two special device SSDs constantly bug the processors with interrupts, supposedly preventing them from entering or staying in deeper C-states, I thought: what will happen if we move those devices to PCIe slots connected to the same processor, set interrupt affinities to only bug that one, and also set processor affinities on the storj jails – thus hopefully offloading work from the first processor, allowing it to sleep longer?
So, I moved the four devices in question to PCIe slots connected to CPU1, according to the motherboard manual. The device names are nvme2, nvme1, mpr0, and igc0.
while IFS= read -r device; do
echo "$device" | sed -nE 's/^(.*): <(.*)>.*(numa-domain [0-9]+) .*$/\1: \3: \2/p'
done <<< "$(dmesg | egrep '(nvme2|nvme1|mpr0|igc0).*numa-domain')"
The output confirms they are all in the same NUMA domain:
nvme1: numa-domain 1: Intel DC PC3500
nvme2: numa-domain 1: Intel DC PC3500
igc0: numa-domain 1: Intel(R) Ethernet Controller I225-V
mpr0: numa-domain 1: Avago Technologies (LSI) SAS3008
Next, let's print the device affinities and interrupt rates:
# Printing their affinities
while IFS= read -r device; do
irq="$(echo "$device" | sed -nE 's/^irq([0-9]+):.*$/\1/p')"
name="$(echo "$device" | sed -nE 's/^.*: ([^ ]+).*$/\1/p')"
stats="$(cpuset -g -x "$irq")"
rate="$(echo "$device" | sed -nE 's/^.* ([0-9]+)$/\1/p')"
printf "%-11s %-11s | %s\n" "$name" "($rate tps)" "$stats"
done <<< "$(vmstat -i | egrep '(nvme2|nvme1|mpr0|igc0)')"
yields:
nvme1:admin (0 tps) | irq 99 mask: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
nvme1:io0 (141 tps) | irq 100 mask: 0
nvme1:io1 (122 tps) | irq 101 mask: 2
nvme1:io2 (123 tps) | irq 102 mask: 4
nvme1:io3 (123 tps) | irq 103 mask: 6
nvme1:io4 (105 tps) | irq 104 mask: 8
nvme1:io5 (109 tps) | irq 105 mask: 10
nvme1:io6 (136 tps) | irq 106 mask: 12
nvme1:io7 (125 tps) | irq 107 mask: 14
nvme2:admin (0 tps) | irq 108 mask: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
nvme2:io0 (141 tps) | irq 109 mask: 1
nvme2:io1 (122 tps) | irq 110 mask: 3
nvme2:io2 (123 tps) | irq 111 mask: 5
nvme2:io3 (123 tps) | irq 112 mask: 7
nvme2:io4 (105 tps) | irq 113 mask: 9
nvme2:io5 (109 tps) | irq 114 mask: 11
nvme2:io6 (137 tps) | irq 115 mask: 13
nvme2:io7 (125 tps) | irq 116 mask: 15
igc0:rxq0 (1963 tps) | irq 117 mask: 8
igc0:rxq1 (1540 tps) | irq 118 mask: 10
igc0:rxq2 (974 tps) | irq 119 mask: 12
igc0:rxq3 (1180 tps) | irq 120 mask: 14
igc0:aq (0 tps) | irq 121 mask: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
mpr0 (173 tps) | irq 122 mask: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
As you can see, each NVMe drive spreads its interrupts across both CPUs (even cores for nvme1, odd cores for nvme2), and the HBA (mpr0) is free to interrupt any core; only igc0 already happens to land on CPU1's cores.
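To make that split concrete, the single-core rates in the listing can be tallied per package. A throwaway sketch (`split_load` is my own name for it, and it assumes cores 0–7 belong to CPU0 and 8–15 to CPU1, per this board's numbering):

```shell
# Sketch: sum the per-IRQ rates by target core to see how the interrupt
# load splits across packages. IRQs whose mask spans multiple cores are
# skipped, since we cannot tell where those interrupts actually land.
split_load() {
    awk '
    /mask: [0-9]+$/ {                         # only IRQs pinned to one core
        rate = $2; gsub(/[^0-9]/, "", rate)   # "(141" -> 141
        core = $NF                            # last field is the core id
        sum[core + 0 >= 8 ? "CPU1" : "CPU0"] += rate
    }
    END { for (p in sum) print p, sum[p] " tps" }'
}

# A few lines from the listing above, for illustration:
sample='nvme1:io0 (141 tps) | irq 100 mask: 0
nvme1:io4 (105 tps) | irq 104 mask: 8
igc0:rxq0 (1963 tps) | irq 117 mask: 8'

printf '%s\n' "$sample" | split_load
```

On the real box you would pipe the full affinity listing through `split_load` instead of the sample.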
Unfortunately, I discovered that I cannot assign interrupt affinity to a list or range of cores – the choice is either a single core or all of them.
For example, this fails:
% sudo cpuset -l '8,15' -x 122
cpuset: setaffinity: Invalid argument
% sudo cpuset -l '8-15' -x 122
cpuset: setaffinity: Invalid argument
but this works:
% sudo cpuset -l '8' -x 122
% sudo cpuset -l 'all' -x 122
That’s fine; after carefully picking random numbers from the range 8–15 I ended up setting affinities as follows:
sudo cpuset -l '12' -x 100
sudo cpuset -l '13' -x 101
sudo cpuset -l '14' -x 102
sudo cpuset -l '15' -x 103
sudo cpuset -l '8' -x 109
sudo cpuset -l '9' -x 110
sudo cpuset -l '10' -x 111
sudo cpuset -l '11' -x 112
sudo cpuset -l '15' -x 122
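The same pinning could be driven from a small irq:core table instead of nine hand-typed commands. A dry-run sketch – it echoes the commands rather than executing them, since `cpuset -x` needs root, and `pin_irqs` is a hypothetical helper name:

```shell
# Sketch: apply the irq -> core pinning above from a compact table.
# Dry run: drop the `echo` to actually execute as root.
pin_irqs() {
    for pair in "$@"; do
        irq=${pair%%:*}     # text before the colon
        core=${pair##*:}    # text after the colon
        echo sudo cpuset -l "$core" -x "$irq"
    done
}

pin_irqs 100:12 101:13 102:14 103:15 \
         109:8 110:9 111:10 112:11 122:15
```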
Thus all the listed devices now bug only CPU1. I’ve then also set the storj jail’s affinity to the same CPU:
% jls | grep storj
4 storj /mnt/pool1/iocage/jails/storj/root
% sudo cpuset -l '8-15' -j 4
The load visibly shifted to CPU1, but no perceptible difference in power consumption occurred.
I’m suspecting the QPI link, part of the uncore, is to blame – I guess it has to stay awake while either processor is touching memory. That would prevent the package from entering a deeper C-state, even though the idleness of the processor cores themselves might have allowed sleep.
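For anyone reproducing this: per-core C-state residency is visible through the `dev.cpu.N.cx_usage` sysctls on FreeBSD. A sketch that extracts the deepest-state percentage – it parses a captured sample line here, since the exact output format (one percentage per supported Cx state, then `last <n>us`) may differ between releases:

```shell
# Sketch: pull the deepest C-state residency out of a cx_usage line.
# On a live box, replace `sample` with `$(sysctl dev.cpu.8.cx_usage)`.
deepest_residency() {
    # The last %-suffixed field is the deepest supported state.
    awk '{ for (i = NF; i > 0; i--) if ($i ~ /%$/) { print $i; break } }'
}

# Captured sample (assumed format, percentages are illustrative):
sample='dev.cpu.8.cx_usage: 0.00% 43.16% 56.83% last 1193us'
printf '%s\n' "$sample" | deepest_residency
```

If the deepest-state residency on CPU0's cores rises after the pinning while package power stays flat, that would point at the uncore rather than the cores.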
I’m out of my depth here, and out of ideas. Maybe CPU experts can chime in and tell me that I’m digging in the wrong direction, and that I should stop kicking the dead horse, defenestrate the 10-year-old platform, and buy a modern E-2400-based solution if I want to shave off a few more watts.