Unbalanced CPU utilization on a 4-socket NUMA machine, workload evenly spread across the machine

Hi,

I'm seeing uneven CPU utilization on my NUMA machine.
The workload is evenly distributed and initialized in parallel;
it is the STREAM benchmark.

It's a 4-Socket WestmereEX machine with 10 cores per
node and 80 logical CPUs in total.

It runs well for about 5-10 seconds, then the performance
drops and htop shows the CPUs on socket 0 at 100 % utilization
while sockets 1-3 are only at about 50 %. I'm using
likwid-pin, which does strict thread-to-core pinning: the first
thread goes to the first core in the list, and so on.
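
To verify the placement independently of htop, a quick test like the
following (my own sketch, not part of STREAM; it relies on glibc's
sched_getcpu()) can print which logical CPU each OpenMP thread ends up
on when started under likwid-pin:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    /* every OpenMP thread reports the logical CPU it is currently running on */
    #pragma omp parallel
    {
        printf("thread %2d on cpu %3d\n", omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}

gcc -fopenmp placement_check.c -o placement_check   (placement_check.c is just my name for the sketch)
likwid-pin -c 0-39 ./placement_check

With strict pinning, thread N should always report CPU N.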

I'm running the STREAM benchmark like this:

wget www.cs.virginia.edu/stream/FTP/Code/stream.c

gcc -fopenmp -mcmodel=medium -O -DSTREAM_ARRAY_SIZE=1000000000
-DNTIMES=1000 stream.c -o stream.1000M.1000

likwid-pin -c 0-39 ./stream.1000M.1000
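
As a sanity check on the footprint (assuming the stock stream.c with its
three double-precision arrays a, b and c):

3 arrays * 1e9 elements * 8 bytes ≈ 22.4 GiB in total,
i.e. roughly 5.6 GiB per NUMA node if the parallel initialization spreads
the pages evenly over the 4 sockets via first-touch.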

The performance also drops without pinning, but then it's not clear which
thread is running on which core.

What could cause such drops, and how can I detect the cause?

- thermal throttling (checked, and everything seems OK; see the frequency-check sketch below)
- power capping? (is this even implemented on WestmereEX?)
- bad memory? (but then why does it run well in the beginning?)
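
To watch for clock-related causes during a run, a rough sketch like this
(assuming the cpufreq sysfs interface is available on this kernel) logs the
current frequency of every logical CPU, so a per-socket clock drop would
show up directly:

#include <stdio.h>

int main(void)
{
    char path[128];
    int cpu;

    for (cpu = 0; cpu < 80; cpu++) {
        FILE *f;
        long khz = 0;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        f = fopen(path, "r");
        if (!f)
            continue;               /* no cpufreq driver exposed for this CPU */
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu %2d: %ld MHz\n", cpu, khz / 1000);
        fclose(f);
    }
    return 0;
}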

Here is also the output of Intel's Performance Counter Monitor (PCM) Tool:

It looks like this in the beginning (output of /opt/PCM/pcm.x 2 -nc; the
values are per 2-second interval):

 EXEC  : instructions per nominal CPU cycle
 IPC   : instructions per CPU cycle
 FREQ  : relation to nominal CPU frequency = 'unhalted clock ticks'/'invariant timer ticks'
         (includes Intel Turbo Boost)
 AFREQ : relation to nominal CPU frequency while in active state (not in a power-saving C state)
         = 'unhalted clock ticks'/'invariant timer ticks while in C0 state'
         (includes Intel Turbo Boost)
 READ  : bytes read from memory controller (in GBytes)
 WRITE : bytes written to memory controller (in GBytes)
 TEMP  : temperature reading in degrees Celsius relative to the TjMax temperature (thermal headroom);
         0 corresponds to the max temperature

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | READ  | WRITE | TEMP

----------------------------------------------------------------
 SKT    0     0.13   0.26   0.50    1.00    27.20    11.52     N/A
 SKT    1     0.13   0.26   0.50    1.00    27.14    11.51     N/A
 SKT    2     0.13   0.27   0.50    1.00    27.13    11.51     N/A
 SKT    3     0.13   0.27   0.50    1.00    27.12    11.51     N/A
----------------------------------------------------------------
 TOTAL  *     0.13   0.27   0.50    1.00    108.59    46.05     N/A

 Instructions retired:   42 G ; Active cycles:  160 G ; Time (TSC): 4020 Mticks
 C0 (active, non-halted) core residency: 49.95 %

 C1 core residency: 49.64 %; C3 core residency: 0.03 %; C6 core residency: 0.37 %;
 C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : 0.53 => corresponds to 13.26 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.26 => corresponds to 6.62 % core utilization over time interval
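
A rough cross-check of these numbers (assuming PCM computes EXEC as
instructions retired divided by TSC ticks times the number of logical CPUs,
and uses 4 instructions/cycle as the peak IPC):

42e9 instructions / (4.02e9 ticks * 80 logical CPUs) ≈ 0.13    -> matches the EXEC column
0.53 physical-core IPC / 4                           ≈ 13.3 %  -> matches the quoted utilization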


--- And then it drops to these values ---

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | READ  | WRITE | TEMP

----------------------------------------------------------------
 SKT    0     0.06   0.12   0.50    1.00    13.82    5.28     N/A
 SKT    1     0.06   0.25   0.23    1.00    12.68    4.97     N/A
 SKT    2     0.06   0.25   0.23    1.00    12.67    4.97     N/A
 SKT    3     0.06   0.25   0.23    1.00    12.67    4.96     N/A
----------------------------------------------------------------
 TOTAL  *     0.06   0.20   0.30    1.00    51.84    20.17     N/A

 Instructions retired:   19 G ; Active cycles:   96 G ; Time (TSC): 4021 Mticks
 C0 (active, non-halted) core residency: 29.84 %

 C1 core residency: 53.53 %; C3 core residency: 0.00 %; C6 core residency: 16.62 %;
 C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : 0.40 => corresponds to 9.91 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.12 => corresponds to 2.96 % core utilization over time interval

----

These lines are interesting: SKT 0 has an IPC of 0.12 while SKT 1 has 0.25.
SKT 0: FREQ is 0.50 (half of the logical CPUs are idle because of hyperthreading).
SKT 1: FREQ is only 0.23, which means that even the "active" (pinned) threads
are idling. Why?

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | READ  | WRITE | TEMP

 SKT    0     0.06   0.12   0.50    1.00    13.82    5.28     N/A
 SKT    1     0.06   0.25   0.23    1.00    12.68    4.97     N/A
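
If I read the PCM definitions right, FREQ is the unhalted/invariant tick
ratio averaged over all 20 logical CPUs of a socket, so:

SKT 0: 10 pinned threads running flat out on 20 logical CPUs -> 10 * 1.0 / 20 = 0.50, which is what we see.
SKT 1: FREQ of 0.23 -> 10 * x / 20 = 0.23 gives x ≈ 0.46, i.e. even the pinned threads are halted more than half of the time.

This is also consistent with the overall C0 residency:
(0.50 + 3 * 0.23) * 20 / 80 ≈ 29.8 %, matching the reported 29.84 %, and
with the C6 core residency rising to 16.62 %.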


perf shows similar results: performance and memory bandwidth go down by about 50 %.

perf stat --per-socket --interval-print 2000 -a -e
"uncore_mbox_0/event=bbox_cmds_read/","uncore_mbox_1/event=bbox_cmds_write/"
sleep 3600

Help or suggestions would be much appreciated.

Thanks and best regards,
Andreas