Hi,

I discovered uneven CPU utilization on my NUMA machine. The workload is evenly distributed and initialized in parallel; it is the STREAM benchmark. The machine is a 4-socket Westmere-EX system with 10 cores per socket and 80 logical CPUs in total. It runs well for about 5-10 seconds, then the performance drops and htop shows the CPUs on socket 0 at 100 % utilization while sockets 1-3 sit at only about 50 %.

I'm using likwid-pin, which does strict thread-to-core pinning: the first thread goes on the first core in the list, and so on. I run the STREAM benchmark like this:

wget www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -fopenmp -mcmodel=medium -O -DSTREAM_ARRAY_SIZE=1000000000 -DNTIMES=1000 stream.c -o stream.1000M.1000
likwid-pin -c 0-39 ./stream.1000M.1000

The performance also drops without pinning, but then it's not clear which thread is running on which core.

What could cause such drops, and how can I detect it?
- thermal threshold (checked, and everything seems OK)
- power capping? (is this even implemented on Westmere-EX?)
- bad memory? (but then why does it run well in the beginning?)
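For completeness, here is the rough set of checks I want to run from a second terminal while the benchmark is going (nothing authoritative, just the first things I would look at; the sysfs layout and the numastat -p option may vary by kernel and distro, and <pid> is a placeholder for the running stream process):

# actual core frequencies, sampled once per second
watch -n 1 "grep 'cpu MHz' /proc/cpuinfo | sort | uniq -c"

# per-core / per-package thermal throttle event counters (should stay at 0)
grep . /sys/devices/system/cpu/cpu*/thermal_throttle/*throttle_count

# memory per NUMA node and the numa_miss / other_node counters,
# to rule out that one node runs out of local memory
numactl --hardware
numastat
numastat -p <pid>

Is watching these counters a reasonable way to catch a power cap or thermal event on Westmere-EX, or is there a better tool for that?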
Here is also the output of Intel's Performance Counter Monitor (PCM) tool. It looks like this in the beginning (output of /opt/PCM/pcm.x 2 -nc; these values are for a 2-second interval):

EXEC  : instructions per nominal CPU cycle
IPC   : instructions per CPU cycle
FREQ  : relation to nominal CPU frequency = 'unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state) = 'unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
READ  : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
TEMP  : temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature

 Core (SKT) | EXEC | IPC  | FREQ | AFREQ | READ  | WRITE | TEMP
----------------------------------------------------------------
 SKT    0     0.13   0.26   0.50   1.00    27.20   11.52   N/A
 SKT    1     0.13   0.26   0.50   1.00    27.14   11.51   N/A
 SKT    2     0.13   0.27   0.50   1.00    27.13   11.51   N/A
 SKT    3     0.13   0.27   0.50   1.00    27.12   11.51   N/A
----------------------------------------------------------------
 TOTAL  *     0.13   0.27   0.50   1.00   108.59   46.05   N/A

Instructions retired: 42 G ; Active cycles: 160 G ; Time (TSC): 4020 Mticks
C0 (active, non-halted) core residency: 49.95 %
C1 core residency: 49.64 %; C3 core residency: 0.03 %; C6 core residency: 0.37 %
C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %
PHYSICAL CORE IPC: 0.53 => corresponds to 13.26 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.26 => corresponds to 6.62 % core utilization over time interval

And then it drops to these values:

 Core (SKT) | EXEC | IPC  | FREQ | AFREQ | READ  | WRITE | TEMP
----------------------------------------------------------------
 SKT    0     0.06   0.12   0.50   1.00    13.82    5.28   N/A
 SKT    1     0.06   0.25   0.23   1.00    12.68    4.97   N/A
 SKT    2     0.06   0.25   0.23   1.00    12.67    4.97   N/A
 SKT    3     0.06   0.25   0.23   1.00    12.67    4.96   N/A
----------------------------------------------------------------
 TOTAL  *     0.06   0.20   0.30   1.00    51.84   20.17   N/A

Instructions retired: 19 G ; Active cycles: 96 G ; Time (TSC): 4021 Mticks
C0 (active, non-halted) core residency: 29.84 %
C1 core residency: 53.53 %; C3 core residency: 0.00 %; C6 core residency: 16.62 %
C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %
PHYSICAL CORE IPC: 0.40 => corresponds to 9.91 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.12 => corresponds to 2.96 % core utilization over time interval

These are the interesting lines:

 Core (SKT) | EXEC | IPC  | FREQ | AFREQ | READ  | WRITE | TEMP
 SKT    0     0.06   0.12   0.50   1.00    13.82    5.28   N/A
 SKT    1     0.06   0.25   0.23   1.00    12.68    4.97   N/A

SKT 0 has an IPC of 0.12, while SKT 1 has 0.25. On SKT 0 the FREQ is 0.50 (half of the logical CPUs are idle because of hyperthreading). On SKT 1 the FREQ is only 0.23, which means that even the "active" threads are idling. Why?

perf shows similar results; performance and memory bandwidth go down by 50 %:

perf stat --per-socket --interval-print 2000 -a -e "uncore_mbox_0/event=bbox_cmds_read/","uncore_mbox_1/event=bbox_cmds_write/" sleep 3600

Help or suggestions would be much appreciated.

Thanks and best regards,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-numa" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html