Hello everyone,

I'm running Linux on Haswell.

https://en.wikichip.org/wiki/intel/cpuid
https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client)

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
stepping        : 3
microcode       : 0x28
cpu MHz         : 3265.821
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips        : 6585.02
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual

I was playing around with some code when I ran into a weird stall issue that degrades run-time by a factor of 2.5(!!)
Back in the K7 days (which dates me) I used to know the µ-architecture guide by heart (I even reported an undocumented stall condition), but I will confess that I haven't kept up with the details of new µ-arches for over a decade :(

Here's the code in question:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 512
typedef unsigned long long u64;

void inner(u64 *acc, const u64 *a, const u64 *b)
{
#if V1
	asm("add %[LO], %[D0]\n\t"
	    "adc %[HI], %[D1]\n\t"
	    "adc $0, %[D2]"
	    : [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2])
	    : [LO] "r" (*a), [HI] "r" (*b)
	    : "cc");
#elif V2
	asm("add %[LO], %[D0]\n\t"
	    "adc %[HI], %[D1]\n\t"
	    "#adc $0, %[D2]"
	    : [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2])
	    : [LO] "r" (*a), [HI] "r" (*b)
	    : "cc");
#endif
}

static int min(int u, int v) { return u < v ? u : v; }

void fun1(u64 *acc, const u64 *a, const u64 *b)
{
	for (int i = 0; i < N; ++i)
		for (int j = 0; j < N; ++j)
			inner(acc+i+j, a+i, b+j);
}

void fun2(u64 *acc, const u64 *a, const u64 *b)
{
	for (int sum = 0; sum < 2*N-1; ++sum) {
		int v = min(sum, N-1);
		int u = sum - v;
		for (int i = u; i <= v; ++i)
			inner(acc+sum, a+i, b+sum-i);
	}
}

u64 A[N], B[N], ACC1[N*2], ACC2[N*2];

int main(int argc, char **argv)
{
	for (int i = 0; i < N; ++i) { A[i] = rand(); B[i] = rand(); }

	if (argc < 2) {
		fun1(ACC1, A, B);
		fun2(ACC2, A, B);
		printf("fun1 vs fun2 = %d\n", memcmp(ACC1, ACC2, sizeof(ACC1)));
		return 0;
	}

	int nf = atoi(argv[1]);
	for (int xp = 0; xp < 1000; ++xp) {
		if (nf == 1) fun1(ACC1, A, B);
		if (nf == 2) fun2(ACC1, A, B);
	}

	return 0;
}

In the timed runs the code touches 4*N*8 data bytes = 16 KB (A, B and ACC1), which should /entirely/ fit in L1 D$ (Haswell has a 32 KB, 8-way set-associative L1 D$ per core).

fun1 and fun2 perform /exactly/ the same calculation, but in a different order.

inner_V2 = inner_V1 with the 3rd instruction (the final ADC) commented out.
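In case the inline asm is hard to read, here is a rough portable sketch of what inner() is meant to compute with -DV1: it adds *a into acc[0], adds *b plus the carry into acc[1], and propagates the remaining carry into acc[2]. The name inner_ref is hypothetical and is not part of the program I timed; the compiler will not necessarily turn it into the same add/adc/adc sequence, so it is only useful for checking results, not for reproducing the timings.

```c
typedef unsigned long long u64;

/* Hypothetical reference version of inner() as built with -DV1.
 * Carries are computed by hand instead of relying on the CPU flags. */
void inner_ref(u64 *acc, const u64 *a, const u64 *b)
{
	u64 lo = *a, hi = *b;

	u64 d0 = acc[0] + lo;
	u64 c0 = d0 < lo;        /* carry out of the ADD */

	u64 d1 = acc[1] + hi;
	u64 c1 = d1 < hi;        /* carry from adding hi */
	d1 += c0;
	c1 |= d1 < c0;           /* carry from adding the incoming carry */

	acc[0] = d0;
	acc[1] = d1;
	acc[2] += c1;            /* the "adc $0" step; V2 skips this */
}
```

With -DV2 the last statement has no counterpart, i.e. the carry out of acc[1] is simply dropped.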
Here are the observed run-times on my system:

$ gcc -Wall -O2 -march=native -DV1 slower.c -o v1.out
$ gcc -Wall -O2 -march=native -DV2 slower.c -o v2.out

$ time ./v1.out 1
real	0m7,362s
user	0m7,328s
sys	0m0,004s

$ time ./v1.out 2
real	0m2,895s
user	0m2,895s
sys	0m0,000s

$ time ./v2.out 1
real	0m2,896s
user	0m2,884s
sys	0m0,000s

$ time ./v2.out 2
real	0m2,888s
user	0m2,888s
sys	0m0,000s

Why in heaven's name is fun1 with inner_V1 2.5 times slower than each of
  fun1 with inner_V2
  fun2 with inner_V1
  fun2 with inner_V2
???

Some kind of memory-aliasing issue?

(I can show the generated assembly if someone thinks it's useful, but it's pretty much what I expected.)

Regards