Tickling a weird CPU stall on Haswell

Hello everyone,

I'm running Linux on Haswell.
https://en.wikichip.org/wiki/intel/cpuid
https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client)

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 60
model name	: Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
stepping	: 3
microcode	: 0x28
cpu MHz		: 3265.821
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips	: 6585.02
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual


I was playing around with some code when I ran into a weird stall
issue that degrades run time by a factor of 2.5 (!!).

Back in the K7 days (which dates me), I knew the µ-architecture guide
by heart (I even reported an undocumented stall condition), but I'll confess
that I haven't kept up with the details of newer µ-arches for over a decade :(


Here's the code in question:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 512

typedef unsigned long long u64;

void inner(u64 *acc, const u64 *a, const u64 *b)
{
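  /* Treat acc[0..2] as one little-endian 192-bit accumulator:
     add *a into acc[0], add *b plus the carry into acc[1], and (V1 only)
     propagate the final carry into acc[2]; V2 comments that last ADC out. */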
#if V1
  asm("add %[LO], %[D0]\n\t" "adc %[HI], %[D1]\n\t" "adc $0, %[D2]" :
  [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2]) :
  [LO] "r" (*a), [HI] "r" (*b) : "cc");
#elif V2
  asm("add %[LO], %[D0]\n\t" "adc %[HI], %[D1]\n\t" "#adc $0, %[D2]" :
  [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2]) :
  [LO] "r" (*a), [HI] "r" (*b) : "cc");
#endif
}

static int min(int u, int v) { return u < v ? u : v; }

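/* Visit all N*N (i, j) pairs row by row; each call accumulates into acc[i+j]. */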
void fun1(u64 *acc, const u64 *a, const u64 *b)
{
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      inner(acc+i+j, a+i, b+j);
}

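/* Same N*N pairs as fun1, but grouped by anti-diagonal: all pairs with
   i + j == sum are handled before moving on to the next acc[sum]. */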
void fun2(u64 *acc, const u64 *a, const u64 *b)
{
  for (int sum = 0; sum < 2*N-1; ++sum) {
    int v = min(sum, N-1);
    int u = sum - v;
    for (int i = u; i <= v; ++i)
      inner(acc+sum, a+i, b+sum-i);
  }
}

u64 A[N], B[N], ACC1[N*2], ACC2[N*2];

int main(int argc, char **argv)
{
  for (int i = 0; i < N; ++i) {
    A[i] = rand();
    B[i] = rand();
  }

  if (argc < 2) {
    fun1(ACC1, A, B);
    fun2(ACC2, A, B);
    printf("fun1 vs fun2 = %d\n", memcmp(ACC1, ACC2, sizeof(ACC1)));
    return 0;
  }

  int nf = atoi(argv[1]);
  for (int xp = 0; xp < 1000; ++xp) {
    if (nf == 1) fun1(ACC1, A, B);
    if (nf == 2) fun2(ACC1, A, B);
  }

  return 0;
}


During the timed loop the code touches 4*N*8 data bytes = 16 KB
(only A, B, and ACC1 are accessed), which should fit /entirely/ in L1 D$
(Haswell's L1 D$ is 32 KB per core, 8-way set associative).
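Spelling out the arithmetic with N = 512:

sizeof(A) + sizeof(B) + sizeof(ACC1)
  = 512*8 + 512*8 + 1024*8
  = 4096 + 4096 + 8192
  = 16384 bytes = 16 KB, i.e. half of the L1 D$.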

fun1 and fun2 perform /exactly/ the same calculation,
but in a different order.
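To make the traversal difference concrete, here is a tiny stand-alone sketch
(separate from the benchmark; SMALL_N and small_min are just illustration
helpers) that prints the (i, j) pairs each loop nest feeds to inner():

#include <stdio.h>

#define SMALL_N 3

static int small_min(int u, int v) { return u < v ? u : v; }

int main(void)
{
  /* fun1: row-major order; pair (i, j) goes into acc[i + j] */
  printf("fun1 order:");
  for (int i = 0; i < SMALL_N; ++i)
    for (int j = 0; j < SMALL_N; ++j)
      printf(" (%d,%d)->acc[%d]", i, j, i + j);

  /* fun2: anti-diagonal order; every pair with i + j == sum is handled
     before moving on to the next acc[] slot */
  printf("\nfun2 order:");
  for (int sum = 0; sum < 2 * SMALL_N - 1; ++sum) {
    int v = small_min(sum, SMALL_N - 1);
    int u = sum - v;
    for (int i = u; i <= v; ++i)
      printf(" (%d,%d)->acc[%d]", i, sum - i, sum);
  }
  printf("\n");
  return 0;
}

Both loops emit the same pairs, just in a different order, which is why the
memcmp() check in main() reports identical results.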

inner_V2 is inner_V1 with the third instruction ("adc $0, %[D2]") commented out.

Here are the observed run-times on my system:

$ gcc -Wall -O2 -march=native -DV1 slower.c -o v1.out
$ gcc -Wall -O2 -march=native -DV2 slower.c -o v2.out

$ time ./v1.out 1
real	0m7,362s
user	0m7,328s
sys	0m0,004s

$ time ./v1.out 2
real	0m2,895s
user	0m2,895s
sys	0m0,000s

$ time ./v2.out 1
real	0m2,896s
user	0m2,884s
sys	0m0,000s

$ time ./v2.out 2
real	0m2,888s
user	0m2,888s
sys	0m0,000s


Why in heaven's name is fun1 with inner_V1
2.5 times slower than any of
fun1 with inner_V2,
fun2 with inner_V1, or
fun2 with inner_V2 ???

Some kind of memory-aliasing issue?
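One thing I still plan to try (assuming perf is available here and that these
event names exist for this µ-arch) is comparing store-forwarding blocks and
4K-aliasing counts between the slow and fast cases of the same binary:

$ perf stat -e cycles,instructions,ld_blocks.store_forward,ld_blocks_partial.address_alias ./v1.out 1
$ perf stat -e cycles,instructions,ld_blocks.store_forward,ld_blocks_partial.address_alias ./v1.out 2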

(I can show generated assembly if someone thinks it's useful,
but it's pretty much what I expected.)

Regards


