Utilizing GCC Prefetch Analysis -- Instructions not being generated

Malek Musleh <malek.musleh@xxxxxxxxx> · Thu, 21 Aug 2014 00:33:42 -0400

Hi,

I am trying to determine the performance impact of gcc's internal
software prefetching analysis. I have compiled my benchmarks with the
following flags:

CFLAGS=-O3 -ffast-math -funroll-loops -fprefetch-loop-arrays

However, after compiling and examining the objdump output of the
binary, I do not see any inserted prefetch instructions. Specifically,
I am using an Alpha cross compiler (gcc version 4.2, so I know it has
prefetching support), and as I understand it the prefetch instructions
that should be generated are lds, ldl, or ldq with the zero register
($31/$f31) as the destination:

http://www.eecg.toronto.edu/~moshovos/ACA05/read/Performance%20tips%20for%20Alpha%20Linux%20C%20programmers.htm

My example program code snippet is:

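/* Declarations were left out of the snippet as posted;
   the array sizes here are assumed. */
static int a[10000][10000];
static int b[10001][1];   /* 10001 rows because the loop reads b[j+1] */

int main (int argc, char *argv[])
{
  int i, j;

  for (i = 0; i < 10000; i++){
    for (j = 0; j < 10000; j++){
      a[i][j] = b[j][0] + b[j+1][0];
    }
  }
  return 0;
}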

The loops are large and regular enough that the analysis pass should
determine that prefetching is possible. Would anyone know why the
instructions are not being generated, or whether objdump is simply not
decoding those prefetch instructions?

As a separate note, I did try to use the gcc prefetch intrinsics, and
examined the objdump:

        __builtin_prefetch (&a[i+j], 1, 1);
   12000060c:   20 00 4f a0     .long 0xa04f0020
   120000610:   1c 00 2f a0     .long 0xa02f001c
   120000614:   01 00 41 40     .long 0x40410001
   120000618:   01 00 e1 43     .long 0x43e10001
   12000061c:   42 16 20 40     .long 0x40201642
   120000620:   30 00 2f 20     lda     t0,48(fp)
   120000624:   01 04 22 40     .long 0x40220401
   120000628:   00 00 e1 8b     .long 0x8be10000
        __builtin_prefetch (&b[i+j], 0, 1);
   12000062c:   20 00 4f a0     .long 0xa04f0020
   120000630:   1c 00 2f a0     .long 0xa02f001c
   120000634:   01 00 41 40     .long 0x40410001
   120000638:   01 00 e1 43     .long 0x43e10001
   12000063c:   42 16 20 40     .long 0x40201642
   120000640:   70 1f 2f 20     lda     t0,8048(fp)
   120000644:   01 04 22 40     .long 0x40220401
   120000648:   00 00 e1 a3     .long 0xa3e10000

In this case, it seems that the compiler is generating a different set
of instructions for the prefetch intrinsic, and not using what the
Alpha manual says.

Thanks,

Malek