Hi,

I am trying to measure the performance impact of gcc's internal software prefetching analysis. I have compiled my benchmarks with the following flags:

    CFLAGS=-O3 -ffast-math -funroll-loops -fprefetch-loop-arrays

However, after compiling and examining the objdump of the binary, I do not see any inserted prefetch instructions. Specifically, I am using an Alpha cross compiler (gcc version 4.2, so I know it has prefetching support), and the prefetch instructions that should be generated are lds, ldl, or ldq:

http://www.eecg.toronto.edu/~moshovos/ACA05/read/Performance%20tips%20for%20Alpha%20Linux%20C%20programmers.htm

My example program code snippet is:

    int main (int argc, char *argv[])
    {
      for (i = 0; i < 10000; i++){
        for (j = 0; j < 10000; j++){
          a[i][j] = b[j][0] + b[j+1][0];
        }
      }
    }

The loops are large and regular enough that the analysis pass should determine that prefetching is possible. Would anyone know why the instructions are not being generated, or whether objdump is failing to show those prefetch instructions?

As a separate note, I did try the gcc prefetch intrinsic and examined the objdump:

    __builtin_prefetch (&a[i+j], 1, 1);
       12000060c:   20 00 4f a0     .long 0xa04f0020
       120000610:   1c 00 2f a0     .long 0xa02f001c
       120000614:   01 00 41 40     .long 0x40410001
       120000618:   01 00 e1 43     .long 0x43e10001
       12000061c:   42 16 20 40     .long 0x40201642
       120000620:   30 00 2f 20     lda   t0,48(fp)
       120000624:   01 04 22 40     .long 0x40220401
       120000628:   00 00 e1 8b     .long 0x8be10000

    __builtin_prefetch (&b[i+j], 0, 1);
       12000062c:   20 00 4f a0     .long 0xa04f0020
       120000630:   1c 00 2f a0     .long 0xa02f001c
       120000634:   01 00 41 40     .long 0x40410001
       120000638:   01 00 e1 43     .long 0x43e10001
       12000063c:   42 16 20 40     .long 0x40201642
       120000640:   70 1f 2f 20     lda   t0,8048(fp)
       120000644:   01 04 22 40     .long 0x40220401
       120000648:   00 00 e1 a3     .long 0xa3e10000

In this case, it seems that the compiler is generating a different set of instructions for the prefetch intrinsic, rather than the ones the Alpha manual describes.
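Since objdump leaves most of the words in the listing above as raw .long values, one way to inspect them is to split the fields by hand. The sketch below is only a rough aid and rests on assumptions that are not stated in this post: the standard Alpha memory-instruction layout (opcode in bits 31-26, Ra in bits 25-21, Rb in bits 20-16, signed 16-bit displacement in bits 15-0), the opcode values 0x08/0x22/0x23/0x28/0x29 for lda/lds/ldt/ldl/ldq, and the convention that register 31 is the zero register. The names decode_mem, load_mnemonic and is_load_to_zero_reg are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Fields of an Alpha memory-format instruction word (assumed layout). */
struct alpha_mem_insn {
    unsigned opcode; /* bits 31..26 */
    unsigned ra;     /* bits 25..21: destination register of a load */
    unsigned rb;     /* bits 20..16: base register */
    int disp;        /* bits 15..0, sign-extended */
};

/* Split a 32-bit word (the value objdump prints after .long) into fields. */
struct alpha_mem_insn decode_mem(uint32_t word)
{
    struct alpha_mem_insn d;
    d.opcode = (word >> 26) & 0x3f;
    d.ra = (word >> 21) & 0x1f;
    d.rb = (word >> 16) & 0x1f;
    d.disp = (int)(word & 0xffff);
    if (d.disp & 0x8000)
        d.disp -= 0x10000; /* sign-extend the 16-bit displacement */
    return d;
}

/* Mnemonic for the few opcodes relevant here; NULL for anything else. */
const char *load_mnemonic(unsigned opcode)
{
    switch (opcode) {
    case 0x08: return "lda";
    case 0x22: return "lds";
    case 0x23: return "ldt";
    case 0x28: return "ldl";
    case 0x29: return "ldq";
    default:   return NULL;
    }
}

/* True when the word is an lds/ldt/ldl/ldq whose destination is register 31. */
int is_load_to_zero_reg(uint32_t word)
{
    struct alpha_mem_insn d = decode_mem(word);
    int is_load = d.opcode == 0x22 || d.opcode == 0x23 ||
                  d.opcode == 0x28 || d.opcode == 0x29;
    return is_load && d.ra == 31;
}
```

As a sanity check on the assumed layout, decode_mem(0x202f0030) gives opcode 0x08, ra 1, rb 15 and disp 48, which agrees with the one line objdump did decode, lda t0,48(fp). Under the same assumptions, a load whose destination field is 31 is a candidate prefetch encoding rather than an ordinary load, so is_load_to_zero_reg may be worth running on the final word of each sequence (0x8be10000 and 0xa3e10000).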
Thanks, Malek