Thomas Witzel wrote:
The option -fprefetch-loop-arrays generates prefetch commands for non-vectorized code, but not for vectorized code. Is that the intended functionality? Is there a way to get prefetching for the vectorized routines as well?

Thanks,
Thomas

Example (compiled with g++-4.5.0):

    g++ -O3 -fcx-fortran-rules -fprefetch-loop-arrays -mtune=core2 -march=core2 -mssse3 -S -c ../test_loop.cpp

The code for a complex multiplication loop written this way:

    void f(std::complex<float> *a, std::complex<float> *b, std::complex<float> *r)
    {
        for(std::size_t s=0; s<N; s++)
            r[s] = a[s]*b[s];
    }

is generated two-fold, one vectorized (.L3) and one not (L5):
It's certainly hard to guess the effect of prefetching only in the remainder loop (early 32-bit Pentium 4 style?). Since you have set -mtune=core2, it seems reasonable that the compiler would not optimize for Athlon-32, which may have been the most recent common CPU without effective hardware prefetch for vectorized loops. I don't really expect gcc to attempt further optimization specific to -mssse3, now that it's about 2 years out of production.
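
If you want to experiment with software prefetch in the vectorized path anyway, one possible approach is to issue the prefetches by hand with GCC's __builtin_prefetch. The following is only a sketch under stated assumptions, not something gcc does for you: the function name mul_prefetch, the block size, the prefetch distance and the 64-byte cache line are all hypothetical choices of mine, and the explicit length n replaces the N from your example. The prefetches are kept in a separate loop so that the inner multiplication loop stays a plain loop; whether gcc still vectorizes it, and whether any of this helps on a CPU with working hardware prefetch, would have to be checked in the generated assembly and by timing.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical tuning values, not taken from the thread; adjust by measurement.
static const std::size_t kBlock = 64;          // elements handled per inner loop
static const std::size_t kAhead = 4 * kBlock;  // prefetch distance in elements
static const std::size_t kPerLine = 8;         // assumes 64-byte lines, 8-byte complex<float>

void mul_prefetch(const std::complex<float> *a, const std::complex<float> *b,
                  std::complex<float> *r, std::size_t n)
{
    for (std::size_t base = 0; base < n; base += kBlock) {
        // Prefetch a block that lies kAhead elements further on, but only
        // when that whole block is still inside the arrays.
        if (base + kAhead + kBlock <= n) {
            for (std::size_t off = 0; off < kBlock; off += kPerLine) {
                __builtin_prefetch(&a[base + kAhead + off], 0, 3);  // read
                __builtin_prefetch(&b[base + kAhead + off], 0, 3);  // read
                __builtin_prefetch(&r[base + kAhead + off], 1, 3);  // write
            }
        }
        // Plain inner loop without builtins, left for the vectorizer.
        const std::size_t end = (base + kBlock < n) ? base + kBlock : n;
        for (std::size_t s = base; s < end; ++s)
            r[s] = a[s] * b[s];
    }
}

// Self-check against the original plain loop. Inputs are small integers so
// that every product is exactly representable in float.
bool matches_plain_loop(std::size_t n)
{
    std::vector<std::complex<float> > a(n), b(n), r(n), ref(n);
    for (std::size_t s = 0; s < n; ++s) {
        a[s] = std::complex<float>(float(s % 7), 1.0f);
        b[s] = std::complex<float>(2.0f, float(s % 5));
        ref[s] = a[s] * b[s];
    }
    mul_prefetch(a.data(), b.data(), r.data(), n);
    for (std::size_t s = 0; s < n; ++s)
        if (r[s] != ref[s])
            return false;
    return true;
}
```

Compiling this with the same options as in your example and -S should show whether prefetch instructions now appear next to the vectorized loop body.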