On 05/27/2012 03:27 PM, Chris Kerr wrote:
Hi,
I'm working on RMCProfile ( http://www.rmcprofile.org ), trying to reduce the
run time (which is on the order of 3 CPU years for my datasets). I've done
some profiling and about 75% of the runtime is spent executing an operation of
the form y(p:q) += a * x(r:s) [where q-p == s-r] - i.e. DAXPY but on slices of
the original y and x arrays.
I've tried implementing this both as a do loop and using Fortran array
intrinsics. The version using the do loop is significantly faster: on a reduced
problem set that takes ~1 hour, the overall running time of the program is 7%
lower. Looking at the generated assembly, the do loop uses
        incq    %rax
        cmpq    %rcx, %rax
        jne     .L5
to increment and check the loop counter, whereas the array intrinsic uses
        movq    %rdx, %rdi
        leaq    1(%rdi), %rdx
        cmpq    %rdi, %r8
        jge     .L13
The same pattern can be reproduced using much simpler source files - I've
attached the fortran source and the assembly output.
NB: as far as I can tell, none of rax, rdx or rdi above plays any part in its
respective calculation; they function purely as counters.
I have two questions about the above behaviour:
1) Why does the array intrinsic method use an extra instruction compared to
the do loop? Is there any way of stopping this?
2) Is there any way of getting these loops to use SIMD instructions such as
vfmaddpd? Currently even when -march=native is switched on, the loop only uses
the scalar instruction vfmaddsd. I'd rather not have to hand-code an unrolled
loop, especially as I'm more used to C and Python so there would probably be
off-by-one errors all over the place on my first ten tries.
Thanks in advance,
Chris Kerr
Department of Chemistry
University of Cambridge

Compilers appear to be concerned about data overlaps for these module
subroutines. ifort generates the same vectorized code body (unrolled by
4 plus vectorization) in each case, but with a scalar copy of the
arrays, which may sharply cut the advantage of vectorization.
gfortran -ftree-vectorizer-verbose complains about storage layout:
Analyzing loop at ck.f90:12
12: not vectorized: data dependence conflict prevents gather/strided load
D.2022_26 = *y.0_6[D.2019_25];
One would expect ready vectorization of these examples in the usual
context (not using modules). Without alignment specifications, avx-128
code generation is likely.
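
As a way of testing the overlap explanation, here is a minimal sketch of
the same update written as plain external subroutines (no module) with
explicit-shape dummy arguments. The attached source is not reproduced in
this message, so every name below (axpy_loop, axpy_array, check_axpy, the
array size and the slice bounds) is hypothetical, and whether the attached
routines differ from this in the relevant way is an assumption. The idea
is only that, for dummy arguments without the pointer or target attribute,
the Fortran argument rules let the compiler assume that x and y do not
overlap when y is modified.

```fortran
! Hypothetical sketch; not the routines attached to the original post.

! Do-loop form. x and y are explicit-shape dummy arguments, so the
! compiler may assume they do not overlap while y is being updated.
subroutine axpy_loop(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in) :: a
  double precision, intent(in) :: x(n)
  double precision, intent(inout) :: y(n)
  integer :: i
  do i = 1, n
    y(i) = y(i) + a * x(i)
  end do
end subroutine axpy_loop

! Array-syntax form of the same update.
subroutine axpy_array(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in) :: a
  double precision, intent(in) :: x(n)
  double precision, intent(inout) :: y(n)
  y = y + a * x
end subroutine axpy_array

program check_axpy
  implicit none
  integer, parameter :: m = 1000
  double precision :: x(m), y1(m), y2(m)
  integer :: i, p, q, r, s

  do i = 1, m
    x(i) = dble(i)
  end do
  y1 = 1.0d0
  y2 = 1.0d0

  ! Slice bounds chosen so that q - p == s - r, as in the original post.
  p = 101
  q = 600
  r = 201
  s = 700

  ! Contiguous slices are passed directly to the explicit-shape dummies.
  call axpy_loop(q - p + 1, 2.0d0, x(r:s), y1(p:q))
  call axpy_array(q - p + 1, 2.0d0, x(r:s), y2(p:q))

  ! Both forms should produce identical results on this data.
  if (any(y1 /= y2)) then
    print *, "loop and array-syntax forms disagree"
    stop 1
  end if
  print *, "loop and array-syntax forms agree"
end program check_axpy
```

Compiling this with something like gfortran -O3 -march=native
-ftree-vectorizer-verbose=2 should show whether each of the two loops is
reported as vectorized. This has not been checked against the attached
files, so treat it as a starting point for comparison rather than a
confirmed fix.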
--
Tim Prince