Benjamin Redelings I wrote:
> 2. Interestingly, the following is recognized WITHOUT -ffast-math:
>   for(i=0;i<argc;i++)
>     f4[i] += f1[i]*f2[i]*f3[i];
That's not a reduction; re-association from strict C standard order
isn't required to vectorize it.
icc makes a similar distinction: this loop should vectorize whether or
not icc -fp-model source is set, for example.
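
To illustrate why no re-association is involved, here is a minimal
sketch, not taken from any compiler's output. Both function names are
hypothetical, and the 4-wide blocking is only a hand-written model of
what a vector loop does. Each f4[i] is computed from index i alone, by
the same expression in both versions, so grouping the iterations in
blocks of 4 does not change the order of operations for any element.

```c
#include <stddef.h>

/* Element-wise update, written as in the quoted loop. */
static void update_scalar(float *f4, const float *f1, const float *f2,
                          const float *f3, size_t n)
{
    for (size_t i = 0; i < n; i++)
        f4[i] += f1[i] * f2[i] * f3[i];
}

/* Hypothetical model of a 4-wide vector loop: the same per-element
   expression, with iterations grouped in blocks of 4 plus a scalar
   remainder loop for the leftover elements. */
static void update_blocked4(float *f4, const float *f1, const float *f2,
                            const float *f3, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; k++)
            f4[i + k] += f1[i + k] * f2[i + k] * f3[i + k];
    for (; i < n; i++)
        f4[i] += f1[i] * f2[i] * f3[i];
}
```

No result of one iteration feeds another, which is why the compiler is
free to vectorize this without relaxing floating-point rules.
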
> If I change this to the following, then it needs -ffast-math:
>   for(i=0;i<argc;i++)
>     sum += f1[i]*f2[i]*f3[i];
> This is essentially doing the first thing, plus also summing the f4[i].
> I guess that is the problem?
Yes. Vectorizing the reduction keeps at least 4 parallel partial sums
(for the float data type) and adds the partials at the end, which gives
a numerically different result from the non-vector case (often, but not
always, slightly more accurate). The result may also vary slightly with
alignment, and may differ according to whether -msse3 is set.
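
As a rough illustration of the parallel sums, here is a hand-written
sketch. It is not what the compiler emits; the function names, the
fixed lane count of 4, and the pairwise order in which the partials are
combined are all assumptions made for the example.

```c
#include <stddef.h>

/* Strict C order: one accumulator, additions performed in index order. */
static float sum_serial(const float *f1, const float *f2,
                        const float *f3, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += f1[i] * f2[i] * f3[i];
    return sum;
}

/* Hypothetical model of a 4-lane vectorized reduction: lane k
   accumulates elements k, k+4, k+8, ... and the four partial sums are
   combined only after the main loop. */
static float sum_4lanes(const float *f1, const float *f2,
                        const float *f3, size_t n)
{
    float part[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; k++)
            part[k] += f1[i + k] * f2[i + k] * f3[i + k];

    /* Combine the partials (the pairing order here is an assumption). */
    float sum = (part[0] + part[1]) + (part[2] + part[3]);

    /* Scalar remainder for the elements that do not fill a block. */
    for (; i < n; i++)
        sum += f1[i] * f2[i] * f3[i];
    return sum;
}
```

With f2 and f3 all ones and f1 = {1e8, 1, 1, 1, -1e8, 1, 1, 1}, and
assuming IEEE single precision with float expressions evaluated in
float, sum_serial returns 3 because the first three 1s are absorbed by
1e8, while sum_4lanes returns the exact answer 6. The additions are the
same; only their association differs, which is what -ffast-math permits.
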