On 10/4/2012 9:52 AM, Nicklas Bo Jensen wrote:
Thanks for your response Tim,
I'm not familiar with the term "simulated gather" and google doesn't
help much. Would it be something like performing the memory operations
with scalar instructions first and then perform the reduction with
vector instructions? Does gcc have such capabilities?
Thanks,
Nicklas
On Thu, Oct 4, 2012 at 3:04 PM, Tim Prince <n8tm@xxxxxxx> wrote:
On 10/4/2012 8:20 AM, Nicklas Bo Jensen wrote:
Hi,
I'm trying to use the autovectorizer in gcc 4.7.2. However I'm getting
bad data references in the .vect comments generated using
-ftree-vectorizer-verbose=, even though I have used restrict on all
arrays. It's a reference with two arrays where one array is used to
index the other: array2[array1[index]]. I don't see how this should
not be vectorizable as we only read from the arrays.
Is this not supported or is there some clever way to rewrite this?
Example:
int foo(int * restrict array1, int * restrict array2) {
    int res = 0;
    for (int i = 0; i < 50000; i++) {
        /* This gives the bad data reference comment in the .vect file. */
        int v = array2[array1[i]];
        res += v * v;
    }
    return res;
}
You would require "simulated gather" to take advantage of SSE4 or AVX scalar
to vector register moves (so your -march setting enters in). restrict has
no bearing here, as you modify only res and v which have segregated scope
from the data regions accessed by pointer.
Intel compilers tend to require pragma stuff such as #pragma simd
reduction(+: res) to promote "vectorization" using simulated gather.
Evidently, such idioms are typically used with floating point data types and
-ffast-math or equivalent options to enable associative-math.
--
Tim Prince
Yes, "simulated gather" uses the 32- or 64-bit scalar moves to fill
slots in the 128- or 256-bit register, for cases like yours where the
vector components aren't contiguous in memory. I don't know whether gcc
would use that term, if anyone is working on it. As the term implies,
once hardware or firmware gather instructions are introduced on
future CPUs, the compiler could switch them in under appropriate -march
options. I looked too, and didn't find any web search references to the
terminology.
In the case you pose, only the initial data moves of the one distributed
vector need to use the scalar move simulation, so vectorization can be
effective.
--
Tim Prince
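
For readers following the thread, here is a rough hand-written sketch of
what "simulated gather" amounts to for the loop above. It is not what gcc
or the Intel compiler actually emits. It assumes the GCC/Clang vector
extension (vector_size), and the names v4si, foo_scalar and foo_gather are
made up for illustration. The array length is passed in as n instead of the
fixed 50000, and restrict is dropped since it has no bearing here. Four
scalar loads fill the lanes of one 128-bit value, and the multiply and the
accumulation then run on whole vectors.

```c
#include <stddef.h>

/* GCC/Clang vector extension (an assumption of this sketch):
   four 32-bit ints packed into one 128-bit value. */
typedef int v4si __attribute__((vector_size(16)));

/* Reference scalar loop, same shape as the example in the thread. */
int foo_scalar(const int *idx, const int *table, size_t n)
{
    int res = 0;
    for (size_t i = 0; i < n; i++) {
        int v = table[idx[i]];
        res += v * v;
    }
    return res;
}

/* Hypothetical hand-written "simulated gather" version. */
int foo_gather(const int *idx, const int *table, size_t n)
{
    v4si acc = {0, 0, 0, 0};
    size_t i = 0;

    /* Main loop: four indexed scalar loads fill the four lanes,
       then the square and the accumulation are vector operations. */
    for (; i + 4 <= n; i += 4) {
        v4si g = { table[idx[i]],     table[idx[i + 1]],
                   table[idx[i + 2]], table[idx[i + 3]] };
        acc += g * g;
    }

    /* Horizontal sum of the four partial results. */
    int res = acc[0] + acc[1] + acc[2] + acc[3];

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        int v = table[idx[i]];
        res += v * v;
    }
    return res;
}
```

The index array itself is contiguous, so only the table lookups need the
scalar moves. For int data the reordered sum matches the scalar loop
exactly; with floating point the reordering is why -ffast-math or
associative-math comes into it.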