Re: Autovectorization help

Tim Prince <n8tm@xxxxxxx> · Thu, 04 Oct 2012 12:02:59 -0400

On 10/4/2012 9:52 AM, Nicklas Bo Jensen wrote:
Thanks for your response Tim,

I'm not familiar with the term "simulated gather" and Google doesn't
help much. Would it be something like performing the memory operations
with scalar instructions first and then performing the reduction with
vector instructions? Does gcc have such capabilities?

Thanks,
Nicklas

On Thu, Oct 4, 2012 at 3:04 PM, Tim Prince <n8tm@xxxxxxx> wrote:
On 10/4/2012 8:20 AM, Nicklas Bo Jensen wrote:
Hi,

I'm trying to use the autovectorizer in gcc 4.7.2. However I'm getting
bad data references in the .vect comments generated using
-ftree-vectorizer-verbose=, even though I have used restrict on all
arrays. It's a reference with two arrays where one array is used to
index the other: array1[array2[index]]. I don't see how this should
not be vectorizable as we only read from the arrays.

Is this not supported or is there some clever way to rewrite this?

Example:

int foo(int * restrict array1, int * restrict array2) {
    int res = 0;
    for (int i = 0; i < 50000; i++) {
      int v = array2[array1[i]]; // This gives bad data reference comment in .vect file.
      res += v * v;
    }
    return res;
}

You would require "simulated gather" to take advantage of SSE4 or AVX scalar
to vector register moves (so your -march setting enters in).  restrict has
no bearing here, as you modify only res and v which have segregated scope
from the data regions accessed by pointer.
Intel compilers tend to require pragma stuff such as #pragma simd
reduction(+: res) to promote "vectorization" using simulated gather.
Evidently, such idioms are typically used with floating point data types and
-ffast-math or equivalent options to enable associative-math.

--
Tim Prince

Yes, "simulated gather" uses the 32- or 64-bit scalar moves to fill
slots in the 128- or 256-bit register, for cases like yours where the
vector components aren't contiguous in memory.  I don't know whether gcc
would use that term, or whether anyone is working on it.  As the term
implies, once hardware or firmware gather instruction support is
introduced on future CPUs, the compiler could switch it in under
appropriate -march options.  I looked too, and didn't find any web search
references to the terminology.
In the case you pose, only the initial data moves of the one distributed
(non-contiguous) vector need to use the scalar move simulation, so the
squaring and the reduction can run as vector operations and vectorization
can be effective.
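
A minimal hand-written sketch of the idea follows. It is not what gcc or the Intel compiler actually emits. It assumes the GCC/Clang vector_size extension is available. The names foo_simulated_gather and v4si and the length parameter n are hypothetical and are not part of the original example. Four scalar loads fill the four lanes of one 128-bit vector, and the multiply and the running sum are then done on whole vectors. Since the data are int, reordering the sum does not change the result as long as nothing overflows.

```c
#include <stddef.h>

/* Four 32-bit ints in one 128-bit vector (GCC/Clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical hand-vectorized variant of foo(); n replaces the fixed 50000. */
int foo_simulated_gather(const int *restrict array1,
                         const int *restrict array2, size_t n)
{
    v4si acc = {0, 0, 0, 0};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* "Simulated gather": scalar loads fill the vector slots,
           because array2[array1[i]] is not contiguous in memory. */
        v4si v = { array2[array1[i]],     array2[array1[i + 1]],
                   array2[array1[i + 2]], array2[array1[i + 3]] };
        acc += v * v;               /* lane-wise multiply and accumulate */
    }

    /* Horizontal sum of the four partial sums. */
    int res = acc[0] + acc[1] + acc[2] + acc[3];

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        int v = array2[array1[i]];
        res += v * v;
    }
    return res;
}
```

Whether this beats the plain scalar loop depends on the target and on how much work follows the loads. Here only a multiply and an add are done per element, so any gain is likely to be modest.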

--
Tim Prince