OK, thanks for the explanation.

I tested splitting the loop into two and creating a temporary array,
having the first loop load the data and the second perform the
reduction. In the application I'm testing, this yielded a speedup of
around 1.3, which might be similar to "simulated gather".

Thanks,
Nicklas

On Thu, Oct 4, 2012 at 6:02 PM, Tim Prince <n8tm@xxxxxxx> wrote:
> On 10/4/2012 9:52 AM, Nicklas Bo Jensen wrote:
>>
>> Thanks for your response Tim,
>>
>> I'm not familiar with the term "simulated gather" and Google doesn't
>> help much. Would it be something like performing the memory operations
>> with scalar instructions first and then performing the reduction with
>> vector instructions? Does gcc have such capabilities?
>>
>> Thanks,
>> Nicklas
>>
>>
>>
>> On Thu, Oct 4, 2012 at 3:04 PM, Tim Prince <n8tm@xxxxxxx> wrote:
>>>
>>> On 10/4/2012 8:20 AM, Nicklas Bo Jensen wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to use the autovectorizer in gcc 4.7.2. However I'm getting
>>>> bad data references in the .vect comments generated using
>>>> -ftree-vectorizer-verbose=, even though I have used restrict on all
>>>> arrays. It's a reference with two arrays where one array is used to
>>>> index the other: array1[array2[index]]. I don't see how this should
>>>> not be vectorizable as we only read from the arrays.
>>>>
>>>> Is this not supported or is there some clever way to rewrite this?
>>>>
>>>> Example:
>>>>
>>>> int foo(int * restrict array1, int * restrict array2) {
>>>>     int res = 0;
>>>>     for (int i = 0; i < 50000; i++) {
>>>>         int v = array2[array1[i]]; //This gives bad data reference comment in .vect file.
>>>>         res += v * v;
>>>>     }
>>>>     return res;
>>>> }
>>>>
>>>>
>>> You would require "simulated gather" to take advantage of SSE4 or AVX
>>> scalar to vector register moves (so your -march setting enters in).
>>> restrict has no bearing here, as you modify only res and v, which have
>>> segregated scope from the data regions accessed by pointer.
>>> Intel compilers tend to require pragma stuff such as
>>> #pragma simd reduction(+: res)
>>> to promote "vectorization" using simulated gather.
>>> Evidently, such idioms are typically used with floating point data
>>> types and -ffast-math or equivalent options to enable associative-math.
>>>
>>> --
>>> Tim Prince
>
> Yes, "simulated gather" uses the 32- or 64-bit scalar moves to fill slots in
> the 128- or 256-bit register, for cases like yours where the vector
> components aren't contiguous in memory. I don't know whether gcc would use
> that term, if anyone is working on it. As the term implies, as hardware or
> firmware gather instruction support is introduced on future CPUs, the
> compiler could switch it in under appropriate -march options. I looked too,
> and didn't find any web search references to the terminology.
> In the case you pose, only the initial data moves of the one distributed
> vector need to use the scalar move simulation, so vectorization can be
> effective.
>
> --
> Tim Prince
>
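
For reference, the loop split described at the top of this message might
look roughly like the sketch below. It is only an illustration, not the
code from the application: foo is the function from the quoted example
(with const added), while foo_split, tmp, N and check_split are names made
up here. The idea is that the first loop does the indirect loads with
ordinary scalar code, and the second loop is a plain contiguous reduction,
which the autovectorizer should have a much easier time with. It needs C99
(e.g. -std=c99) for restrict and the loop-scoped declarations.

```c
#include <assert.h>

#define N 50000   /* trip count from the quoted example */

/* Original form: the indirect load array2[array1[i]] is what the
   vectorizer reports as a bad data reference. */
int foo(const int *restrict array1, const int *restrict array2)
{
    int res = 0;
    for (int i = 0; i < N; i++) {
        int v = array2[array1[i]];
        res += v * v;
    }
    return res;
}

/* Split form (hypothetical name). */
int foo_split(const int *restrict array1, const int *restrict array2)
{
    /* Assumption: a static temporary, which keeps 200 KB off the stack
       but makes the function non-reentrant. */
    static int tmp[N];

    /* Loop 1: scalar gather of the indexed elements into a contiguous
       temporary array. */
    for (int i = 0; i < N; i++)
        tmp[i] = array2[array1[i]];

    /* Loop 2: contiguous sum of squares, a candidate for vectorization. */
    int res = 0;
    for (int i = 0; i < N; i++)
        res += tmp[i] * tmp[i];
    return res;
}

/* Small self-check (hypothetical): both forms must agree. The values are
   kept small so the sum of squares stays far below INT_MAX. */
int check_split(void)
{
    static int idx[N], data[N];
    for (int i = 0; i < N; i++) {
        idx[i] = (i * 31) % N;   /* scattered but in-range indices */
        data[i] = i % 7;         /* values 0..6 */
    }
    int a = foo(idx, data);
    int b = foo_split(idx, data);
    assert(a == b);
    return a == b;
}
```

Whether the second loop actually gets vectorized depends on the compiler
version and flags, so it is worth checking the .vect output again after
the change. A full-size temporary also costs extra memory traffic;
processing the data in smaller blocks with a short temporary would be a
possible variation.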