Re: Autovectorization help

Nicklas Bo Jensen <nbjensen@xxxxxxxxx> · Thu, 4 Oct 2012 18:25:38 +0200

OK, thanks for the explanation.

I tested splitting the loop into two and creating a temporary array,
having the first loop load the data into it and the second perform the
reduction. In the application I'm testing, this yielded a speedup of
around 1.3, which might be similar to "simulated gather".
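
For illustration, the split could look roughly like the sketch below.
This is not the code from the application: the name foo_split, the
caller-provided scratch buffer tmp and the length parameter n are
assumptions made up for this example. The first loop keeps the
indirect loads scalar; the second loop only touches contiguous memory,
which is the part the autovectorizer has a chance of handling.

```c
#include <stddef.h>

/* Hypothetical split version of foo(). tmp must have room for n ints. */
int foo_split(const int *restrict array1, const int *restrict array2,
              int *restrict tmp, size_t n)
{
    int res = 0;

    /* Loop 1: indirect (gather-like) loads, expected to stay scalar. */
    for (size_t i = 0; i < n; i++)
        tmp[i] = array2[array1[i]];

    /* Loop 2: contiguous reads and a plain sum reduction. */
    for (size_t i = 0; i < n; i++)
        res += tmp[i] * tmp[i];

    return res;
}
```

Whether the second loop is actually vectorized depends on the compiler
version and options, so it is worth checking the vectorizer output
again after such a rewrite.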

Thanks,
Nicklas

On Thu, Oct 4, 2012 at 6:02 PM, Tim Prince <n8tm@xxxxxxx> wrote:
> On 10/4/2012 9:52 AM, Nicklas Bo Jensen wrote:
>>
>> Thanks for your response Tim,
>>
>> I'm not familiar with the term "simulated gather" and Google doesn't
>> help much. Would it be something like performing the memory operations
>> with scalar instructions first and then performing the reduction with
>> vector instructions? Does gcc have such capabilities?
>>
>> Thanks,
>> Nicklas
>>
>>
>>
>> On Thu, Oct 4, 2012 at 3:04 PM, Tim Prince <n8tm@xxxxxxx> wrote:
>>>
>>> On 10/4/2012 8:20 AM, Nicklas Bo Jensen wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to use the autovectorizer in gcc 4.7.2. However I'm getting
>>>> bad data references in the .vect comments generated using
>>>> -ftree-vectorizer-verbose=, even though I have used restrict on all
>>>> arrays. It's a reference with two arrays where one array is used to
>>>> index the other: array2[array1[index]]. I don't see why this should
>>>> not be vectorizable, as we only read from the arrays.
>>>>
>>>> Is this not supported or is there some clever way to rewrite this?
>>>>
>>>> Example:
>>>>
>>>> int foo(int * restrict array1, int * restrict array2) {
>>>>     int res = 0;
>>>>     for (int i = 0; i < 50000; i++) {
>>>>       // This gives a bad data reference comment in the .vect file.
>>>>       int v = array2[array1[i]];
>>>>       res += v * v;
>>>>     }
>>>>     return res;
>>>> }
>>>>
>>>>
>>> You would require "simulated gather" to take advantage of SSE4 or AVX
>>> scalar to vector register moves (so your -march setting enters in).
>>> restrict has no bearing here, as you modify only res and v which have
>>> segregated scope from the data regions accessed by pointer.
>>> Intel compilers tend to require pragma stuff such as #pragma simd
>>> reduction(+: res) to promote "vectorization" using simulated gather.
>>> Evidently, such idioms are typically used with floating point data
>>> types and -ffast-math or equivalent options to enable associative-math.
>>>
>>> --
>>> Tim Prince
>
> Yes, "simulated gather" uses the 32- or 64-bit scalar moves to fill slots in
> the 128- or 256-bit register, for cases like yours where the vector
> components aren't contiguous in memory.  I don't know whether gcc would use
> that term, if anyone is working on it.  As the term implies, as hardware or
> firmware gather instruction support is introduced on future CPUs, the
> compiler could switch it in under appropriate -march options.  I looked too,
> and didn't find any web search references to the terminology.
> In the case you pose, only the initial data moves of the one distributed
> vector need to use the scalar move simulation, so vectorization can be
> effective.
>
> --
> Tim Prince
>
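
The "simulated gather" Tim describes above (scalar moves filling the
slots of a vector register, followed by vector arithmetic) can be
sketched by hand with GCC's generic vector extensions. This is only an
illustration under assumptions: the name foo_gather, the v4si typedef
and the length parameter n are made up here, a 128-bit vector of four
ints is assumed, and nothing below is what gcc or the Intel compiler
actually emits.

```c
#include <stddef.h>

/* Four 32-bit ints in one 128-bit vector (GCC/Clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical hand-written "simulated gather" version of foo(). */
int foo_gather(const int *array1, const int *array2, size_t n)
{
    v4si acc = {0, 0, 0, 0};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Four scalar loads fill the four slots of one vector. */
        v4si v = { array2[array1[i]],     array2[array1[i + 1]],
                   array2[array1[i + 2]], array2[array1[i + 3]] };
        /* Square and accumulate all four lanes at once. */
        acc += v * v;
    }

    /* Horizontal sum of the lanes, then a scalar tail for leftovers. */
    int res = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; i++) {
        int t = array2[array1[i]];
        res += t * t;
    }
    return res;
}
```

Only the loads are done one element at a time; the multiply and the
accumulation work on whole vectors, which is why this pattern can
still pay off even without a hardware gather instruction.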