Re: autovectorization of outer loop

Alexey Salmin <alexey.salmin@xxxxxxxxx> · Mon, 15 May 2017 12:18:10 +0300

My guess is that ICC first splits the for-i loop into two nested
loops (strip-mining) and then interchanges the inner part of the for-i
loop with the for-k loop. That's probably just another name for
"transform the inner loop to operate on vectors".
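
Written out by hand, the split-then-interchange shape would look roughly
like the sketch below. The name cheby_eval_blocked and the block width
W = 4 (four doubles, i.e. one 32-byte AVX2 register) are my own
assumptions, not anything taken from the ICC output, and I have not
checked whether GCC 6.3 actually vectorizes the innermost j loop;
-fopt-info-vec would tell.

```cpp
// Hypothetical block width: 4 doubles fill one 32-byte AVX2 register.
constexpr int W = 4;

// Same computation as cheby_eval, with the i loop split into blocks of W
// and the per-block i loop moved inside the k loop.
void cheby_eval_blocked(const double *coeffs, int n,
                        const double *xs, double *ys, int m)
{
  int i0 = 0;
  for (; i0 + W <= m; i0 += W) {            // outer part of the i loop
    double u0[W] = {0}, u1[W] = {0}, u2[W] = {0};
    for (int k = n; k >= 0; k--) {          // original k loop
      for (int j = 0; j < W; j++) {         // inner part of the i loop
        u2[j] = u1[j];
        u1[j] = u0[j];
        u0[j] = 2 * xs[i0 + j] * u1[j] - u2[j] + coeffs[k];
      }
    }
    for (int j = 0; j < W; j++)
      ys[i0 + j] = 0.5 * (coeffs[0] + u0[j] - u2[j]);
  }
  // Scalar remainder for the last m % W points, same as the original body.
  for (int i = i0; i < m; i++) {
    double x = xs[i];
    double u0 = 0, u1 = 0, u2 = 0;
    for (int k = n; k >= 0; k--) {
      u2 = u1;
      u1 = u0;
      u0 = 2 * x * u1 - u2 + coeffs[k];
    }
    ys[i] = 0.5 * (coeffs[0] + u0 - u2);
  }
}
```

The state per block is only three W-element arrays, so this avoids the
memory cost of a full interchange, but it is already three nested loops
and not far from manual vectorization.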

Anyway, the question is whether GCC is able to accomplish the same result.

On Mon, May 15, 2017 at 12:01 PM, Jyotirmoy Bhattacharya
<jyotirmoy@xxxxxxxxxxxxx> wrote:
>
>
> On May 15, 2017 1:44 PM, "Alexey Salmin" <alexey.salmin@xxxxxxxxx> wrote:
>
> Most likely ICC was able to interchange inner and outer loops and then
> vectorize the inner one. GCC manages to do the vectorization if you
> interchange loops manually; unfortunately, it's neither a small change
> nor a memory-efficient one, unless you also split the inner loop into
> 32-byte chunks (that would be 3 nested loops and basically the same
> job as manual vectorization).
>
>
> If I'm reading the generated assembly correctly, ICC does not interchange the
> loops but rather transforms the inner loop to operate on vectors. As you
> point out, transforming the loops would not be a good idea here.
>
>
> On Wed, May 10, 2017 at 9:31 AM, Jyotirmoy Bhattacharya
> <jyotirmoy@xxxxxxxxxxxxx> wrote:
>> I have the following C++ code that evaluates a Chebyshev polynomial
>> using Clenshaw's algorithm:
>>
>> void cheby_eval(double *coeffs,int n,double *xs,double *ys,int m)
>> {
>>   #pragma omp simd
>>   for (int i=0;i<m;i++){
>>     double x = xs[i];
>>     double u0=0,u1=0,u2=0;
>>     for (int k=n;k>=0;k--){
>>       u2 = u1;
>>       u1 = u0;
>>       u0 = 2*x*u1-u2+coeffs[k];
>>     }
>>     ys[i] = 0.5*(coeffs[0]+u0-u2);
>>   }
>> }
>>
>> I'm hoping for an autovectorization of the outer loop so that the
>> inner loop operates on vectors.
>>
>> When compiled with
>>
>> g++ -march=haswell -O3 -fopt-info-vec-missed -S chebyshev.cc
>>
>> using g++ 6.3.0, no vectorization happens. I get the messages
>>
>> chebyshev.cc:11:17: note: not vectorized: control flow in loop.
>> chebyshev.cc:11:17: note: bad loop form.
>> chebyshev.cc:14:19: note: intermediate value used outside loop.
>> chebyshev.cc:14:19: note: Unknown def-use cycle pattern.
>> chebyshev.cc:14:19: note: reduction used in loop.
>> chebyshev.cc:14:19: note: Unknown def-use cycle pattern.
>> chebyshev.cc:14:19: note: Unsupported pattern.
>> chebyshev.cc:14:19: note: Unsupported pattern.
>> chebyshev.cc:14:19: note: not vectorized: unsupported use in stmt.
>> chebyshev.cc:14:19: note: unexpected pattern.
>> chebyshev.cc:11:17: note: not vectorized: not enough data-refs in basic block.
>> chebyshev.cc:21:1: note: not vectorized: not enough data-refs in basic block.
>> chebyshev.cc:14:19: note: not vectorized: not enough data-refs in basic block.
>> chebyshev.cc:14:19: note: not vectorized: not enough data-refs in basic block.
>> chebyshev.cc:14:19: note: not vectorized: not enough data-refs in basic block.
>> chebyshev.cc:11:17: note: not consecutive access _27 = *coeffs_20(D);
>> chebyshev.cc:11:17: note: not vectorized: no grouped stores in basic block.
>>
>> On the same code icc vectorizes the outer loop as expected.
>>
>> I was wondering if there are small ways in which I can change my code
>> to help gcc's autovectorizer to succeed. I would also appreciate any
>> pointers to documentation or gcc source that can help me better
>> understand how gcc's autovectorization of outer loops works.
>>
>> Regards,
>> Jyotirmoy Bhattacharya
>>
>> PS. The interesting part of icc's assembler output is
>>
>> ..B1.4:                         # Preds ..B1.8 ..B1.3
>>         xorl      %r15d, %r15d                                  #14.5
>>         xorl      %ebx, %ebx                                    #14.21
>>         testq     %rsi, %rsi                                    #14.21
>>         vmovupd   (%rdx,%r9,8), %ymm3                           #12.16
>>         vxorpd    %ymm5, %ymm5, %ymm5                           #13.14
>>         vmovdqa   %ymm1, %ymm4                                  #13.19
>>         vmovdqa   %ymm1, %ymm2                                  #13.24
>>         jl        ..B1.8        # Prob 2%                       #14.21
>>
>> ..B1.5:                         # Preds ..B1.4
>>         vaddpd    %ymm3, %ymm3, %ymm3                           #17.14
>>
>> ..B1.6:                         # Preds ..B1.6 ..B1.5
>>         vmovapd   %ymm4, %ymm2                                  #20.3
>>         incq      %r15                                          #14.5
>>         vmovapd   %ymm5, %ymm4                                  #20.3
>>         vfmsub213pd %ymm2, %ymm3, %ymm5                         #17.19
>>         vbroadcastsd (%r11,%rbx,8), %ymm6                       #17.22
>>         decq      %rbx
>>         vaddpd    %ymm5, %ymm6, %ymm5                           #17.22
>>         cmpq      %r10, %r15                                    #14.5
>>         jb        ..B1.6        # Prob 82%                      #14.5
>>
>> ..B1.8:                         # Preds ..B1.6 ..B1.4
>>         vbroadcastsd (%rdi), %ymm3                              #19.18
>>         vaddpd    %ymm3, %ymm5, %ymm4                           #19.28
>>         vsubpd    %ymm2, %ymm4, %ymm2                           #19.31
>>         vmulpd    %ymm2, %ymm0, %ymm5                           #19.31
>>         vmovupd   %ymm5, (%rcx,%r9,8)                           #19.5
>>         addq      $4, %r9                                       #11.3
>>         cmpq      %r8, %r9                                      #11.3
>>         jb        ..B1.4        # Prob 82%                      #11.3
>
>