Re: autovectorization of outer loop

Alexey Salmin <alexey.salmin@xxxxxxxxx> · Mon, 15 May 2017 11:14:23 +0300

Most likely ICC was able to interchange the inner and outer loops and
then vectorize the inner one. GCC manages to vectorize the code if you
interchange the loops manually. Unfortunately, that is neither a small
change nor a memory-efficient one (u0, u1 and u2 become arrays of
length m), unless you also split the inner loop into 32-byte chunks
(that would be 3 nested loops and basically the same job as manual
vectorization).
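
To make the "3 nested loops" idea concrete, here is a rough sketch of
what I mean. The name cheby_eval_chunked, the constant kChunk and the
const qualifiers are mine, not from the original code, and I have not
checked which GCC versions actually vectorize the innermost loop, so
treat it as an illustration rather than a tested recipe:

```cpp
#include <algorithm>

// Hypothetical chunk width: 4 doubles = 32 bytes, one AVX register.
constexpr int kChunk = 4;

// Sketch only: same arithmetic as cheby_eval, restructured into
// chunk loop / coefficient loop / lane loop.
void cheby_eval_chunked(const double *coeffs, int n,
                        const double *xs, double *ys, int m)
{
  for (int base = 0; base < m; base += kChunk) {
    // The last chunk may be shorter than kChunk.
    const int len = std::min(kChunk, m - base);

    // Per-chunk state instead of three arrays of length m.
    double u0[kChunk] = {0}, u1[kChunk] = {0}, u2[kChunk] = {0};

    for (int k = n; k >= 0; k--) {
      // Innermost loop runs over independent lanes, so it is the
      // candidate for vectorization.
      for (int j = 0; j < len; j++) {
        u2[j] = u1[j];
        u1[j] = u0[j];
        u0[j] = 2 * xs[base + j] * u1[j] - u2[j] + coeffs[k];
      }
    }

    for (int j = 0; j < len; j++)
      ys[base + j] = 0.5 * (coeffs[0] + u0[j] - u2[j]);
  }
}
```

The scratch state is three small fixed-size arrays per chunk, which is
where the memory saving over a plain interchange comes from.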

I'd expect GCC to interchange loops given the -floop-interchange
option but it doesn't seem to happen even in the simpler case:
for (int i=0;i<m;i++){
  for (int k=n;k>=0;k--){
    u2[i] = u1[i];
    u1[i] = u0[i];
    u0[i] = 2*xs[i]*u1[i]-u2[i]+coeffs[k];
  }
}
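
For comparison, the hand-interchanged form of the loop nest above
would look roughly like the sketch below. The function name and the
__restrict qualifiers (a compiler extension supported by GCC) are my
additions; I am assuming the arrays do not overlap, which the compiler
cannot know from the plain loop nest:

```cpp
// Sketch only: k is now the outer loop and i the inner one, so the
// inner loop walks contiguous memory with no loop-carried dependence.
// Assumes u0, u1, u2, xs and coeffs do not alias each other.
void cheby_step_interchanged(double *__restrict u0,
                             double *__restrict u1,
                             double *__restrict u2,
                             const double *__restrict xs,
                             const double *__restrict coeffs,
                             int n, int m)
{
  for (int k = n; k >= 0; k--) {
    for (int i = 0; i < m; i++) {
      u2[i] = u1[i];
      u1[i] = u0[i];
      u0[i] = 2 * xs[i] * u1[i] - u2[i] + coeffs[k];
    }
  }
}
```

Each i still sees exactly the same sequence of updates as before, only
the order in which different i are processed changes.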

Perhaps the -floop-interchange option only targets cache efficiency;
I'm not sure. This looks like a missed optimization opportunity, so
I'd file a report in the GCC Bugzilla.

On Wed, May 10, 2017 at 9:31 AM, Jyotirmoy Bhattacharya
<jyotirmoy@xxxxxxxxxxxxx> wrote:
> I have the following C++ code that evaluates a Chebyshev polynomial
> using Clenshaw's algorithm
>
> void cheby_eval(double *coeffs,int n,double *xs,double *ys,int m)
> {
>   #pragma omp simd
>   for (int i=0;i<m;i++){
>     double x = xs[i];
>     double u0=0,u1=0,u2=0;
>     for (int k=n;k>=0;k--){
>       u2 = u1;
>       u1 = u0;
>       u0 = 2*x*u1-u2+coeffs[k];
>     }
>     ys[i] = 0.5*(coeffs[0]+u0-u2);
>   }
> }
>
> I'm hoping for an autovectorization of the outer loop so that the
> inner loop operates on vectors.
>
> When compiled with
>
> g++ -march=haswell -O3 -fopt-info-vec-missed -S chebyshev.cc
>
> using g++ 6.3.0, no vectorization happens. I get the messages
>
> chebyshev.cc:11:17: note: not vectorized: control flow in loop.
> chebyshev.cc:11:17: note: bad loop form.
> chebyshev.cc:14:19: note: intermediate value used outside loop.
> chebyshev.cc:14:19: note: Unknown def-use cycle pattern.
> chebyshev.cc:14:19: note: reduction used in loop.
> chebyshev.cc:14:19: note: Unknown def-use cycle pattern.
> chebyshev.cc:14:19: note: Unsupported pattern.
> chebyshev.cc:14:19: note: Unsupported pattern.
> chebyshev.cc:14:19: note: not vectorized: unsupported use in stmt.
> chebyshev.cc:14:19: note: unexpected pattern.
> chebyshev.cc:11:17: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:21:1: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:14:19: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:14:19: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:14:19: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:11:17: note: not consecutive access _27 = *coeffs_20(D);
> chebyshev.cc:11:17: note: not vectorized: no grouped stores in basic block.
>
> On the same code icc vectorizes the outer loop as expected.
>
> I was wondering if there are small ways in which I can change my code
> to help gcc's autovectorizer to succeed. I would also appreciate any
> pointers to documentation or gcc source that can help me better
> understand how gcc's autovectorization of outer loops works.
>
> Regards,
> Jyotirmoy Bhattacharya
>
> PS. The interesting part of icc's assembler output is
>
> ..B1.4:                         # Preds ..B1.8 ..B1.3
>         xorl      %r15d, %r15d                                  #14.5
>         xorl      %ebx, %ebx                                    #14.21
>         testq     %rsi, %rsi                                    #14.21
>         vmovupd   (%rdx,%r9,8), %ymm3                           #12.16
>         vxorpd    %ymm5, %ymm5, %ymm5                           #13.14
>         vmovdqa   %ymm1, %ymm4                                  #13.19
>         vmovdqa   %ymm1, %ymm2                                  #13.24
>         jl        ..B1.8        # Prob 2%                       #14.21
>
> ..B1.5:                         # Preds ..B1.4
>         vaddpd    %ymm3, %ymm3, %ymm3                           #17.14
>
> ..B1.6:                         # Preds ..B1.6 ..B1.5
>         vmovapd   %ymm4, %ymm2                                  #20.3
>         incq      %r15                                          #14.5
>         vmovapd   %ymm5, %ymm4                                  #20.3
>         vfmsub213pd %ymm2, %ymm3, %ymm5                         #17.19
>         vbroadcastsd (%r11,%rbx,8), %ymm6                       #17.22
>         decq      %rbx
>         vaddpd    %ymm5, %ymm6, %ymm5                           #17.22
>         cmpq      %r10, %r15                                    #14.5
>         jb        ..B1.6        # Prob 82%                      #14.5
>
> ..B1.8:                         # Preds ..B1.6 ..B1.4
>         vbroadcastsd (%rdi), %ymm3                              #19.18
>         vaddpd    %ymm3, %ymm5, %ymm4                           #19.28
>         vsubpd    %ymm2, %ymm4, %ymm2                           #19.31
>         vmulpd    %ymm2, %ymm0, %ymm5                           #19.31
>         vmovupd   %ymm5, (%rcx,%r9,8)                           #19.5
>         addq      $4, %r9                                       #11.3
>         cmpq      %r8, %r9                                      #11.3
>         jb        ..B1.4        # Prob 82%                      #11.3