Most likely ICC was able to interchange the inner and outer loops and
then vectorize the inner one. GCC manages the vectorization if you
interchange the loops manually. Unfortunately that is neither a small
change nor a memory-efficient one (u0, u1 and u2 become arrays of
length m), unless you also split the inner loop into 32-byte chunks.
That would be 3 nested loops and basically the same job as manual
vectorization.

I'd expect GCC to interchange the loops given the -floop-interchange
option, but it doesn't seem to happen even in this simpler case:

for (int i=0;i<m;i++){
  for (int k=n;k>=0;k--){
    u2[i] = u1[i];
    u1[i] = u0[i];
    u0[i] = 2*xs[i]*u1[i]-u2[i]+coeffs[k];
  }
}

Perhaps the -floop-interchange option is only about cache efficiency;
I'm not sure. This looks like a missed optimization opportunity, so I'd
file a report in GCC Bugzilla.

On Wed, May 10, 2017 at 9:31 AM, Jyotirmoy Bhattacharya
<jyotirmoy@xxxxxxxxxxxxx> wrote:
> I have the following C++ code that evaluates a Chebyshev polynomial
> using Clenshaw's algorithm
>
> void cheby_eval(double *coeffs,int n,double *xs,double *ys,int m)
> {
>   #pragma omp simd
>   for (int i=0;i<m;i++){
>     double x = xs[i];
>     double u0=0,u1=0,u2=0;
>     for (int k=n;k>=0;k--){
>       u2 = u1;
>       u1 = u0;
>       u0 = 2*x*u1-u2+coeffs[k];
>     }
>     ys[i] = 0.5*(coeffs[0]+u0-u2);
>   }
> }
>
> I'm hoping for an autovectorization of the outer loop so that the
> inner loop operates on vectors.
>
> When compiled with
>
> g++ -march=haswell -O3 -fopt-info-vec-missed -S chebyshev.cc
>
> using g++ 6.3.0, no vectorization happens. I get the messages
>
> chebyshev.cc:11:17: note: not vectorized: control flow in loop.
> chebyshev.cc:11:17: note: bad loop form.
> chebyshev.cc:14:19: note: intermediate value used outside loop.
> chebyshev.cc:14:19: note: Unknown def-use cycle pattern.
> chebyshev.cc:14:19: note: reduction used in loop.
> chebyshev.cc:14:19: note: Unknown def-use cycle pattern.
> chebyshev.cc:14:19: note: Unsupported pattern.
> chebyshev.cc:14:19: note: Unsupported pattern.
> chebyshev.cc:14:19: note: not vectorized: unsupported use in stmt.
> chebyshev.cc:14:19: note: unexpected pattern.
> chebyshev.cc:11:17: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:21:1: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:14:19: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:14:19: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:14:19: note: not vectorized: not enough data-refs in basic block.
> chebyshev.cc:11:17: note: not consecutive access _27 = *coeffs_20(D);
> chebyshev.cc:11:17: note: not vectorized: no grouped stores in basic block.
>
> On the same code icc vectorizes the outer loop as expected.
>
> I was wondering if there are small ways in which I can change my code
> to help gcc's autovectorizer to succeed. I would also appreciate any
> pointers to documentation or gcc source that can help me better
> understand how gcc's autovectorization of outer loops works.
>
> Regards,
> Jyotirmoy Bhattacharya
>
> PS.
> The interesting part of icc's assembler output is
>
> ..B1.4:                         # Preds ..B1.8 ..B1.3
>         xorl      %r15d, %r15d                  #14.5
>         xorl      %ebx, %ebx                    #14.21
>         testq     %rsi, %rsi                    #14.21
>         vmovupd   (%rdx,%r9,8), %ymm3           #12.16
>         vxorpd    %ymm5, %ymm5, %ymm5           #13.14
>         vmovdqa   %ymm1, %ymm4                  #13.19
>         vmovdqa   %ymm1, %ymm2                  #13.24
>         jl        ..B1.8        # Prob 2%       #14.21
>
> ..B1.5:                         # Preds ..B1.4
>         vaddpd    %ymm3, %ymm3, %ymm3           #17.14
>
> ..B1.6:                         # Preds ..B1.6 ..B1.5
>         vmovapd   %ymm4, %ymm2                  #20.3
>         incq      %r15                          #14.5
>         vmovapd   %ymm5, %ymm4                  #20.3
>         vfmsub213pd %ymm2, %ymm3, %ymm5         #17.19
>         vbroadcastsd (%r11,%rbx,8), %ymm6       #17.22
>         decq      %rbx
>         vaddpd    %ymm5, %ymm6, %ymm5           #17.22
>         cmpq      %r10, %r15                    #14.5
>         jb        ..B1.6        # Prob 82%      #14.5
>
> ..B1.8:                         # Preds ..B1.6 ..B1.4
>         vbroadcastsd (%rdi), %ymm3              #19.18
>         vaddpd    %ymm3, %ymm5, %ymm4           #19.28
>         vsubpd    %ymm2, %ymm4, %ymm2           #19.31
>         vmulpd    %ymm2, %ymm0, %ymm5           #19.31
>         vmovupd   %ymm5, (%rcx,%r9,8)           #19.5
>         addq      $4, %r9                       #11.3
>         cmpq      %r8, %r9                      #11.3
>         jb        ..B1.4        # Prob 82%      #11.3
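
For reference, the "3 nested loops" variant mentioned in the reply at the
top (loops interchanged, with the i range split into 32-byte chunks)
might look roughly like the sketch below. It is only an illustration
under assumptions, not a tested recipe: the names cheby_eval_scalar,
cheby_eval_chunked and CHUNK are made up here, CHUNK = 4 assumes one
256-bit AVX register of doubles, and whether a given GCC version actually
vectorizes the innermost loop still has to be checked with
-fopt-info-vec.

```cpp
#include <algorithm>
#include <cassert>

// Scalar reference: the routine from the original post, without the pragma.
void cheby_eval_scalar(const double *coeffs, int n, const double *xs,
                       double *ys, int m)
{
    for (int i = 0; i < m; i++) {
        double x = xs[i];
        double u0 = 0, u1 = 0, u2 = 0;
        for (int k = n; k >= 0; k--) {
            u2 = u1;
            u1 = u0;
            u0 = 2*x*u1 - u2 + coeffs[k];
        }
        ys[i] = 0.5*(coeffs[0] + u0 - u2);
    }
}

// Assumption: 4 doubles = 32 bytes, i.e. one AVX register.
constexpr int CHUNK = 4;

// Hypothetical chunked variant: the outer loop walks over chunks of i,
// the middle loop is the Clenshaw recurrence over k, and the innermost
// loop runs over the lanes of one chunk with a fixed trip count.
void cheby_eval_chunked(const double *coeffs, int n, const double *xs,
                        double *ys, int m)
{
    assert(n >= 0 && m >= 0);  // coeffs[0] is read unconditionally
    for (int base = 0; base < m; base += CHUNK) {
        const int len = std::min(CHUNK, m - base);  // last chunk may be short
        double u0[CHUNK] = {0}, u1[CHUNK] = {0}, u2[CHUNK] = {0};
        double x2[CHUNK] = {0};  // 2*x per lane; unused lanes stay 0
        for (int j = 0; j < len; j++)
            x2[j] = 2*xs[base + j];

        for (int k = n; k >= 0; k--) {
            const double c = coeffs[k];
            // Same statements as the scalar recurrence, one per lane.
            for (int j = 0; j < CHUNK; j++) {
                u2[j] = u1[j];
                u1[j] = u0[j];
                u0[j] = x2[j]*u1[j] - u2[j] + c;
            }
        }

        for (int j = 0; j < len; j++)
            ys[base + j] = 0.5*(coeffs[0] + u0[j] - u2[j]);
    }
}
```

The temporaries here are a few CHUNK-sized arrays rather than three
arrays of length m, which is the memory saving the chunking is meant to
buy. Results may differ from the scalar version in the last bits if the
compiler contracts the multiply and subtract into an FMA in one version
but not the other.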