gcc fails to vectorise the product of a complex array

Raphael C <drraph@xxxxxxxxx> · Sun, 15 Jan 2017 22:50:08 +0000

Consider this simple piece of code which takes the product of an array
of complex numbers.

#include <complex.h>
complex float f(complex float x[]) {
  complex float p = 1.0;
  for (int i = 0; i < 32; i++)
    p *= x[i];
  return p;
}

If I compile it with gcc using -O3 -march=bdver2 -ffast-math, I get

f:
        vmovss  xmm2, DWORD PTR .LC1[rip]
        vxorps  xmm1, xmm1, xmm1
        lea     rax, [rdi+256]
.L2:
        vmovss  xmm0, DWORD PTR [rdi+4]
        add     rdi, 8
        vmulss  xmm3, xmm0, xmm2
        vmulss  xmm0, xmm0, xmm1
        vfmadd132ss     xmm1, xmm3, DWORD PTR [rdi-8]
        vfmsub132ss     xmm2, xmm0, DWORD PTR [rdi-8]
        cmp     rax, rdi
        jne     .L2
        vmovss  DWORD PTR [rsp-8], xmm2
        vmovss  DWORD PTR [rsp-4], xmm1
        vmovq   xmm0, QWORD PTR [rsp-8]
        ret
.LC1:
        .long   1065353216

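That is unvectorised assembly: every arithmetic instruction is a scalar
one (the "ss" suffix) and the loop handles one complex element per
iteration.  This is with a gcc 7 snapshot, but earlier versions give
similar results.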
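
To make the scalar listing easier to follow, here is a minimal sketch of
the same computation with the complex multiply written out on separate
real and imaginary accumulators.  The helper name f_scalar and the local
variable names are made up for illustration; they are not part of the
code I compiled.  The two products and the fused multiply-add/subtract
pair in the loop above correspond to the two update lines below, and each
iteration needs the result of the previous one.

```c
#include <complex.h>

/* Hypothetical helper: same product as f(), with the complex multiply
   (a+bi)(c+di) = (ac - bd) + (ad + bc)i spelled out by hand. */
complex float f_scalar(const complex float x[]) {
  float re = 1.0f, im = 0.0f;            /* running product p = 1 + 0i */
  for (int i = 0; i < 32; i++) {
    float xr = crealf(x[i]);
    float xi = cimagf(x[i]);
    float new_re = re * xr - im * xi;    /* real part of p * x[i] */
    float new_im = re * xi + im * xr;    /* imaginary part of p * x[i] */
    re = new_re;
    im = new_im;
  }
  return re + im * I;
}
```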

However, if I do the same thing with float instead of complex float, I get:

f(float*):
        vmovups xmm2, XMMWORD PTR [rdi]
        vmulps  xmm0, xmm2, XMMWORD PTR [rdi+16]
        vmulps  xmm0, xmm0, XMMWORD PTR [rdi+32]
        vmulps  xmm0, xmm0, XMMWORD PTR [rdi+48]
        vmulps  xmm0, xmm0, XMMWORD PTR [rdi+64]
        vmulps  xmm0, xmm0, XMMWORD PTR [rdi+80]
        vmulps  xmm0, xmm0, XMMWORD PTR [rdi+96]
        vmulps  xmm0, xmm0, XMMWORD PTR [rdi+112]
        vpsrldq xmm1, xmm0, 8
        vmulps  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vmulps  xmm0, xmm0, xmm1
        ret

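This is now vectorised code: each vmulps multiplies four floats at once,
and the last two shift/multiply pairs reduce the four lanes to one value.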
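
For reference, a minimal sketch of the float variant, assuming it differs
from the complex code only in the element type.  The name f_float is
chosen here for illustration so that it does not clash with f above.

```c
/* Assumed shape of the float variant; f_float is a hypothetical name. */
float f_float(const float x[]) {
  float p = 1.0f;
  for (int i = 0; i < 32; i++)
    p *= x[i];                 /* plain real product over 32 elements */
  return p;
}
```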

Is there any way to persuade gcc to vectorise the complex version?  My
ultimate goal is to get efficient AVX code for this function.

As a test, I also tried icc (the Intel compiler), which does appear to
give vectorised code, so it is at least possible in principle.

Raphael