"float complex" arithmetic performance much slower than expected

Michele Martone <michele.martone@xxxxxxxxxx> · Wed, 6 Mar 2013 17:37:49 +0100

Hi,

I have four functions written in C99: FD, FS, FC and FZ.

All implement the same algorithm.

This algorithm reads a handful of integer and floating point numeric
arrays, and updates one floating point numeric array with the results
of the computation.
For FD this floating point array is of type "double"; for FZ it is
"double complex"; FS uses "float"; FC uses "float complex".
Apart from the types of the floating point arrays (and of some scalar
arguments), the rest of the code is identical.
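
To make the setup concrete, here is a minimal sketch of such a family of
routines. The kernel body, the argument names (idx, a, x, alpha, y), the
DEFINE_KERNEL macro and the kernels_agree() helper are all hypothetical and
only mirror the shape described above (integer and floating point arrays read,
one floating point array updated); this is not the actual algorithm.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical kernel, NOT the real algorithm: y[idx[i]] += alpha*a[i]*x[i].
 * One macro body is instantiated for each of the four numeric types, so the
 * four functions differ only in the type T. */
#define DEFINE_KERNEL(NAME, T)                                          \
    void NAME(size_t n, const int *restrict idx, const T *restrict a,   \
              const T *restrict x, T alpha, T *restrict y)              \
    {                                                                   \
        for (size_t i = 0; i < n; ++i) {                                \
            assert(idx[i] >= 0);                                        \
            y[idx[i]] += alpha * a[i] * x[i];                           \
        }                                                               \
    }

DEFINE_KERNEL(FS, float)
DEFINE_KERNEL(FD, double)
DEFINE_KERNEL(FC, float complex)
DEFINE_KERNEL(FZ, double complex)

/* Runs all four variants on the same small real-valued input and returns 1
 * if they agree. The values are small integers, so "float" is exact here. */
int kernels_agree(void)
{
    const int idx[3] = {0, 2, 0};
    const float as[3] = {1, 2, 3}, xs[3] = {4, 5, 6};
    const double ad[3] = {1, 2, 3}, xd[3] = {4, 5, 6};
    const float complex ac[3] = {1, 2, 3}, xc[3] = {4, 5, 6};
    const double complex az[3] = {1, 2, 3}, xz[3] = {4, 5, 6};
    float ys[3] = {0};
    double yd[3] = {0};
    float complex yc[3] = {0};
    double complex yz[3] = {0};

    FS(3, idx, as, xs, 2.0f, ys);
    FD(3, idx, ad, xd, 2.0, yd);
    FC(3, idx, ac, xc, 2.0f, yc);
    FZ(3, idx, az, xz, 2.0, yz);

    for (int k = 0; k < 3; ++k)
        if (ys[k] != yd[k] || yc[k] != yd[k] || yz[k] != yd[k])
            return 0;
    /* y[0] = 2*1*4 + 2*3*6, y[1] untouched, y[2] = 2*2*5 */
    return yd[0] == 44.0 && yd[1] == 0.0 && yd[2] == 20.0;
}
```

Generating the four variants from one macro body is just one way to keep
"the rest of the code identical"; the real sources may be organised
differently.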

Assume the same input is provided to FD/FS/FC/FZ, except of course for
the type of the numerical arrays.
I compile two versions of the code: one with gcc-4.7.2, and the other
with Intel's icc 13.1.

Now, on the same input:
 FD/gcc takes ~ the same      time as FD/icc
 FS/gcc takes ~ the same      time as FS/icc
 FZ/gcc takes ~ twice the     time of FZ/icc
 FC/gcc takes ~ six times the time of FC/icc
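
A minimal sketch of how timings like these could be taken and compared,
assuming the standard clock() function and a best-of-N policy. The names
bench_fn, best_time_seconds, time_ratio and sample_work are hypothetical and
not part of the benchmarked code; the gcc and icc numbers would come from two
separately compiled binaries.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature of a benchmarked routine; ctx carries its arguments. */
typedef void (*bench_fn)(void *ctx);

/* Returns the smallest CPU time, in seconds, over reps runs of fn(ctx). */
double best_time_seconds(bench_fn fn, void *ctx, int reps)
{
    assert(fn != NULL && reps > 0);
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        clock_t t0 = clock();
        fn(ctx);
        clock_t t1 = clock();
        double dt = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Ratio behind a statement such as "A takes ~ twice the time of B". */
double time_ratio(double t_a, double t_b)
{
    assert(t_b > 0.0);
    return t_a / t_b;
}

/* Hypothetical stand-in workload: sums the integers below *ctx. */
void sample_work(void *ctx)
{
    volatile unsigned long sink = 0;
    unsigned long n = *(unsigned long *)ctx;
    for (unsigned long i = 0; i < n; ++i)
        sink += i;
}
```

Taking the minimum over several runs is one common way to reduce noise from
other activity on the machine; it is a choice made for this sketch.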

In other words, my experiments suggest that my "double complex" (or
"double _Complex") code is considerably slower when compiled with gcc,
and the "float complex" code is slower still.

Some additional details:

 . Executions here are 'single threaded'
 . performed on an Intel Sandy Bridge CPU
 . FD, FZ, FS and FC share the same source file
 . CFLAGS for gcc:
"-O3 -pipe -march=native -mtune=native -mavx -std=c99 -fno-unroll-loops"
 . CFLAGS for icc: "-O3 -xAVX -restrict -unroll=0"
 . together, these functions are some 160 lines long (so, short)
 . I'm using loop unrolling in the code
 . argument arrays are specified as e.g.: "double complex * restrict x"
 . if I were to run ~3-4 instances of any of the above routines in parallel,
   the memory bandwidth of the CPU would be saturated.
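
As an illustration of the unrolling and "restrict" points above, here is a
minimal sketch, assuming that "loop unrolling in the code" means unrolling by
hand (the compiler flags above switch automatic unrolling off). The function
zaxpy_unrolled2 and its update y[i] += alpha * x[i] are hypothetical and are
not taken from the actual routines.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical example: y[i] += alpha * x[i], unrolled by two by hand.
 * "restrict" promises the compiler that x and y do not overlap. */
void zaxpy_unrolled2(size_t n, double complex alpha,
                     const double complex *restrict x,
                     double complex *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;
    /* Main loop: two elements per iteration. */
    for (; i + 1 < n; i += 2) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
    }
    /* Remainder: at most one element is left when n is odd. */
    if (i < n)
        y[i] += alpha * x[i];
}
```

The same body with "float complex" in place of "double complex" would give the
FC-style variant of this sketch.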

Now, one may argue about the "optimality" of my implementation of the
four routines above. For reference, I also benchmarked an
implementation of the same algorithm from Intel's MKL library, which
one may assume to be "highly optimized".

So, for the icc-compiled code:
 FD/icc and FZ/icc are ~20% slower than the MKL counterpart
 FS/icc and FC/icc are ~35% slower than the MKL counterpart

But for the gcc-compiled code:
 FD/gcc is ~20% slower than the MKL counterpart
 FS/gcc is ~35% slower than the MKL counterpart
 FZ/gcc is ~60% slower than the MKL counterpart (!)
 FC/gcc is ~90% slower than the MKL counterpart (!!)

So it seems that the "float complex" code is much slower when compiled
with gcc than with icc, while the real types ("float", "double") show
no such gap.

Is this consistent with your experience with "complex" and gcc, or
might I be ignoring some basic rule of using gcc?