Re: "float complex" arithmetic performance much slower than expected

On 3/6/2013 11:37 AM, Michele Martone wrote:
Hi,

I have four functions written in C99: FD,FS,FC,FZ.

All implement the same algorithm.

This algorithm reads a handful of integer and floating point numeric
arrays, and updates one floating point numeric array with the results
of the computation.
In FD the floating point array is of type "double"; in FZ it is
"double complex"; FS uses "float"; FC uses "float complex".
Apart from the floating point arrays (and some scalar arguments), the
rest of the code is identical.

Assume FD/FS/FC/FZ are all given the same input, except for the type of
the numerical arrays of course.
I compile two versions of the code: one with gcc-4.7.2, and the other
with Intel's icc 13.1.

Now, on the same input:
  FD/gcc takes ~ the same      time as FD/icc
  FS/gcc takes ~ the same      time as FS/icc
  FZ/gcc takes ~ twice the     time of FZ/icc
  FC/gcc takes ~ six times the time of FC/icc

In other words, my experiments suggest that my "double complex" (or
"double _Complex") code is considerably slower when compiled with gcc,
and the "float complex" implementation is slower still.

Some additional details:

  . Executions here are 'single threaded'
  . performed on an Intel Sandy Bridge CPU
  . The {FD,FZ,FS,FC} share the same source file
  . CFLAGS for gcc:
"-O3 -pipe -march=native -mtune=native -mavx -std=c99 -fno-unroll-loops"
  . CFLAGS for icc: "-O3 -xAVX -restrict -unroll=0"
  . together, these functions are some 160 lines long (so, short)
  . the code itself uses (manual) loop unrolling
  . argument arrays are specified as e.g.: "double complex * restrict x"
  . if I were to run ~3-4 instances of any of the above routines in
    parallel, the memory bandwidth of the CPU would be saturated.
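
For concreteness, a routine of the kind described above might look like
the following sketch. The name fc_axpy and the y += a*x update are
placeholders chosen for illustration, not the actual algorithm; only the
"float complex * restrict" arguments and the manual unrolling mirror the
description.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical stand-in for the FC routine: y[i] += a * x[i].
   The real algorithm is not shown in this thread. */
void fc_axpy(size_t n, float complex a,
             const float complex *restrict x,
             float complex *restrict y)
{
    size_t i = 0;

    /* "restrict" promises the compiler that x and y do not overlap. */
    assert((const void *)x != (const void *)y);

    /* Manual 2x unrolling; compiler unrolling is switched off by
       -fno-unroll-loops (gcc) and -unroll=0 (icc). */
    for (; i + 1 < n; i += 2) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
    /* Remainder iteration when n is odd. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

The FD, FS and FZ variants would differ only in the element type.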

Now, one may argue about the "optimality" of my implementation of the
four routines above. With this in mind, I also benchmarked an
implementation of the same algorithm from Intel's MKL library.
One may assume that MKL is "highly optimized".

So, for the icc-compiled code:
  FD/icc and FZ/icc are ~20% slower than the MKL counterpart
  FS/icc and FC/icc are ~35% slower than the MKL counterpart

But the gcc-compiled one:
  FD/gcc is ~20% slower than the MKL counterpart
  FS/gcc is ~35% slower than the MKL counterpart
  FZ/gcc is ~60% slower than the MKL counterpart (!)
  FC/gcc is ~90% slower than the MKL counterpart (!!)

So it seems that the "float complex" code is much slower when compiled
with gcc than with icc, while this is not so for the other types.


Is this consistent with your experience of "complex" and gcc, or might
I be overlooking some basic rule for using gcc?

In the absence of -fcx-limited-range, gcc may protect complex divide and sqrt by calling library functions, where icc would simply widen to double. Any such library calls would show up if you profiled with gprof, at least when the library is statically linked. Also, the library functions used by gcc aren't vectorized, while icc would go further toward promoting vectorization by inlining code or calling vector math functions. Vectorization reports from both compilers would shed light on this question.
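
As a rough illustration of the -fcx-limited-range point, assuming the hot loop is a complex multiply-add (the actual routines are not shown in this thread, and fc_axpy_limited is a made-up name), the loop below spells out the textbook product by hand. This is roughly what gcc is allowed to generate under -fcx-limited-range: no check for a NaN result and no fallback call into the runtime library, at the cost of the C99 Annex G handling of infinities and NaNs.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] += a * x[i], with the complex product
   expanded into real arithmetic only. */
void fc_axpy_limited(size_t n, float complex a,
                     const float complex *restrict x,
                     float complex *restrict y)
{
    const float ar = crealf(a), ai = cimagf(a);

    /* C99 6.2.5: a complex type has the representation of an array of
       two elements of the corresponding real type (real part first). */
    const float *xf = (const float *)x;
    float *yf = (float *)y;

    /* "restrict" promises the compiler that x and y do not overlap. */
    assert((const void *)x != (const void *)y);

    for (size_t i = 0; i < n; ++i) {
        const float xr = xf[2 * i], xi = xf[2 * i + 1];
        yf[2 * i]     += ar * xr - ai * xi;   /* real part      */
        yf[2 * i + 1] += ar * xi + ai * xr;   /* imaginary part */
    }
}
```

Timing this loop against the plain "y[i] += a * x[i]" form, and reading the output of -ftree-vectorizer-verbose=2 (gcc 4.7) or -vec-report2 (icc 13), should indicate whether the library fallback or a missed vectorization is the culprit.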

--
Tim Prince


