Hi,

I have four functions written in C99: FD, FS, FC, FZ. All implement the same algorithm. The algorithm reads a handful of integer and floating-point numeric arrays and updates one floating-point numeric array with the results of the computation. In FD that array is of type "double"; in FZ, "double complex"; in FS, "float"; in FC, "float complex". Apart from the floating-point arrays (and some scalar arguments), the rest of the code is identical. Assume the same input is given to FD/FS/FC/FZ, except of course for the type of the numerical arrays.

I compile two versions of the code: one with gcc-4.7.2, the other with Intel's icc 13.1. Now, on the same input:

FD/gcc takes ~ the same time as FD/icc
FS/gcc takes ~ the same time as FS/icc
FZ/gcc takes ~ twice the time of FZ/icc
FC/gcc takes ~ six times the time of FC/icc

In other words, my experiments suggest that my "double complex" (or "double _Complex") code is considerably slower when compiled with gcc, and the "float complex" implementation is slower still.

Some additional details:
 . executions here are single-threaded
 . performed on an Intel Sandy Bridge CPU
 . FD, FZ, FS and FC share the same source file
 . CFLAGS for gcc: "-O3 -pipe -march=native -mtune=native -mavx -std=c99 -fno-unroll-loops"
 . CFLAGS for icc: "-O3 -xAVX -restrict -unroll=0"
 . together, these functions are some 160 lines long (so, short)
 . I'm using loop unrolling in the code
 . argument arrays are specified as e.g.: "double complex * restrict x"
 . if I were to run ~3-4 instances of any of the above routines in parallel, the memory bandwidth of the CPU would be saturated

Now, one may argue about the "optimality" of my implementation of the four routines above. For this reason I also benchmarked an implementation of the same algorithm from Intel's MKL library.
One may assume that MKL is "highly optimized". Taking MKL as the reference, the icc-compiled code:

FD/icc and FZ/icc are ~20% slower than the MKL counterpart
FS/icc and FC/icc are ~35% slower than the MKL counterpart

But the gcc-compiled code:

FD/gcc is ~20% slower than the MKL counterpart
FS/gcc is ~35% slower than the MKL counterpart
FZ/gcc is ~60% slower than the MKL counterpart (!)
FC/gcc is ~90% slower than the MKL counterpart (!!)

So it seems that the complex code, "float complex" in particular, is much slower with gcc than with icc, while this is not so for the real (non-complex) types.

Is this consistent with your experience of "complex" and gcc, or might I be ignoring some basic rule in using gcc?
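
For concreteness, here is a minimal sketch of how four such functions can share one source file. It is not the actual code: the kernel shape (an indexed multiply-accumulate), the DEFINE_KERNEL macro, and every parameter name are assumptions made up for illustration. Only the names FD/FS/FC/FZ and the restrict-qualified array arguments come from the description above.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical stand-in kernel (NOT the real algorithm):
 *   y[row[k]] += alpha * val[k] * x[col[k]]   for k in [0, nnz)
 * It reads integer index arrays and floating-point arrays and updates
 * one floating-point array, which is the shape described in the post. */
#define DEFINE_KERNEL(NAME, T)                                      \
    void NAME(size_t nnz, T alpha,                                  \
              const int *restrict row, const int *restrict col,     \
              const T *restrict val, const T *restrict x,           \
              T *restrict y)                                        \
    {                                                               \
        assert(row && col && val && x && y);                        \
        for (size_t k = 0; k < nnz; ++k)                            \
            y[row[k]] += alpha * val[k] * x[col[k]];                \
    }

/* The four variants differ only in the numeric type T. */
DEFINE_KERNEL(FS, float)
DEFINE_KERNEL(FD, double)
DEFINE_KERNEL(FC, float complex)
DEFINE_KERNEL(FZ, double complex)
```

With a sketch like this, one experiment would be to build the complex variants with and without gcc's -fcx-limited-range option, which relaxes the NaN/infinity handling that C99 Annex G specifies for complex multiplication. Whether that handling accounts for any of the gap reported above is a guess to be tested, not a measured result.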