On 20130306@20:49, Alexander Monakov wrote:
>
>
> On Wed, 6 Mar 2013, Michele Martone wrote:
>
> > . CFLAGS for gcc:
> >   "-O3 -pipe -march=native -mtune=native -mavx -std=c99 -fno-unroll-loops"
> > . CFLAGS for icc: "-O3 -xAVX -restrict -unroll=0"
>
> This makes a comparison "unfair" since the two compilers use different
> optimization restrictions for floating-point operations by default (GCC is
> conservative, and thus more restricted in optimizations).  See the
> documentation for -ffast-math GCC option, and floating-point flags in the ICC
> help (e.g. options -fp-model and -mp).

Alexander, Tim.

I did some experiments following your suggestions.
First, adding -ffast-math to gcc's CFLAGS.
Then, adding -complex-limited to icc's CFLAGS.

I should say that my functions perform only integer and floating-point
array accesses, plus sums and products on the floating-point values.

The table below gives the percentage gap relative to the corresponding MKL
routine, i.e. how much "performance" (inversely proportional to time) is
missing with respect to MKL in each case.

 OG = the original gcc CFLAGS above (former results)
 OI = the original icc flags above (former results)
 FM = gcc CFLAGS with added -ffast-math (based on Alexander's suggestion)
 CL = icc CFLAGS with added -complex-limited

        [GCC]      [ICC]
       FM   OG    OI   CL
 FD:    5   20    20   24
 FS:    5   35    35   37
 FZ:   26   60    20   22
 FC:   54   90    35   37

Interpreting these results:

1) Adding -ffast-math to gcc's CFLAGS (above, from OG to FM) leads to a
   dramatic speedup:
    of  ~19% (from 80% of MKL speed to 95% of that) for FD
    of  ~46% (from 65% of MKL speed to 95% of that) for FS
    of  ~85% (from 40% of MKL speed to 74% of that) for FZ
    of  360% (from 10% of MKL speed to 46% of that) for FC
   And here, the kick to "float complex" (FC) is striking.

2) FM (gcc -ffast-math) outperforms CL (icc -O3 -complex-limited) on FD and
   FS, while on FZ and FC icc is still ahead.
   Could this mean that icc is still being treated unfairly?
   Should I perhaps "downgrade" gcc's flags ?!

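As a cross-check of the arithmetic in 1), here is a small C sketch that turns two "gap to MKL" percentages into a relative speedup. The function names are made up for illustration; only the numbers come from the table above.

```c
#include <stdio.h>

/* Relative speedup, in percent, obtained when the gap to MKL shrinks from
 * gap_before to gap_after (both given in percent of MKL performance). */
double speedup_percent(double gap_before, double gap_after)
{
    double perf_before = 100.0 - gap_before; /* a gap of 20 means 80% of MKL speed */
    double perf_after  = 100.0 - gap_after;
    return (perf_after / perf_before - 1.0) * 100.0;
}

/* Print the OG -> FM speedup for each row of the table. */
void print_gcc_speedups(void)
{
    const char *name[] = { "FD", "FS", "FZ", "FC" };
    const double og[]  = { 20.0, 35.0, 60.0, 90.0 }; /* gcc, original flags */
    const double fm[]  = {  5.0,  5.0, 26.0, 54.0 }; /* gcc, with -ffast-math */

    for (int i = 0; i < 4; ++i)
        printf("%s: %.0f%%\n", name[i], speedup_percent(og[i], fm[i]));
}
```

Calling print_gcc_speedups() from a main() should reproduce the figures quoted in 1), up to rounding.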
My initial goal was twofold:
 - compare my code built with icc to the same code built with gcc
 - compare my code to MKL (but in multithreaded mode, so it's off topic here)
In both cases I want a "reasonable" level of optimization, not an extreme
one.  Or even better a "comparable" level, especially when it comes to
"my code vs MKL".

I looked up -mp1 in man icc; as I understand it, it concerns comparisons and
transcendentals, so it does not apply here.
man icc also mentions that '-fp-model=fast=1' (rather than
-fp-model=precise) is the default.  '-fp-model=fast=2' is available (but
little documented); maybe that would push icc towards similarity with
-ffast-math.

Uhm.. I guess a compromise lies somewhere in between gcc's -O3 and
-O3 -ffast-math ?!
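One candidate for such a middle ground, offered as an assumption and not measured in this thread: -ffast-math is a bundle of narrower GCC options, one of which is -fcx-limited-range. Per the GCC documentation, that option drops the check for a NaN+NaN*i result (and the attempt to rescue it) in complex multiplication and division. The C sketch below contrasts the two forms of a complex product. The function names and the two-float array interface are illustrative only and are not taken from the code being benchmarked.

```c
#include <complex.h>
#include <string.h>

/* Product in the plain form that -fcx-limited-range permits:
 * four multiplications and two additions, no special cases. */
void cmul_limited(const float a[2], const float b[2], float out[2])
{
    out[0] = a[0] * b[0] - a[1] * b[1]; /* real part */
    out[1] = a[0] * b[1] + a[1] * b[0]; /* imaginary part */
}

/* Product through the built-in operator.  With default flags GCC follows
 * the C99 rules here, so a NaN+NaN*i result triggers extra recovery work. */
void cmul_default(const float a[2], const float b[2], float out[2])
{
    float complex x, y, z;

    /* C99 6.2.5: a complex type is laid out like a two-element array. */
    memcpy(&x, a, sizeof x);
    memcpy(&y, b, sizeof y);
    z = x * y;
    memcpy(out, &z, sizeof z);
}
```

For finite inputs both functions give the same product. They are expected to differ only in corner cases such as (inf + inf*i) * (1 + 0*i), where the plain form yields NaN in both parts. If those corner cases do not matter for the kernels, trying "-O3 -fcx-limited-range" without the rest of -ffast-math might be one way to probe where the FZ and FC gains come from.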