Re: "float complex" arithmetic performance much slower than expected

Michele Martone <michele.martone@xxxxxxxxxx> · Wed, 6 Mar 2013 22:13:36 +0100

On 20130306@20:49, Alexander Monakov wrote:
> 
> 
> On Wed, 6 Mar 2013, Michele Martone wrote:
> 
> 
> >  . CFLAGS for gcc:
> > "-O3 -pipe -march=native -mtune=native -mavx -std=c99 -fno-unroll-loops"
> >  . CFLAGS for icc: "-O3 -xAVX -restrict -unroll=0"
> 
> This makes a comparison "unfair" since the two compilers use different
> optimization restrictions for floating-point operations by default (GCC is
> conservative, and thus more restricted in optimizations).  See the
> documentation for -ffast-math GCC option, and floating-point flags in the ICC
> help (e.g. options -fp-model and -mp).

Alexander, Tim.

I ran some experiments following your suggestions:
first adding -ffast-math to gcc's CFLAGS,
then adding -complex-limited to icc's CFLAGS.

I should note that my functions perform only integer and floating-point
array accesses and, on the floating-point numbers, only additions and
multiplications.
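
To make the shape of such code concrete, here is a minimal sketch of a
kernel of that kind. It is NOT the actual code being benchmarked: the
CSR-style layout and the names spmv_csr_c, rowptr, colidx and val are
hypothetical, chosen only to show indexed array accesses combined with
complex sums and products.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical example, not the benchmarked routine:
 * y += A*x with A stored in a CSR-like layout.
 * Only integer index loads, floating-point loads,
 * and complex multiply/add operations are involved. */
void spmv_csr_c(size_t nrows, const int *rowptr, const int *colidx,
                const float complex *val, const float complex *x,
                float complex *y)
{
    for (size_t i = 0; i < nrows; ++i) {
        float complex acc = 0.0f;
        /* nonzeros of row i live in val[rowptr[i] .. rowptr[i+1]-1] */
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            acc += val[k] * x[colidx[k]];   /* complex product and sum */
        y[i] += acc;
    }
}
```

The statement to watch is the complex product in the inner loop; how a
compiler expands it depends on its floating-point flags.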

The table below gives the percentage gap relative to the corresponding
MKL routine, i.e. how much "performance" (inversely proportional to time)
each case is missing with respect to MKL. Lower is better.

OG=the original gcc's CFLAGS above (former results)
OI=the original icc's flags above  (former results)
FM=gcc CFLAGS with added -ffast-math (based on Alexander's suggestion)
CL=icc CFLAGS with added -complex-limited

     [GCC] [ICC]
     FM OG OI CL
 FD:  5 20 20 24
 FS:  5 35 35 37
 FZ: 26 60 20 22
 FC: 54 90 35 37

Interpreting these results:
1) 
 Adding -ffast-math to gcc's CFLAGS (above, from OG to FM) leads to
 a dramatic speedup:

 of ~19% (from 80% of MKL speed to 95% of that) for FD
 of ~46% (from 65% of MKL speed to 95% of that) for FS
 of ~85% (from 40% of MKL speed to 74% of that) for FZ
 of 360% (from 10% of MKL speed to 46% of that) for FC
 The boost in the "float complex" case is especially striking.
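
For reference, the arithmetic behind these figures can be sketched as
follows: a gap of g% means running at (100 - g)% of MKL speed, and the
speedup is the ratio of the two relative speeds minus one. The function
names relative_speed and speedup_percent are made up for this
illustration.

```c
#include <assert.h>

/* Speed relative to MKL (in percent), given the gap to MKL (in percent). */
double relative_speed(double gap_percent)
{
    return 100.0 - gap_percent;
}

/* Speedup (in percent) obtained when the gap to MKL shrinks
 * from gap_before to gap_after. */
double speedup_percent(double gap_before, double gap_after)
{
    assert(gap_before < 100.0);  /* a 100% gap would mean zero speed */
    return 100.0 * (relative_speed(gap_after) / relative_speed(gap_before) - 1.0);
}
```

For example, speedup_percent(90, 54) corresponds to the FC row: going
from 10% to 46% of MKL speed is a factor of 4.6, i.e. a 360% speedup.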

2) FM (gcc -ffast-math) outperforms CL (icc -O3 -complex-limited) in the
   FD and FS cases, although not in FZ and FC.
   Could this mean that icc is still being treated unfairly?
   Perhaps I should "downgrade" gcc's flags?!

My initial goal was twofold:

 - compare my code compiled with icc against the same code compiled with gcc
 - compare my code to MKL (but in multithreaded mode, so it's outside
   this topic)
In both cases I want to use a "reasonable" level of optimization, not an
extreme one.
Or even better, a "comparable" level, especially when it comes to "my code
vs MKL".

I looked up -mp1 in man icc; as I understand it, it concerns comparisons and
transcendentals, so it does not apply here.
man icc also mentions that '-fp-model=fast=1' (rather than -fp-model=precise)
is the default. '-fp-model=fast=2' is available (but scarcely documented);
maybe that would bring icc closer to gcc's -ffast-math.

Uhm.. I guess a compromise lies somewhere in between gcc's -O3 and -O3 -ffast-math ?!
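
One candidate for such a middle ground might be gcc's -fcx-limited-range,
which the GCC manual lists as one of the options enabled by -ffast-math.
According to the manual, with it the compiler no longer checks whether a
complex multiplication produced NaN + I*NaN and no longer tries to rescue
the result in that case. I have not measured it here, so whether it
accounts for the FZ/FC gains is an assumption. The sketch below writes out
by hand the plain product that this corresponds to; cmul_limited is a
made-up name.

```c
#include <complex.h>

/* Textbook complex product: 4 multiplications, 1 subtraction, 1 addition.
 * No recovery of infinities when intermediate results are NaN,
 * unlike what C99 Annex G recommends for the built-in '*' operator. */
static float complex cmul_limited(float complex a, float complex b)
{
    float ar = crealf(a), ai = cimagf(a);
    float br = crealf(b), bi = cimagf(b);
    float re = ar * br - ai * bi;
    float im = ar * bi + ai * br;
    return re + im * I;   /* adequate for finite re and im */
}
```

For finite operands that do not overflow, this should agree with a * b up
to rounding; the difference lies only in the handling of Inf and NaN.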