Re: "float complex" arithmetic performance much slower than expected

Tim Prince <n8tm@xxxxxxx> · Wed, 06 Mar 2013 20:07:09 -0500

On 3/6/2013 4:13 PM, Michele Martone wrote:
On 20130306@20:49, Alexander Monakov wrote:

On Wed, 6 Mar 2013, Michele Martone wrote:

  . CFLAGS for gcc:
"-O3 -pipe -march=native -mtune=native -mavx -std=c99 -fno-unroll-loops"
  . CFLAGS for icc: "-O3 -xAVX -restrict -unroll=0"

This makes a comparison "unfair" since the two compilers use different
optimization restrictions for floating-point operations by default (GCC is
conservative, and thus more restricted in optimizations).  See the
documentation for -ffast-math GCC option, and floating-point flags in the ICC
help (e.g. options -fp-model and -mp).

Alexander, Tim.

I did some experiments by following your suggestions.
First, adding -ffast-math to gcc's CFLAGS.
Then, adding -complex-limited to icc's CFLAGS.

I should say that my functions only perform integer and floating-point
array accesses and, on the floating-point numbers, only sums and
products.

The table below gives the percentage gap relative to the MKL routine,
that is, how much "performance" (inversely proportional to time) each
case is missing with respect to MKL.

OG=the original gcc's CFLAGS above (former results)
OI=the original icc's flags above  (former results)
FM=gcc CFLAGS with added -ffast-math (based on Alexander's suggestion)
CL=icc CFLAGS with added -complex-limited

      [GCC] [ICC]
      FM OG OI CL
  FD:  5 20 20 24
  FS:  5 35 35 37
  FZ: 26 60 20 22
  FC: 54 90 35 37

Interpreting these results:
1)
  Adding -ffast-math to gcc's CFLAGS (above, from OG to FM) leads to
  a dramatic speedup:

  of ~19% (from 80% of MKL speed to 95% of that) for FD
  of ~46% (from 65% of MKL speed to 95% of that) for FS
  of ~85% (from 40% of MKL speed to 74% of that) for FZ
  of 360% (from 10% of MKL speed to 46% of that) for FC
  And here, the kick to the "float complex" is striking.

2) FM (gcc -ffast-math) outperforms CL (icc -O3 -complex-limited).
    Might this mean that icc is still being treated unfairly?
    Perhaps I should "downgrade" gcc's flags?!
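
For reference, the speedups in item 1 follow from the gap-to-MKL values in
the table: a gap of g% means the code runs at (100 - g)% of MKL speed, so
moving from gap g_old to gap g_new is a relative speedup of
(100 - g_new) / (100 - g_old) - 1. A minimal C sketch of that arithmetic
(the helper names speedup_percent and print_speedup are made up for this
illustration and are not part of the benchmark code):

```c
#include <stdio.h>

/* Relative speedup, in percent, when the gap to MKL shrinks from
 * gap_old to gap_new (both are percentages of MKL performance).
 * Hypothetical helper, not taken from the code discussed in the thread. */
double speedup_percent(double gap_old, double gap_new)
{
    double speed_old = 100.0 - gap_old;  /* e.g. gap 20 -> 80% of MKL speed */
    double speed_new = 100.0 - gap_new;  /* e.g. gap  5 -> 95% of MKL speed */
    return (speed_new / speed_old - 1.0) * 100.0;
}

/* Print the OG -> FM speedup for one row of the table. */
void print_speedup(const char *label, double gap_og, double gap_fm)
{
    printf("%s: %.0f%%\n", label, speedup_percent(gap_og, gap_fm));
}
```

For example, speedup_percent(20, 5) gives 18.75, the ~19% quoted for FD, and
speedup_percent(90, 54) gives the 360% quoted for FC.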

My initial goal was twofold:

  - compare my code built with icc against the same code built with gcc
  - compare my code to MKL (but in multithreaded mode, so it's out of
    this topic)
And in both cases, using a "reasonable" level of optimization, not an
extreme one.
Or even better, a "comparable" level, especially when it comes to "my code
vs MKL".

I looked up -mp1 in man icc; as I understand it, it affects comparisons and
transcendentals, so it does not apply here.
I see man icc mentions '-fp-model=fast=1' (rather than -fp-model=precise) is the
default. '-fp-model=fast=2' is available (but little documented); maybe that
would push towards similarity with -ffast-math.

Uhm.. I guess a compromise lies somewhere in between gcc's -O3 and -O3 -ffast-math ?!

icc -fp-model fast=1 -complex-limited-range is similar to gcc
-ffast-math, depending on gcc version.  icc -fp-model fast=2 includes 
-complex-limited-range but it may be excessively aggressive beyond 
that.  As you say, you take your chances on possible unexpected effects.
icc -mp1 is an obsolete option from before the advent of the -fp-model 
options.  It cut back on some unnecessarily risky x87 shortcuts.
About the only way I know to cut back while keeping the most
desirable parts of -ffast-math is to add -fno-cx-limited-range,
depending on whether that is desirable (as well as safer) for your
application.
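
To illustrate what the limited-range setting trades away: GCC documents
-fcx-limited-range as dropping the range-reduction step from complex
division (and the NaN rescue check from complex multiplication and
division). The hand-written sketch below contrasts the textbook division
formula with a range-reduced variant (Smith's method). It is not the code
either compiler emits, and the names cfloat, cdiv_naive and cdiv_scaled are
made up for this example.

```c
#include <math.h>

typedef struct { float re, im; } cfloat;  /* hypothetical helper type */

/* Textbook formula:
 *   (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c*c + d*d)
 * The squares c*c and d*d can overflow even when the true quotient is
 * small. volatile forces each intermediate to be rounded to float. */
cfloat cdiv_naive(cfloat x, cfloat y)
{
    volatile float den = y.re * y.re + y.im * y.im;
    volatile float nre = x.re * y.re + x.im * y.im;
    volatile float nim = x.im * y.re - x.re * y.im;
    cfloat q = { nre / den, nim / den };
    return q;
}

/* Smith's method: divide through by the larger component of y first,
 * so no square of c or d is ever formed. */
cfloat cdiv_scaled(cfloat x, cfloat y)
{
    cfloat q;
    if (fabsf(y.re) >= fabsf(y.im)) {
        volatile float r = y.im / y.re;
        volatile float den = y.re + y.im * r;
        q.re = (x.re + x.im * r) / den;
        q.im = (x.im - x.re * r) / den;
    } else {
        volatile float r = y.re / y.im;
        volatile float den = y.re * r + y.im;
        q.re = (x.re * r + x.im) / den;
        q.im = (x.im * r - x.re) / den;
    }
    return q;
}
```

Dividing (1e30 + 1e30i) by itself shows the difference: the naive version
overflows in the denominator and yields a NaN real part, while the scaled
version returns 1 + 0i. Whether that extra safety matters depends on the
value ranges your application actually sees.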
In my opinion, the most objectionable and unnecessary feature of both
gcc -ffast-math and icc -fp-model fast is the K&R style relaxation of
the standard on parentheses, which lets the compiler regroup operations
across them.  If it were not for the desirability of sum reduction
optimization, I would not use these options.
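
A small C sketch of why both points matter: floating-point addition is not
associative, so regrouping across parentheses can change a result, and a
vectorized sum reduction is itself a regrouping. The function names are
made up for this example, volatile is used only to force each intermediate
to be rounded to float, and the code is meant to be built without
-ffast-math.

```c
#include <stddef.h>

/* (a + b) + c, with the intermediate rounded to float. */
float add_left(float a, float b, float c)
{
    volatile float t = a + b;
    return t + c;
}

/* a + (b + c): same operands, different grouping. */
float add_right(float a, float b, float c)
{
    volatile float t = b + c;
    return a + t;
}

/* Strict left-to-right sum, as the source order prescribes. */
float sum_sequential(const float *x, size_t n)
{
    volatile float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s = s + x[i];
    return s;
}

/* Four interleaved partial sums, roughly what a 4-wide SIMD sum
 * reduction computes; n is assumed to be a multiple of 4. */
float sum_four_lanes(const float *x, size_t n)
{
    volatile float lane[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (size_t i = 0; i < n; i += 4)
        for (size_t k = 0; k < 4; k++)
            lane[k] = lane[k] + x[i + k];
    volatile float lo = lane[0] + lane[1];
    volatile float hi = lane[2] + lane[3];
    return lo + hi;
}
```

In IEEE 754 single precision, add_left(1e8f, -1e8f, 1.0f) returns 1 while
add_right with the same arguments returns 0. For the array
{1e8f, 1, 1, 1, -1e8f, 1, 1, 1}, sum_sequential returns 3 and
sum_four_lanes returns 6 (the exact sum), so the reordering needed for a
vectorized reduction changes the result, here for the better.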

--
Tim Prince