Re: "float complex" arithmetic performance much slower than expected

On 3/6/2013 4:13 PM, Michele Martone wrote:
On 20130306@20:49, Alexander Monakov wrote:

On Wed, 6 Mar 2013, Michele Martone wrote:


  . CFLAGS for gcc:
"-O3 -pipe -march=native -mtune=native -mavx -std=c99 -fno-unroll-loops"
  . CFLAGS for icc: "-O3 -xAVX -restrict -unroll=0"
This makes a comparison "unfair" since the two compilers use different
optimization restrictions for floating-point operations by default (GCC is
conservative, and thus more restricted in optimizations).  See the
documentation for -ffast-math GCC option, and floating-point flags in the ICC
help (e.g. options -fp-model and -mp).
Alexander, Tim.

I did some experiments by following your suggestions.
First, adding -ffast-math to gcc's CFLAGS.
Then, adding -complex-limited to icc's CFLAGS.

I should note that my functions perform only integer and floating-point
array accesses, and that the only operations on the floating-point numbers
are sums and products.

The table below gives the percentage gap relative to the MKL routine,
that is, how much "performance" (inversely proportional to time) each
case is missing with respect to MKL. Lower is better.

OG=the original gcc's CFLAGS above (former results)
OI=the original icc's flags above  (former results)
FM=gcc CFLAGS with added -ffast-math (based on Alexander's suggestion)
CL=icc CFLAGS with added -complex-limited

        [GCC]   [ICC]
       FM  OG  OI  CL
  FD:   5  20  20  24
  FS:   5  35  35  37
  FZ:  26  60  20  22
  FC:  54  90  35  37

Interpreting these results:
1)
  Adding -ffast-math to gcc's CFLAGS (above, from OG to FM) leads to
  a dramatic speedup:

  of ~19% (from 80% of MKL speed to 95% of that) for FD
  of ~46% (from 65% of MKL speed to 95% of that) for FS
  of ~85% (from 40% of MKL speed to 74% of that) for FZ
  of 360% (from 10% of MKL speed to 46% of that) for FC
  And here, the kick to the "float complex" is striking.

2) FM (gcc -ffast-math) outperforms CL (icc -O3 -complex-limited) for FD
    and FS, though not for FZ and FC.
    Does this mean that the comparison is still unfair to icc?
    Or should I rather "downgrade" gcc's flags?!
My initial goal was twofold:

  - compare my code as built by icc to the same code as built by gcc
  - compare my code to MKL (but in multithreaded mode, so it's out of
    this topic)
In both cases I want to use a "reasonable" level of optimization, not an
extreme one.
Or, even better, a "comparable" level, especially when it comes to "my code
vs MKL".

I checked man icc for -mp1; as I understand it, it affects comparisons and
transcendentals, so it does not apply here.
man icc also mentions that '-fp-model=fast=1' (rather than -fp-model=precise) is
the default. '-fp-model=fast=2' is available (but sparsely documented); maybe
that would bring icc closer to -ffast-math.

Hmm... I guess a compromise lies somewhere between gcc's -O3 and -O3 -ffast-math?!
icc -fp-model fast=1 -complex-limited-range is similar to gcc -ffast-math, depending on the gcc version. icc -fp-model fast=2 includes -complex-limited-range, but it may be excessively aggressive beyond that. As you say, you take your chances on possible unexpected effects.

icc -mp1 is an obsolete option from before the advent of the -fp-model options. It cut back on some unnecessarily risky x87 shortcuts.

About all I know of that you can do to cut back while keeping the most desirable parts of -ffast-math is to add -fno-cx-limited-range, depending on whether that is desirable (as well as safer) for your application.

In my opinion, the most objectionable and unnecessary feature of both gcc -ffast-math and icc -fp-model fast is the K&R-style relaxation of the standard on parentheses. If it were not for the desirability of sum reduction optimization, I would not use these options.

--
Tim Prince


