Data and common sense about PIV optimizations, Gcc and the Intel compiler


 



From benchmarks and with Gcc:

1) Most of the optimization (at least 80% of it) comes not from
processor-specific instructions but from selecting the alternatives
that are better suited to a specific processor.  On a PIII you gain
around 10% by using -mcpu=i686 instead of -mcpu=i386 (-mcpu=i686 means
the compiler will select the faster sequences for the PIII family
but will use only 386 instructions), while using -march=i686 gains
only a paltry 2% over -mcpu=i686.  I don't know if this is because
there is little to be gained or because gcc does a bad job.
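For concreteness, here is what the two flags look like on a gcc of
that era (2.95/3.x); `myprog.c` is just a stand-in for whatever you
are building:

```shell
# -mcpu only changes instruction selection and scheduling; the
# output still runs on any i386-class machine.
gcc -O2 -mcpu=i686 -o myprog_tuned myprog.c

# -march additionally permits i686-only instructions (cmov etc.),
# so the binary will no longer run on a 386 or 486.
gcc -O2 -march=i686 -o myprog_686only myprog.c
```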

2) RedHat compiles most packages with -mcpu=i686 (except for software
that has parts in assembler, like the kernel and glibc; for these you
get a processor-specific package compiled with -march=).  Thus if you
are using a PIV and recompile with -march=pentium4, you will probably
gain little from the -mcpu=pentium4 to -march=pentium4 step (ie from
using processor-specific instructions), since use of processor-specific
instructions is nearly irrelevant even in the far more mature i686
optimizer.  For the -mcpu=i686 to -mcpu=pentium4 step it would be nice
if someone ran a few benchmarks: on the one hand the Pentium 4 is
reported to be highly sensitive to exact instruction ordering (much
more than a PIII), but on the other hand the PIV optimizer in gcc is
quite young and I would bet it doesn't do an outstanding job.
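A minimal sketch of the benchmark I have in mind (it assumes a
CPU-bound test program `bench.c`, which is hypothetical, and a gcc
recent enough to know pentium4, ie 3.1 or later):

```shell
# Same source, three scheduling models; compare wall-clock times.
for cpu in i386 i686 pentium4; do
    gcc -O2 -mcpu=$cpu -o bench_$cpu bench.c
    echo "-mcpu=$cpu:"
    time ./bench_$cpu
done
```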

3) The above discussion does not cover use of MMX/SSE instructions.
I benchmarked them and they seemed to produce zero difference.  However
a) my benchmark was probably not adequate for exercising them, and
b) the processors I have access to are slow when switching from MMX to
SSE mode and back, so you need long sequences of MMX instructions in
order to recover the "investment".  In newer processors like the Athlon
XP there is nearly zero overhead for mode switching, so you would
probably get better results for MMX/SSE with an Athlon or a PIV.
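For reference, the gcc 3.x flags involved (again with a hypothetical
`bench.c`); enabling the instruction sets and actually routing scalar
floating point through them are separate switches:

```shell
# Allow MMX/SSE/SSE2 instructions without raising the baseline arch:
gcc -O2 -mcpu=pentium4 -mmmx -msse -msse2 -o bench_simd bench.c

# On top of a pentium4 baseline, do scalar FP in SSE instead of x87:
gcc -O2 -march=pentium4 -mfpmath=sse -o bench_ssefp bench.c
```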


Gcc 3.2 versus Intel compiler.

1) Benchmarks compiled with Icc seem to be 30 to 40% faster than with
Gcc.  However a) they are much bigger (double the size or more); Icc
seems to do a LOT of inlining.  And b) when you read Icc's doc you
notice that Icc does function inlining at the -O1 level of optimization,
while Gcc does not use optimizations that have harmful effects at -O2
or below.  Since function inlining makes code bigger, gcc does not use
it at -O2; you have to use -O3.  Thus the only valid comparison is
Icc -O1 versus gcc -O3.  At those settings Gcc code ran nearly as fast
as Icc's; in some tests it was even faster.  Code was larger than with
gcc -O2 but still much smaller than Icc's.
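Spelled out, the like-for-like comparison is the following (icc flag
spellings as in Intel's Linux compiler of that vintage; `bench.c` is
hypothetical):

```shell
# Both compilers at the level where function inlining kicks in:
icc -O1 -o bench_icc bench.c    # icc already inlines at -O1
gcc -O3 -o bench_gcc bench.c    # gcc only inlines at -O3

size bench_icc bench_gcc        # compare binary sizes
time ./bench_icc
time ./bench_gcc
```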

2) Using optimization levels above -O1 seems to have zero effect with
Icc.  In gcc the "stopping point" is at -O3: beyond it the gains are
very small.

3) With Icc you can turn on the flags for interprocedural
optimizations.  These made my benchmarks run another 20 or 30% faster
above the base result.  There is no combination of flags in gcc that
allows you to even touch the level of performance you get with Icc when
interprocedural optimizations are turned on, and still less when you
allow optimizations across files.  However, both of these make Icc code
significantly larger (remember it was already very large).  That is why,
while interprocedural optimizations are great for benchmarks, I am not
so sure they would be a good idea for, say, StarOffice: there is a good
chance the much larger Icc binaries would run out of cache or TLB
entries, and that would cause a slowdown far larger than the
acceleration from Icc's better code.
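For the record, the flags in question as I understand Intel's docs
(treat the exact spellings as an assumption, and the file names as
hypothetical):

```shell
# -ip : interprocedural optimization within a single source file
icc -O1 -ip  -o bench_ip  bench.c

# -ipo: interprocedural optimization across all files of the program
icc -O1 -ipo -o bench_ipo main.c util.c
```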


A bit of common sense.

Frankly I am a bit annoyed when I read the hype about Gentoo or LFS and
how you will get precisely tuned binaries that will cure cancer and
bring peace on earth.  IMHO this is drivel for mathematically impaired
people, at least if you are using a PIII.  Let's remember that RedHat's
ordinary binaries are only 2% away from the maximum you can get (by
recompiling with -march=i686), and that the special binaries (eg kernel
and glibc) are already compiled with full optimization.  What does that
mean?  Let's say your box spends a day recompiling the distribution.
You will only recover your investment after 50 days.  Nearly two
months.  But this assumes a) it spends these two months doing pure
number crunching, with no disk activity and no waiting for user input,
and b) it spends zero time in glibc or the kernel (except for clock
ticks), since the original glibc and kernel are already compiled with
full optimizations.  In a realistic scenario you will never recover
your investment before upgrade day.
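The arithmetic behind the 50-day figure, as a one-liner (the 2% gain
and the one-day recompile are the numbers assumed above):

```shell
# Break-even: the saved fraction of CPU time must add up to the day
# spent compiling, so days = compile_days / gain = 1 / 0.02 = 50.
awk 'BEGIN {
    gain = 0.02        # speedup fraction from recompiling with -march=i686
    compile_days = 1   # time spent recompiling the distribution
    printf "break-even after %.0f days of pure CPU-bound work\n", compile_days / gain
}'
```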

I don't know about the PIV.  For the Athlon I can only make an educated
guess: AMD knows well that most of the time its processors will be
running code that has been optimized for Intel ones, so AMD cannot make
processors whose performance crumbles when the instruction sequence is
not exactly optimized for them.  So AMD processors either have to be
agnostic (ie sequence A and sequence B are equally fast on them) or
have speed tables close to the speed table of their main Intel rival
(ie if A is faster than B on Intel, AMD will ensure it is also faster
on the Athlon).  That is why I doubt compiling specifically for the
Athlon makes code much faster than the PIII-optimized code shipped by
RedHat.  It also depends on whether gcc is really good at optimizing
for the Athlon.  And that is a big if.

Anyone willing to run a few benchmarks on an Athlon or a PIV?

			JFM



_______________________________________________
Redhat-devel-list mailing list
Redhat-devel-list@redhat.com
https://listman.redhat.com/mailman/listinfo/redhat-devel-list
