Re: gcc 3.4.3: -march optimization for Intel Core2Duo

Ryan Hill <dirtyepic.sk@xxxxxxxxx> · Sat, 07 Oct 2006 00:36:10 -0600

Timothy C Prince wrote:
> From: Vladimir Makarov <vmakarov@xxxxxxxxxx>
>
> Ian Lance Taylor wrote:
>> "Jan Dillmann" <jan.dillmann@xxxxxxxxxx> writes:

>>> we are running several benchmarks (SpecCPU200...) on 32-bit
>>> linux-systems and are able to set an optimization-parameter for
>>> '-march'. We use Intel Core2Duo-CPUs. Which parameter should we
>>> use (nocona, prescott...) ?
>>> 
>>> 
>> gcc 3.4.3 has no specific tuning for Core2 Duo, if for no other
>> reason than the release was made before the processors became
>> available.  My guess would be that you will get the best results
>> with -mtune=nocona. But it is only a guess.
>> 
>> 
>> 
> I believe that pentium-m will work better.  Nocona (an x86_64
> processor) is based on the northwood/prescott core, which is a high
> frequency core with long pipelines.  Core2 Duo is closer to Pentium M
> (a lower frequency core with much shorter pipelines).  However, using
> pentium-m will produce bigger code than nocona because loop/function
> alignment will be forced (the northwood core is not so sensitive to
> alignment, therefore alignment is not done when -mtune=nocona is
> used).  I don't remember Intel's recommendation about aligning code
> for Core Duo (probably it is the same as for Pentium M).
> 
> 
> 
> ________________________________
> 
> FWIW, pentium-m is optimized by using 387 code for nearly everything
> except (int) casts. This is because of the Banias SSE decoder
> bottleneck. If you use -march=pentium-m, you would add -mfpmath=sse to
> attempt to get code closer to optimal for any CPU other than
> Banias/Dothan. The OP's question was about Core 2 Duo, a more advanced
> (64-bit capable) CPU than Core Duo.
>
> Tim Prince
> 

I did some very unscientific and limited benchmarking of GCC trunk
performance in tramp3d.  Any analysis or suggestions, and other
benchmark numbers would be greatly appreciated if you can supply them.
I plan to do more when I have some time.

Also, this is on a Core Duo rather than a Core 2 Duo.  Does that make a
significant difference?

----------
(http://forums.gentoo.org/viewtopic-p-3602555.html#3601332)

ok, i did one simple c++ benchmark using TraMP3d-v4. keep in mind it's
just one benchmark.

the system used was a Toshiba Satellite A100 laptop with a Core Duo
T2300 @ 1.66GHz (Yonah), 2MiB shared L2 cache, and 1GiB of memory. the
GCC version used was 4.1-branch svn built yesterday.

[-O2 -march=prescott -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=prescott -fomit-frame-pointer -pipe -Dleafify=flatten
tramp3d-v4.cpp  -o tramp3d-v4-prescott
95.45user 0.84system 1:35.69elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+202080minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-prescott -n 25 --cartvis 1.0 0.0
--rhomin 1e-8
Using
  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
  solving eeq
  time increments from [0, 1.79769e+308], cfl 0.5
  starting at t = 0, i = 1
  cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  face  physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  periodic boundaries in X Y Z
i = 1    t = 0.00209225  dt = 0.00209225 (0.07124s/it)
i = 2    t = 0.00410537  dt = 0.00201312 (0.946142s/it)
i = 3    t = 0.00603889  dt = 0.00193352 (0.966466s/it)
i = 4    t = 0.00794139  dt = 0.00190251 (0.975241s/it)
i = 5    t = 0.00984636  dt = 0.00190497 (0.97465s/it)
i = 6    t = 0.0117508   dt = 0.00190449 (0.985882s/it)
i = 7    t = 0.013681    dt = 0.00193011 (1.0047s/it)
i = 8    t = 0.0156598   dt = 0.0019788 (1.00467s/it)
i = 9    t = 0.0176706   dt = 0.00201081 (1.00171s/it)
i = 10   t = 0.0197364   dt = 0.0020658 (1.0184s/it)
i = 11   t = 0.0218716   dt = 0.0021352 (1.01445s/it)
i = 12   t = 0.0240721   dt = 0.00220057 (1.00954s/it)
i = 13   t = 0.0263471   dt = 0.002275 (1.01139s/it)
i = 14   t = 0.0287159   dt = 0.00236875 (1.01714s/it)
i = 15   t = 0.0311533   dt = 0.00243738 (1.01269s/it)
i = 16   t = 0.0336768   dt = 0.0025235 (1.01118s/it)
i = 17   t = 0.0362863   dt = 0.00260952 (1.00748s/it)
i = 18   t = 0.0389715   dt = 0.00268521 (1.00433s/it)
i = 19   t = 0.0417381   dt = 0.00276665 (1.00053s/it)
i = 20   t = 0.0445873   dt = 0.00284919 (1.00177s/it)
i = 21   t = 0.0475216   dt = 0.0029343 (0.989871s/it)
i = 22   t = 0.0505258   dt = 0.00300413 (0.997915s/it)
i = 23   t = 0.0535938   dt = 0.00306807 (0.98717s/it)
i = 24   t = 0.0567043   dt = 0.0031105 (0.989589s/it)
i = 25   t = 0.0598233   dt = 0.00311892 (0.987146s/it)
Time spent in iteration: 23.9913
Correctness:
        sum(rh) difference = 1.45519e-11
        sum(vx) = -0.242582
        sum(vy) = -0.295116
        sum(vz) = -0.335474
        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-prescott
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-prescott, please wait...
i486:    0 i586:    0 ppro:  130 mmx:    0 sse:    0 sse2:    0 sse3:    2
tramp3d-v4-prescott will run on Pentium IV (pentium4) w/ SSE3 or higher
processor.

[-O2 -march=pentium-m -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=pentium-m -fomit-frame-pointer -pipe -Dleafify=flatten
tramp3d-v4.cpp  -o tramp3d-v4-pentiumm-plain
97.74user 0.74system 1:38.47elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (11major+200253minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-plain -n 25 --cartvis 1.0
0.0 --rhomin 1e-8
Using
  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
  solving eeq
  time increments from [0, 1.79769e+308], cfl 0.5
  starting at t = 0, i = 1
  cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  face  physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  periodic boundaries in X Y Z
i = 1    t = 0.00209225  dt = 0.00209225 (0.0692961s/it)
i = 2    t = 0.00410537  dt = 0.00201312 (0.992859s/it)
i = 3    t = 0.00603889  dt = 0.00193352 (1.0033s/it)
i = 4    t = 0.00794139  dt = 0.00190251 (0.975363s/it)
i = 5    t = 0.00984636  dt = 0.00190497 (0.98926s/it)
i = 6    t = 0.0117508   dt = 0.00190449 (0.986304s/it)
i = 7    t = 0.013681    dt = 0.00193011 (0.997433s/it)
i = 8    t = 0.0156598   dt = 0.0019788 (0.99804s/it)
i = 9    t = 0.0176706   dt = 0.00201081 (1.00585s/it)
i = 10   t = 0.0197364   dt = 0.0020658 (1.00463s/it)
i = 11   t = 0.0218716   dt = 0.0021352 (1.01035s/it)
i = 12   t = 0.0240721   dt = 0.00220057 (1.00643s/it)
i = 13   t = 0.0263471   dt = 0.002275 (1.00908s/it)
i = 14   t = 0.0287159   dt = 0.00236875 (1.00359s/it)
i = 15   t = 0.0311533   dt = 0.00243738 (1.00683s/it)
i = 16   t = 0.0336768   dt = 0.0025235 (1.0018s/it)
i = 17   t = 0.0362863   dt = 0.00260952 (1.00395s/it)
i = 18   t = 0.0389715   dt = 0.00268521 (0.994894s/it)
i = 19   t = 0.0417381   dt = 0.00276665 (0.995252s/it)
i = 20   t = 0.0445873   dt = 0.00284919 (0.992024s/it)
i = 21   t = 0.0475216   dt = 0.0029343 (0.989914s/it)
i = 22   t = 0.0505258   dt = 0.00300413 (0.984155s/it)
i = 23   t = 0.0535938   dt = 0.00306807 (0.986609s/it)
i = 24   t = 0.0567043   dt = 0.0031105 (0.981239s/it)
i = 25   t = 0.0598233   dt = 0.00311892 (0.986686s/it)
Time spent in iteration: 23.9751
Correctness:
        sum(rh) difference = 1.45519e-11
        sum(vx) = -0.242582
        sum(vy) = -0.295116
        sum(vz) = -0.335474
        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-plain
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm-plain, please wait...
i486:    0 i586:    0 ppro:  135 mmx:    0 sse:    0 sse2:    4 sse3:    0
tramp3d-v4-pentiumm-plain will run on Pentium IV (pentium4) or higher
processor.

[-O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=pentium-m -msse3 -fomit-frame-pointer -pipe -Dleafify=flatten
tramp3d-v4.cpp  -o tramp3d-v4-pentiumm
97.73user 1.01system 1:38.05elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+197280minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm -n 25 --cartvis 1.0 0.0
--rhomin 1e-8
Using
  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
  solving eeq
  time increments from [0, 1.79769e+308], cfl 0.5
  starting at t = 0, i = 1
  cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  face  physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  periodic boundaries in X Y Z
i = 1    t = 0.00209225  dt = 0.00209225 (0.069342s/it)
i = 2    t = 0.00410537  dt = 0.00201312 (0.968165s/it)
i = 3    t = 0.00603889  dt = 0.00193352 (0.985737s/it)
i = 4    t = 0.00794139  dt = 0.00190251 (0.999364s/it)
i = 5    t = 0.00984636  dt = 0.00190497 (1.01105s/it)
i = 6    t = 0.0117508   dt = 0.00190449 (1.01161s/it)
i = 7    t = 0.013681    dt = 0.00193011 (1.02449s/it)
i = 8    t = 0.0156598   dt = 0.0019788 (1.02412s/it)
i = 9    t = 0.0176706   dt = 0.00201081 (1.02851s/it)
i = 10   t = 0.0197364   dt = 0.0020658 (1.02592s/it)
i = 11   t = 0.0218716   dt = 0.0021352 (1.03424s/it)
i = 12   t = 0.0240721   dt = 0.00220057 (1.0353s/it)
i = 13   t = 0.0263471   dt = 0.002275 (1.03373s/it)
i = 14   t = 0.0287159   dt = 0.00236875 (1.03266s/it)
i = 15   t = 0.0311533   dt = 0.00243738 (1.03526s/it)
i = 16   t = 0.0336768   dt = 0.0025235 (1.02011s/it)
i = 17   t = 0.0362863   dt = 0.00260952 (1.0232s/it)
i = 18   t = 0.0389715   dt = 0.00268521 (1.02476s/it)
i = 19   t = 0.0417381   dt = 0.00276665 (1.0153s/it)
i = 20   t = 0.0445873   dt = 0.00284919 (1.00431s/it)
i = 21   t = 0.0475216   dt = 0.0029343 (1.00313s/it)
i = 22   t = 0.0505258   dt = 0.00300413 (0.989761s/it)
i = 23   t = 0.0535938   dt = 0.00306807 (0.99909s/it)
i = 24   t = 0.0567043   dt = 0.0031105 (0.989536s/it)
i = 25   t = 0.0598233   dt = 0.00311892 (0.996134s/it)
Time spent in iteration: 24.3848
Correctness:
        sum(rh) difference = 1.45519e-11
        sum(vx) = -0.242582
        sum(vy) = -0.295116
        sum(vz) = -0.335474
        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm, please wait...
i486:    0 i586:    0 ppro:  135 mmx:    0 sse:    0 sse2:    0 sse3:    2
tramp3d-v4-pentiumm will run on Pentium IV (pentium4) w/ SSE3 or higher
processor.

[-O2 -march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe
-Dleafify=flatten tramp3d-v4.cpp  -o tramp3d-v4-pentiumm-sse
98.40user 0.94system 1:39.15elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (3major+198438minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-sse -n 25 --cartvis 1.0
0.0 --rhomin 1e-8
Using
  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
  solving eeq
  time increments from [0, 1.79769e+308], cfl 0.5
  starting at t = 0, i = 1
  cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  face  physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  periodic boundaries in X Y Z
i = 1    t = 0.00209225  dt = 0.00209225 (0.0617449s/it)
i = 2    t = 0.00410537  dt = 0.00201312 (0.897831s/it)
i = 3    t = 0.00603889  dt = 0.00193352 (0.964484s/it)
i = 4    t = 0.00794139  dt = 0.00190251 (0.94189s/it)
i = 5    t = 0.00984636  dt = 0.00190497 (0.972172s/it)
i = 6    t = 0.0117508   dt = 0.00190449 (0.973818s/it)
i = 7    t = 0.013681    dt = 0.00193011 (0.984364s/it)
i = 8    t = 0.0156598   dt = 0.0019788 (0.988743s/it)
i = 9    t = 0.0176706   dt = 0.00201081 (0.996885s/it)
i = 10   t = 0.0197364   dt = 0.0020658 (0.997118s/it)
i = 11   t = 0.0218716   dt = 0.0021352 (1.00016s/it)
i = 12   t = 0.0240721   dt = 0.00220057 (0.99685s/it)
i = 13   t = 0.0263471   dt = 0.002275 (0.998231s/it)
i = 14   t = 0.0287159   dt = 0.00236875 (1.00025s/it)
i = 15   t = 0.0311533   dt = 0.00243738 (0.987068s/it)
i = 16   t = 0.0336768   dt = 0.0025235 (0.981898s/it)
i = 17   t = 0.0362863   dt = 0.00260952 (0.990963s/it)
i = 18   t = 0.0389715   dt = 0.00268521 (0.986071s/it)
i = 19   t = 0.0417381   dt = 0.00276665 (0.980461s/it)
i = 20   t = 0.0445873   dt = 0.00284919 (0.982345s/it)
i = 21   t = 0.0475216   dt = 0.0029343 (1.00055s/it)
i = 22   t = 0.0505258   dt = 0.00300413 (0.995297s/it)
i = 23   t = 0.0535938   dt = 0.00306807 (1.00189s/it)
i = 24   t = 0.0567043   dt = 0.0031105 (1.00527s/it)
i = 25   t = 0.0598233   dt = 0.00311892 (1.01299s/it)
Time spent in iteration: 23.6994
Correctness:
        sum(rh) difference = 1.28966e-08
        sum(vx) = -0.242582
        sum(vy) = -0.295116
        sum(vz) = -0.335474
        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-sse
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm-sse, please wait...
i486:    0 i586:    0 ppro:   84 mmx:   44 sse:    0 sse2: 3089 sse3:    0
tramp3d-v4-pentiumm-sse will run on Pentium IV (pentium4) or higher
processor.

Keep in mind that anything that does strip-flags (e.g. GCC, glibc, the
kernel) will remove both -msse3 and -mfpmath from your C[XX]FLAGS.

Very little difference in runtimes, roughly half a second (23.70s to
24.38s across the four builds), and next to no difference in compile
time. Surprisingly,
-O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe was the slowest.
I reran the test to be sure and it was slightly worse (24.5397s) than
the original run.
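
For anyone who wants to see the four totals side by side, here is a
minimal Python sketch. The numbers are copied from the "Time spent in
iteration" lines of the first run of each build; the dictionary keys
are shorthand labels made up for this example, not GCC options.

```python
# Totals in seconds, copied from the "Time spent in iteration" lines above.
# The keys are shorthand labels for the flag sets, not real GCC options.
totals = {
    "prescott": 23.9913,
    "pentium-m": 23.9751,
    "pentium-m + sse3": 24.3848,
    "pentium-m + sse3 + fpmath=sse": 23.6994,
}

fastest = min(totals, key=totals.get)
slowest = max(totals, key=totals.get)
spread = totals[slowest] - totals[fastest]

# Print the builds from fastest to slowest with the gap to the fastest one.
for label, seconds in sorted(totals.items(), key=lambda kv: kv[1]):
    rel = 100.0 * (seconds - totals[fastest]) / totals[fastest]
    print(f"{label:32s} {seconds:8.4f}s  +{rel:.2f}%")

print(f"spread: {spread:.4f}s")
```

The gap between fastest and slowest is under 3%, and the rerun of the
slowest build already moved by about 0.15s on its own, so single runs
like these probably cannot rank the flag sets reliably.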

It also appears -mfpmath=sse does not generate sse3 instructions.
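
The instruction counts behind that observation can be compared with a
small Python sketch. It assumes the analyze-x86 summary line always has
the "name: count" layout seen in the output above; parse_counts is a
helper name invented for this example.

```python
import re

def parse_counts(line):
    """Turn an analyze-x86 summary line into a {category: count} dict."""
    return {name: int(n) for name, n in re.findall(r"(\w+):\s*(\d+)", line)}

# Summary lines copied from the -msse3 and -msse3 -mfpmath=sse runs above.
plain_sse3 = parse_counts(
    "i486:    0 i586:    0 ppro:  135 mmx:    0 sse:    0 sse2:    0 sse3:    2")
fpmath_sse = parse_counts(
    "i486:    0 i586:    0 ppro:   84 mmx:   44 sse:    0 sse2: 3089 sse3:    0")

# Show each category before and after adding -mfpmath=sse.
for name in plain_sse3:
    print(f"{name:5s} {plain_sse3[name]:5d} -> {fpmath_sse[name]:5d}")
```

With -mfpmath=sse the sse2 count goes from 0 to 3089 while the sse3
count drops from 2 to 0, which is what the remark above refers to.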

----------

I also tested "-O2 -march=prescott -mfpmath=sse -fomit-frame-pointer
-pipe".  I forgot to record the results, but the times were a small
improvement over "-O2 -march=pentium-m -msse3 -mfpmath=sse
-fomit-frame-pointer -pipe", somewhere between 22.8 and 23.2s if I
remember correctly.
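
To avoid losing results like that, the benchmark output could be saved
to a file and summarized afterwards. The following Python sketch is one
way to do it; summarize and the two regular expressions are invented
for this example and assume the output layout shown in the runs above.
The sample text is a truncated copy of the -mfpmath=sse run, so its
total still refers to all 25 iterations.

```python
import re

# Patterns for the per-iteration lines and the final total line.
ITER_RE = re.compile(r"^i = (\d+)\s+t = \S+\s+dt = \S+ \(([\d.]+)s/it\)$")
TOTAL_RE = re.compile(r"^Time spent in iteration: ([\d.]+)$")

def summarize(output):
    """Return (total_seconds, mean_s_per_it) from tramp3d-v4 output text.

    The first iteration is left out of the mean because it is far
    shorter than the others in every run shown above.
    """
    per_it, total = [], None
    for line in output.splitlines():
        line = line.strip()
        m = ITER_RE.match(line)
        if m:
            per_it.append(float(m.group(2)))
            continue
        m = TOTAL_RE.match(line)
        if m:
            total = float(m.group(1))
    steady = per_it[1:]
    mean = sum(steady) / len(steady) if steady else None
    return total, mean

# Truncated sample: three iteration lines plus the total of the full run.
sample = """\
i = 1    t = 0.00209225  dt = 0.00209225 (0.0617449s/it)
i = 2    t = 0.00410537  dt = 0.00201312 (0.897831s/it)
i = 3    t = 0.00603889  dt = 0.00193352 (0.964484s/it)
Time spent in iteration: 23.6994
"""
total, mean = summarize(sample)
print(total, mean)
```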

--de.
