Timothy C Prince wrote:
> From: Vladimir Makarov <vmakarov@xxxxxxxxxx>
> Ian Lance Taylor wrote:
>> "Jan Dillmann" <jan.dillmann@xxxxxxxxxx> writes:
>>> we are running several benchmarks (SpecCPU200...) on 32-bit
>>> linux-systems and are able to set an optimization-parameter for
>>> '-march'. We use Intel Core2Duo-CPUs. Which parameter should we
>>> use (nocona, prescott...) ?
>>
>> gcc 3.4.3 has no specific tuning for Core2 Duo, if for no other
>> reason than the release was made before the processors became
>> available. My guess would be that you will get the best results
>> with -mtune=nocona. But it is only a guess.
>
> I believe that pentium-m will work better. Nocona (an x86_64
> processor) is based on northwood/prescott core which is a high
> frequency core with long pipelines. Core2 Duo is closer to pentium M
> (lower frequency core with much shorter pipelines). Although usage
> of pentium-m will result in bigger code in comparison with nocona
> because aligning loop/function will be forced (northwood core is not
> so sensitive to aligning, therefore aligning is not done when
> -mtune=nocona is used). I don't remember Intel recommendation about
> aligning code for Core Duo (probably it is the same as for pentium M).
>
> ________________________________
>
> FWIW, pentium-m is optimized by using 387 code for nearly everything
> except (int) casts. This is because of the Banias SSE decoder
> bottleneck. If you use -march=pentium-m, you would add -mfpmath=sse to
> attempt to get code more optimum for any CPU other than
> Banias/Dothan. OP question was about Core 2 Duo, a more advanced
> (64-bit capable) CPU than Core Duo.
> Tim Prince

I did some very unscientific and limited benchmarking of GCC trunk
performance in tramp3d. Any analysis or suggestions, and other benchmark
numbers would be greatly appreciated if you can supply them. I plan to do
more when I have some time. Also, this is on a Core Duo rather than a
Core 2 Duo.
Does that make a significant difference?

----------
(http://forums.gentoo.org/viewtopic-p-3602555.html#3601332)

ok, i did one simple c++ benchmark using TraMP3d-v4. keep in mind it's just one benchmark.

the system used was a Toshiba Satellite A100 laptop with a Core Duo T2300 @ 1.66GHz (Yonah), 2MiB shared L2 cache, and 1GiB of memory. the GCC version used was 4.1-branch svn built yesterday.


[-O2 -march=prescott -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2 -march=prescott -fomit-frame-pointer -pipe -Dleafify=flatten tramp3d-v4.cpp -o tramp3d-v4-prescott
95.45user 0.84system 1:35.69elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+202080minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-prescott -n 25 --cartvis 1.0 0.0 --rhomin 1e-8
Using using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
solving eeq
time increments from [0, 1.79769e+308], cfl 0.5
starting at t = 0, i = 1
cell physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]
face physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]
periodic boundaries in X Y Z
i = 1 t = 0.00209225 dt = 0.00209225 (0.07124s/it)
i = 2 t = 0.00410537 dt = 0.00201312 (0.946142s/it)
i = 3 t = 0.00603889 dt = 0.00193352 (0.966466s/it)
i = 4 t = 0.00794139 dt = 0.00190251 (0.975241s/it)
i = 5 t = 0.00984636 dt = 0.00190497 (0.97465s/it)
i = 6 t = 0.0117508 dt = 0.00190449 (0.985882s/it)
i = 7 t = 0.013681 dt = 0.00193011 (1.0047s/it)
i = 8 t = 0.0156598 dt = 0.0019788 (1.00467s/it)
i = 9 t = 0.0176706 dt = 0.00201081 (1.00171s/it)
i = 10 t = 0.0197364 dt = 0.0020658 (1.0184s/it)
i = 11 t = 0.0218716 dt = 0.0021352 (1.01445s/it)
i = 12 t = 0.0240721 dt = 0.00220057 (1.00954s/it)
i = 13 t = 0.0263471 dt = 0.002275 (1.01139s/it)
i = 14 t = 0.0287159 dt = 0.00236875 (1.01714s/it)
i = 15 t = 0.0311533 dt = 0.00243738 (1.01269s/it)
i = 16 t = 0.0336768 dt = 0.0025235 (1.01118s/it)
i = 17 t = 0.0362863 dt = 0.00260952 (1.00748s/it)
i = 18 t = 0.0389715 dt = 0.00268521 (1.00433s/it)
i = 19 t = 0.0417381 dt = 0.00276665 (1.00053s/it)
i = 20 t = 0.0445873 dt = 0.00284919 (1.00177s/it)
i = 21 t = 0.0475216 dt = 0.0029343 (0.989871s/it)
i = 22 t = 0.0505258 dt = 0.00300413 (0.997915s/it)
i = 23 t = 0.0535938 dt = 0.00306807 (0.98717s/it)
i = 24 t = 0.0567043 dt = 0.0031105 (0.989589s/it)
i = 25 t = 0.0598233 dt = 0.00311892 (0.987146s/it)
Time spent in iteration: 23.9913
Correctness: sum(rh) difference = 1.45519e-11
sum(vx) = -0.242582
sum(vy) = -0.295116
sum(vz) = -0.335474
sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-prescott
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-prescott, please wait...
i486: 0 i586: 0 ppro: 130 mmx: 0 sse: 0 sse2: 0 sse3: 2
tramp3d-v4-prescott will run on Pentium IV (pentium4) w/ SSE3 or higher processor.


[-O2 -march=pentium-m -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2 -march=pentium-m -fomit-frame-pointer -pipe -Dleafify=flatten tramp3d-v4.cpp -o tramp3d-v4-pentiumm-plain
97.74user 0.74system 1:38.47elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (11major+200253minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-plain -n 25 --cartvis 1.0 0.0 --rhomin 1e-8
Using using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
solving eeq
time increments from [0, 1.79769e+308], cfl 0.5
starting at t = 0, i = 1
cell physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]
face physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]
periodic boundaries in X Y Z
i = 1 t = 0.00209225 dt = 0.00209225 (0.0692961s/it)
i = 2 t = 0.00410537 dt = 0.00201312 (0.992859s/it)
i = 3 t = 0.00603889 dt = 0.00193352 (1.0033s/it)
i = 4 t = 0.00794139 dt = 0.00190251 (0.975363s/it)
i = 5 t = 0.00984636 dt = 0.00190497 (0.98926s/it)
i = 6 t = 0.0117508 dt = 0.00190449 (0.986304s/it)
i = 7 t = 0.013681 dt = 0.00193011 (0.997433s/it)
i = 8 t = 0.0156598 dt = 0.0019788 (0.99804s/it)
i = 9 t = 0.0176706 dt = 0.00201081 (1.00585s/it)
i = 10 t = 0.0197364 dt = 0.0020658 (1.00463s/it)
i = 11 t = 0.0218716 dt = 0.0021352 (1.01035s/it)
i = 12 t = 0.0240721 dt = 0.00220057 (1.00643s/it)
i = 13 t = 0.0263471 dt = 0.002275 (1.00908s/it)
i = 14 t = 0.0287159 dt = 0.00236875 (1.00359s/it)
i = 15 t = 0.0311533 dt = 0.00243738 (1.00683s/it)
i = 16 t = 0.0336768 dt = 0.0025235 (1.0018s/it)
i = 17 t = 0.0362863 dt = 0.00260952 (1.00395s/it)
i = 18 t = 0.0389715 dt = 0.00268521 (0.994894s/it)
i = 19 t = 0.0417381 dt = 0.00276665 (0.995252s/it)
i = 20 t = 0.0445873 dt = 0.00284919 (0.992024s/it)
i = 21 t = 0.0475216 dt = 0.0029343 (0.989914s/it)
i = 22 t = 0.0505258 dt = 0.00300413 (0.984155s/it)
i = 23 t = 0.0535938 dt = 0.00306807 (0.986609s/it)
i = 24 t = 0.0567043 dt = 0.0031105 (0.981239s/it)
i = 25 t = 0.0598233 dt = 0.00311892 (0.986686s/it)
Time spent in iteration: 23.9751
Correctness: sum(rh) difference = 1.45519e-11
sum(vx) = -0.242582
sum(vy) = -0.295116
sum(vz) = -0.335474
sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-plain
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm-plain, please wait...
i486: 0 i586: 0 ppro: 135 mmx: 0 sse: 0 sse2: 4 sse3: 0
tramp3d-v4-pentiumm-plain will run on Pentium IV (pentium4) or higher processor.


[-O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe -Dleafify=flatten tramp3d-v4.cpp -o tramp3d-v4-pentiumm
97.73user 1.01system 1:38.05elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+197280minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm -n 25 --cartvis 1.0 0.0 --rhomin 1e-8
Using using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
solving eeq
time increments from [0, 1.79769e+308], cfl 0.5
starting at t = 0, i = 1
cell physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]
face physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]
periodic boundaries in X Y Z
i = 1 t = 0.00209225 dt = 0.00209225 (0.069342s/it)
i = 2 t = 0.00410537 dt = 0.00201312 (0.968165s/it)
i = 3 t = 0.00603889 dt = 0.00193352 (0.985737s/it)
i = 4 t = 0.00794139 dt = 0.00190251 (0.999364s/it)
i = 5 t = 0.00984636 dt = 0.00190497 (1.01105s/it)
i = 6 t = 0.0117508 dt = 0.00190449 (1.01161s/it)
i = 7 t = 0.013681 dt = 0.00193011 (1.02449s/it)
i = 8 t = 0.0156598 dt = 0.0019788 (1.02412s/it)
i = 9 t = 0.0176706 dt = 0.00201081 (1.02851s/it)
i = 10 t = 0.0197364 dt = 0.0020658 (1.02592s/it)
i = 11 t = 0.0218716 dt = 0.0021352 (1.03424s/it)
i = 12 t = 0.0240721 dt = 0.00220057 (1.0353s/it)
i = 13 t = 0.0263471 dt = 0.002275 (1.03373s/it)
i = 14 t = 0.0287159 dt = 0.00236875 (1.03266s/it)
i = 15 t = 0.0311533 dt = 0.00243738 (1.03526s/it)
i = 16 t = 0.0336768 dt = 0.0025235 (1.02011s/it)
i = 17 t = 0.0362863 dt = 0.00260952 (1.0232s/it)
i = 18 t = 0.0389715 dt = 0.00268521 (1.02476s/it)
i = 19 t = 0.0417381 dt = 0.00276665 (1.0153s/it)
i = 20 t = 0.0445873 dt = 0.00284919 (1.00431s/it)
i = 21 t = 0.0475216 dt = 0.0029343 (1.00313s/it)
i = 22 t = 0.0505258 dt = 0.00300413 (0.989761s/it)
i = 23 t = 0.0535938 dt = 0.00306807 (0.99909s/it)
i = 24 t = 0.0567043 dt = 0.0031105 (0.989536s/it)
i = 25 t = 0.0598233 dt = 0.00311892 (0.996134s/it)
Time spent in iteration: 24.3848
Correctness: sum(rh) difference = 1.45519e-11
sum(vx) = -0.242582
sum(vy) = -0.295116
sum(vz) = -0.335474
sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm, please wait...
i486: 0 i586: 0 ppro: 135 mmx: 0 sse: 0 sse2: 0 sse3: 2
tramp3d-v4-pentiumm will run on Pentium IV (pentium4) w/ SSE3 or higher processor.


[-O2 -march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2 -march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe -Dleafify=flatten tramp3d-v4.cpp -o tramp3d-v4-pentiumm-sse
98.40user 0.94system 1:39.15elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3major+198438minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-sse -n 25 --cartvis 1.0 0.0 --rhomin 1e-8
Using using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
solving eeq
time increments from [0, 1.79769e+308], cfl 0.5
starting at t = 0, i = 1
cell physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]
face physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]
periodic boundaries in X Y Z
i = 1 t = 0.00209225 dt = 0.00209225 (0.0617449s/it)
i = 2 t = 0.00410537 dt = 0.00201312 (0.897831s/it)
i = 3 t = 0.00603889 dt = 0.00193352 (0.964484s/it)
i = 4 t = 0.00794139 dt = 0.00190251 (0.94189s/it)
i = 5 t = 0.00984636 dt = 0.00190497 (0.972172s/it)
i = 6 t = 0.0117508 dt = 0.00190449 (0.973818s/it)
i = 7 t = 0.013681 dt = 0.00193011 (0.984364s/it)
i = 8 t = 0.0156598 dt = 0.0019788 (0.988743s/it)
i = 9 t = 0.0176706 dt = 0.00201081 (0.996885s/it)
i = 10 t = 0.0197364 dt = 0.0020658 (0.997118s/it)
i = 11 t = 0.0218716 dt = 0.0021352 (1.00016s/it)
i = 12 t = 0.0240721 dt = 0.00220057 (0.99685s/it)
i = 13 t = 0.0263471 dt = 0.002275 (0.998231s/it)
i = 14 t = 0.0287159 dt = 0.00236875 (1.00025s/it)
i = 15 t = 0.0311533 dt = 0.00243738 (0.987068s/it)
i = 16 t = 0.0336768 dt = 0.0025235 (0.981898s/it)
i = 17 t = 0.0362863 dt = 0.00260952 (0.990963s/it)
i = 18 t = 0.0389715 dt = 0.00268521 (0.986071s/it)
i = 19 t = 0.0417381 dt = 0.00276665 (0.980461s/it)
i = 20 t = 0.0445873 dt = 0.00284919 (0.982345s/it)
i = 21 t = 0.0475216 dt = 0.0029343 (1.00055s/it)
i = 22 t = 0.0505258 dt = 0.00300413 (0.995297s/it)
i = 23 t = 0.0535938 dt = 0.00306807 (1.00189s/it)
i = 24 t = 0.0567043 dt = 0.0031105 (1.00527s/it)
i = 25 t = 0.0598233 dt = 0.00311892 (1.01299s/it)
Time spent in iteration: 23.6994
Correctness: sum(rh) difference = 1.28966e-08
sum(vx) = -0.242582
sum(vy) = -0.295116
sum(vz) = -0.335474
sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-sse
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm-sse, please wait...
i486: 0 i586: 0 ppro: 84 mmx: 44 sse: 0 sse2: 3089 sse3: 0
tramp3d-v4-pentiumm-sse will run on Pentium IV (pentium4) or higher processor.


Keep in mind that anything that does strip-flags (ie. GCC, glibc, kernel, etc.) will remove both -msse3 and -mfpmath from your C[XX]FLAGS

Very little difference in runtimes, maybe half a second, and next to no difference in compile time. Surprisingly, -O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe was the slowest. I reran the test to be sure and it was slightly worse (24.5397s) than the original run. It also appears -mfpmath=sse does not generate sse3 instructions.
----------

I also tested "-O2 -march=prescott -mfpmath=sse -fomit-frame-pointer -pipe". I forgot to record the results, but the times were a small improvement over "-O2 -march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe", somewhere btwn 22.8 and 23.2s if I remember correctly.

--de.
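
To put the four recorded "Time spent in iteration" totals side by side, here is a minimal Python sketch. The dictionary and the function name are made up for illustration; only the numbers come from the runs above. It covers the recorded totals only (the prescott + -mfpmath=sse run is left out because its result was not recorded), and it says nothing about run-to-run noise, which the 24.5397s rerun suggests is at least around 0.15s.

```python
# "Time spent in iteration" totals (seconds), copied from the four runs above.
# The keys are just labels for the -march/-m flags that differed between builds.
times = {
    "-march=prescott": 23.9913,
    "-march=pentium-m": 23.9751,
    "-march=pentium-m -msse3": 24.3848,
    "-march=pentium-m -msse3 -mfpmath=sse": 23.6994,
}


def relative_to_fastest(results):
    """Return {label: percent slower than the fastest entry} (hypothetical helper)."""
    best = min(results.values())
    return {label: 100.0 * (t - best) / best for label, t in results.items()}


slowdown = relative_to_fastest(times)

# Print fastest first, with absolute time and slowdown relative to the fastest.
for label in sorted(times, key=times.get):
    print(f"{times[label]:8.4f}s  +{slowdown[label]:5.2f}%  {label}")
```

With single runs and a spread this small, the ordering of the middle entries should probably not be read as meaningful.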