Ian,

Thank you for the answer.

I found a discrepancy in my benchmarks: even though MOV and AND have identical latency (according to the Intel manual), the MOV that I had used in place of AND was delaying the process, so I put AND back as in the original code. Unfortunately the Intel Optimization Manual does not list all instructions for Sandy Bridge; I could only compare them based on the 0F_3H and 0F_2H processor entries (I don't know which processors those are).

> Since the differences are subtle you should make sure you are using
> the correct -mtune option for the machine on which you are running
> your benchmarks.

I wasn't using -mtune before, so this time I did.

** Note that these interleavings occur when using the fastcall attribute together with -fschedule-insns2, as well as with other instruction-reordering options.

Well, the first benchmark, with the interleaving (prologue + parameter + prologue + parameter + prologue), was erratic (see the times below), probably because of the pipeline bubble.

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1 -fschedule-insns2    (without -mtune)

(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>:     push   %ebp
   0x0804868d <+1>:     mov    $0x3,%edx
   0x08048692 <+6>:     mov    %esp,%ebp
   0x08048694 <+8>:     mov    $0x2,%ecx
   0x08048699 <+13>:    and    $0xfffffff0,%esp
   0x0804869c <+16>:    call   0x804845c <funcao>

uzumaki@hb:~$ gcc benchmark0.c -o benchmark0 -m32 -O3 -mtune=native
uzumaki@hb:~$ for i in {0..9}; do ./benchmark0; done
Total elapsed: 48.9600000000000008526512829121202230453491210937500000
Average: 1.6320000000000001172395514004165306687355041503906250
Total elapsed: 49.0200000000000031263880373444408178329467773437500000
Average: 1.6340000000000001190159082398167811334133148193359375
Total elapsed: 48.9600000000000008526512829121202230453491210937500000
Average: 1.6320000000000001172395514004165306687355041503906250
Total elapsed: 50.5200000000000031263880373444408178329467773437500000
Average: 1.6840000000000001634248292248230427503585815429687500
Total elapsed: 50.5000000000000000000000000000000000000000000000000000
Average: 1.6833333333333333481363069950020872056484222412109375
Total elapsed: 49.0000000000000000000000000000000000000000000000000000
Average: 1.6333333333333333037273860099958255887031555175781250
Total elapsed: 48.9600000000000008526512829121202230453491210937500000
Average: 1.6320000000000001172395514004165306687355041503906250
Total elapsed: 49.0200000000000031263880373444408178329467773437500000
Average: 1.6340000000000001190159082398167811334133148193359375
Total elapsed: 50.5799999999999982946974341757595539093017578125000000
Average: 1.6859999999999999431565811391919851303100585937500000
Total elapsed: 50.5600000000000022737367544323205947875976562500000000
Average: 1.6853333333333333499126638344023376703262329101562500
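For reference, the relevant part reduces to something like the sketch below. It is a minimal reduction, not the actual structs.c; the function name funcao and the constants 2 and 3 come from the disassembly above, the body is a placeholder, and noinline is only there so the call survives in the cut-down example.

/* Minimal reduction of the call pattern -- not the actual structs.c.
   noinline keeps the call in place in this cut-down example. */
__attribute__((fastcall, noinline))
int funcao(int a, int b)        /* fastcall: a arrives in %ecx, b in %edx */
{
    return a + b;               /* placeholder body */
}

int main(void)
{
    return funcao(2, 3);        /* %ecx = 2, %edx = 3, as in the dump */
}

Building a reduction like this with gcc -m32 -O1 -fschedule-insns2 versus -fno-schedule-insns2 should show the two orderings of the prologue and the argument moves, although the exact schedule will of course depend on the real function body.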
The second run used the logical sequence, prologue first and parameters later, which generated a more stable result.

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1    (or: -O3 -fno-schedule-insns2 -mtune=native)

(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>:     push   %ebp
   0x0804868d <+1>:     mov    %esp,%ebp
   0x0804868f <+3>:     and    $0xfffffff0,%esp
   0x08048692 <+6>:     mov    $0x3,%edx
   0x08048697 <+11>:    mov    $0x2,%ecx
   0x0804869c <+16>:    call   0x804845c <funcao>

uzumaki@hb:~$ gcc benchmark1.c -o benchmark1 -m32 -O3 -mtune=native
uzumaki@hb:~$ for i in {0..9}; do ./benchmark1; done
Total elapsed: 49.0000000000000000000000000000000000000000000000000000
Average: 1.6333333333333333037273860099958255887031555175781250
Total elapsed: 49.2999999999999971578290569595992565155029296875000000
Average: 1.6433333333333333126091702069970779120922088623046875
Total elapsed: 49.0000000000000000000000000000000000000000000000000000
Average: 1.6333333333333333037273860099958255887031555175781250
Total elapsed: 49.0200000000000031263880373444408178329467773437500000
Average: 1.6340000000000001190159082398167811334133148193359375
Total elapsed: 49.0799999999999982946974341757595539093017578125000000
Average: 1.6359999999999998987476601541857235133647918701171875
Total elapsed: 49.2000000000000028421709430404007434844970703125000000
Average: 1.6400000000000001243449787580175325274467468261718750
Total elapsed: 49.2199999999999988631316227838397026062011718750000000
Average: 1.6406666666666667175888960628071799874305725097656250
Total elapsed: 49.9600000000000008526512829121202230453491210937500000
Average: 1.6653333333333333321490954403998330235481262207031250
Total elapsed: 49.8599999999999994315658113919198513031005859375000000
Average: 1.6619999999999999218402990663889795541763305664062500
Total elapsed: 49.9200000000000017053025658242404460906982421875000000
Average: 1.6640000000000001456612608308205381035804748535156250
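For context, the timing harness in the benchmarkN.c files has roughly the shape sketched below. It is heavily simplified: the clock_gettime timing source, the inner workload and this funcao body are placeholders, not the real measured code; only the "Total elapsed" / "Average" output and the 30-run average (the total/average ratio in the numbers above) match the real runs.

/* Heavily simplified sketch of the benchmarkN.c harness.  The timing
   source, the inner workload and this funcao body are placeholders;
   only the output shape and the 30-run average match the real files. */
#include <stdio.h>
#include <time.h>

#define RUNS 30                 /* total / average in the output is 30 */

static volatile int dummy;      /* volatile side effect keeps the calls */

__attribute__((fastcall, noinline))
int funcao(int a, int b)
{
    dummy = a;
    return a + b;
}

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    double total = 0.0;
    int run;
    long i;

    for (run = 0; run < RUNS; run++) {
        double start = now();
        for (i = 0; i < 100000000L; i++)    /* placeholder workload */
            funcao(2, 3);
        total += now() - start;
    }

    printf("Total elapsed: %.52f\n", total);
    printf("Average: %.52f\n", total / RUNS);
    return 0;
}

A sketch like this builds with the same kind of command line as above (gcc -m32 -O3 -mtune=native), plus -lrt on an older glibc for clock_gettime.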
For the third run I added -mtune=native, which generated yet another interleaving (prologue + all parameters + prologue). This one was the most optimized, as you can see in the benchmark.

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1 -fschedule-insns2 -mtune=native    (or: -O3 -mtune=native)

(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>:     push   %ebp
   0x0804868d <+1>:     mov    $0x3,%edx
   0x08048692 <+6>:     mov    $0x2,%ecx
   0x08048697 <+11>:    mov    %esp,%ebp
   0x08048699 <+13>:    and    $0xfffffff0,%esp
   0x0804869c <+16>:    call   0x804845c <funcao>

uzumaki@hb:~$ gcc benchmark2.c -o benchmark2 -m32 -O3 -mtune=native
uzumaki@hb:~$ for i in {0..9}; do ./benchmark2; done
Total elapsed: 49.0000000000000000000000000000000000000000000000000000
Average: 1.6333333333333333037273860099958255887031555175781250
Total elapsed: 48.9200000000000017053025658242404460906982421875000000
Average: 1.6306666666666667087071118658059276640415191650390625
Total elapsed: 48.8800000000000025579538487363606691360473632812500000
Average: 1.6293333333333335222192772562266327440738677978515625
Total elapsed: 48.8999999999999985789145284797996282577514648437500000
Average: 1.6299999999999998934185896359849721193313598632812500
Total elapsed: 48.8599999999999994315658113919198513031005859375000000
Average: 1.6286666666666667069307550264056771993637084960937500
Total elapsed: 48.8800000000000025579538487363606691360473632812500000
Average: 1.6293333333333335222192772562266327440738677978515625
Total elapsed: 48.9399999999999977262632455676794052124023437500000000
Average: 1.6313333333333333019510291705955751240253448486328125
Total elapsed: 48.8400000000000034106051316484808921813964843750000000
Average: 1.6280000000000001136868377216160297393798828125000000
Total elapsed: 48.8999999999999985789145284797996282577514648437500000
Average: 1.6299999999999998934185896359849721193313598632812500
Total elapsed: 48.7400000000000019895196601282805204391479492187500000
Average: 1.6246666666666667033780413476051762700080871582031250

As you said, with heuristics the best result is not always achieved. Still, I believe I have identified a conflict of priorities when ordering instructions, caused by the fastcall attribute combined with the optimization option, with or without -mtune. It may just be me, since the tests were few, especially because with a little tuning I could generate more optimized code (last benchmark). If you still see it as a missed-optimization bug, tell me and I'll report it.

See you.

Geyslan Gregório Bem
hackingbits.com
@geyslangb
br.linkedin.com/in/geyslan

2013/6/19 Ian Lance Taylor <iant@xxxxxxxxxx>:
> On Wed, Jun 19, 2013 at 8:21 AM, Geyslan Gregório Bem <geyslan@xxxxxxxxx> wrote:
>>
>> About technicalities of the option and whether reordering is not really
>> optimizing as demonstrated in the benchmark.
>
> Every complex optimization has cases where it will actually make code
> worse. All optimizations rely on heuristics, and those heuristics
> sometimes fail. You have very likely found such a case, which is why
> I suggested filing a missed-optimization bug report. It may be
> possible to fix it; I don't know.
>
> Since the differences are subtle you should make sure you are using
> the correct -mtune option for the machine on which you are running
> your benchmarks.
>
> Ian