Ian,

Thank you for the answer.

I found a discrepancy in my benchmarks: even though MOV and AND have identical latency (according to the Intel manual), the MOV that I had used in place of AND was delaying the process, so I put AND back as in the original code. Unfortunately the Intel Optimization Manual does not list all instructions for Sandy Bridge; I could only compare them based on the 0F_3H and 0F_2H processor entries (I don't know which processors those are).

> Since the differences are subtle you should make sure you are using
> the correct -mtune option for the machine on which you are running
> your benchmarks.

I wasn't using -mtune before, so this time I did.

** Note that these interleavings occur when using the fastcall attribute together with -fschedule-insns2, as well as with other instruction-reordering options.

Well, the first benchmark, with the interleaving (prologue + parameter + prologue + parameter + prologue), was erratic (see the times below), probably because of the pipeline bubble.

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1 -fschedule-insns2    (without -mtune)

(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>:     push   %ebp
   0x0804868d <+1>:     mov    $0x3,%edx
   0x08048692 <+6>:     mov    %esp,%ebp
   0x08048694 <+8>:     mov    $0x2,%ecx
   0x08048699 <+13>:    and    $0xfffffff0,%esp
   0x0804869c <+16>:    call   0x804845c <funcao>

uzumaki@hb:~$ gcc benchmark0.c -o benchmark0 -m32 -O3 -mtune=native
uzumaki@hb:~$ for i in {0..9}; do ./benchmark0; done
Total elapsed: 48.9600000000000008526512829121202230453491210937500000
Average: 1.6320000000000001172395514004165306687355041503906250
Total elapsed: 49.0200000000000031263880373444408178329467773437500000
Average: 1.6340000000000001190159082398167811334133148193359375
Total elapsed: 48.9600000000000008526512829121202230453491210937500000
Average: 1.6320000000000001172395514004165306687355041503906250
Total elapsed: 50.5200000000000031263880373444408178329467773437500000
Average: 1.6840000000000001634248292248230427503585815429687500
Total elapsed: 50.5000000000000000000000000000000000000000000000000000
Average: 1.6833333333333333481363069950020872056484222412109375
Total elapsed: 49.0000000000000000000000000000000000000000000000000000
Average: 1.6333333333333333037273860099958255887031555175781250
Total elapsed: 48.9600000000000008526512829121202230453491210937500000
Average: 1.6320000000000001172395514004165306687355041503906250
Total elapsed: 49.0200000000000031263880373444408178329467773437500000
Average: 1.6340000000000001190159082398167811334133148193359375
Total elapsed: 50.5799999999999982946974341757595539093017578125000000
Average: 1.6859999999999999431565811391919851303100585937500000
Total elapsed: 50.5600000000000022737367544323205947875976562500000000
Average: 1.6853333333333333499126638344023376703262329101562500
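For reference, the relevant part reduces to something like the sketch below. It is a minimal reduction, not the actual structs.c; the function name funcao and the constants 2 and 3 come from the disassembly above, the body is a placeholder, and noinline is only there so the call survives in the cut-down example.

/* Minimal reduction of the call pattern -- not the actual structs.c.
   noinline keeps the call in place in this cut-down example. */
__attribute__((fastcall, noinline))
int funcao(int a, int b)        /* fastcall: a arrives in %ecx, b in %edx */
{
    return a + b;               /* placeholder body */
}

int main(void)
{
    return funcao(2, 3);        /* %ecx = 2, %edx = 3, as in the dump */
}

Building a reduction like this with gcc -m32 -O1 -fschedule-insns2 versus -fno-schedule-insns2 should show the two orderings of the prologue and the argument moves, although the exact schedule will of course depend on the real function body.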
The second run used the logical sequence, prologue first and parameters later, which generated a more stable result.

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1    (or: -O3 -fno-schedule-insns2 -mtune=native)

(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>:     push   %ebp
   0x0804868d <+1>:     mov    %esp,%ebp
   0x0804868f <+3>:     and    $0xfffffff0,%esp
   0x08048692 <+6>:     mov    $0x3,%edx
   0x08048697 <+11>:    mov    $0x2,%ecx
   0x0804869c <+16>:    call   0x804845c <funcao>

uzumaki@hb:~$ gcc benchmark1.c -o benchmark1 -m32 -O3 -mtune=native
uzumaki@hb:~$ for i in {0..9}; do ./benchmark1; done
Total elapsed: 49.0000000000000000000000000000000000000000000000000000
Average: 1.6333333333333333037273860099958255887031555175781250
Total elapsed: 49.2999999999999971578290569595992565155029296875000000
Average: 1.6433333333333333126091702069970779120922088623046875
Total elapsed: 49.0000000000000000000000000000000000000000000000000000
Average: 1.6333333333333333037273860099958255887031555175781250
Total elapsed: 49.0200000000000031263880373444408178329467773437500000
Average: 1.6340000000000001190159082398167811334133148193359375
Total elapsed: 49.0799999999999982946974341757595539093017578125000000
Average: 1.6359999999999998987476601541857235133647918701171875
Total elapsed: 49.2000000000000028421709430404007434844970703125000000
Average: 1.6400000000000001243449787580175325274467468261718750
Total elapsed: 49.2199999999999988631316227838397026062011718750000000
Average: 1.6406666666666667175888960628071799874305725097656250
Total elapsed: 49.9600000000000008526512829121202230453491210937500000
Average: 1.6653333333333333321490954403998330235481262207031250
Total elapsed: 49.8599999999999994315658113919198513031005859375000000
Average: 1.6619999999999999218402990663889795541763305664062500
Total elapsed: 49.9200000000000017053025658242404460906982421875000000
Average: 1.6640000000000001456612608308205381035804748535156250
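For context, the timing harness in the benchmarkN.c files has roughly the shape sketched below. It is heavily simplified: the clock_gettime timing source, the inner workload and this funcao body are placeholders, not the real measured code; only the "Total elapsed" / "Average" output and the 30-run average (the total/average ratio in the numbers above) match the real runs.

/* Heavily simplified sketch of the benchmarkN.c harness.  The timing
   source, the inner workload and this funcao body are placeholders;
   only the output shape and the 30-run average match the real files. */
#include <stdio.h>
#include <time.h>

#define RUNS 30                 /* total / average in the output is 30 */

static volatile int dummy;      /* volatile side effect keeps the calls */

__attribute__((fastcall, noinline))
int funcao(int a, int b)
{
    dummy = a;
    return a + b;
}

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    double total = 0.0;
    int run;
    long i;

    for (run = 0; run < RUNS; run++) {
        double start = now();
        for (i = 0; i < 100000000L; i++)    /* placeholder workload */
            funcao(2, 3);
        total += now() - start;
    }

    printf("Total elapsed: %.52f\n", total);
    printf("Average: %.52f\n", total / RUNS);
    return 0;
}

A sketch like this builds with the same kind of command line as above (gcc -m32 -O3 -mtune=native), plus -lrt on an older glibc for clock_gettime.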
For the third run I added -mtune=native, which generated yet another interleaving (prologue + all parameters + prologue). This one was the most optimized, as you can see in the benchmark.

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1 -fschedule-insns2 -mtune=native    (or: -O3 -mtune=native)

(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>:     push   %ebp
   0x0804868d <+1>:     mov    $0x3,%edx
   0x08048692 <+6>:     mov    $0x2,%ecx
   0x08048697 <+11>:    mov    %esp,%ebp
   0x08048699 <+13>:    and    $0xfffffff0,%esp
   0x0804869c <+16>:    call   0x804845c <funcao>

uzumaki@hb:~$ gcc benchmark2.c -o benchmark2 -m32 -O3 -mtune=native
uzumaki@hb:~$ for i in {0..9}; do ./benchmark2; done
Total elapsed: 49.0000000000000000000000000000000000000000000000000000
Average: 1.6333333333333333037273860099958255887031555175781250
Total elapsed: 48.9200000000000017053025658242404460906982421875000000
Average: 1.6306666666666667087071118658059276640415191650390625
Total elapsed: 48.8800000000000025579538487363606691360473632812500000
Average: 1.6293333333333335222192772562266327440738677978515625
Total elapsed: 48.8999999999999985789145284797996282577514648437500000
Average: 1.6299999999999998934185896359849721193313598632812500
Total elapsed: 48.8599999999999994315658113919198513031005859375000000
Average: 1.6286666666666667069307550264056771993637084960937500
Total elapsed: 48.8800000000000025579538487363606691360473632812500000
Average: 1.6293333333333335222192772562266327440738677978515625
Total elapsed: 48.9399999999999977262632455676794052124023437500000000
Average: 1.6313333333333333019510291705955751240253448486328125
Total elapsed: 48.8400000000000034106051316484808921813964843750000000
Average: 1.6280000000000001136868377216160297393798828125000000
Total elapsed: 48.8999999999999985789145284797996282577514648437500000
Average: 1.6299999999999998934185896359849721193313598632812500
Total elapsed: 48.7400000000000019895196601282805204391479492187500000
Average: 1.6246666666666667033780413476051762700080871582031250

As you said, with heuristics the best result is not always achieved. Still, I believe I have identified a conflict of priorities when ordering instructions, caused by the fastcall attribute combined with the optimization option, with or without -mtune. It may just be me, since the tests were few, especially because with a little tuning I could generate more optimized code (last benchmark). If you still see it as a missed-optimization bug, tell me and I'll report it.

See you.

Geyslan Gregório Bem
hackingbits.com
@geyslangb
br.linkedin.com/in/geyslan

2013/6/19 Ian Lance Taylor <iant@xxxxxxxxxx>:
> On Wed, Jun 19, 2013 at 8:21 AM, Geyslan Gregório Bem <geyslan@xxxxxxxxx> wrote:
>>
>> About technicalities of the option and whether reordering is not really
>> optimizing as demonstrated in the benchmark.
>
> Every complex optimization has cases where it will actually make code
> worse. All optimizations rely on heuristics, and those heuristics
> sometimes fail. You have very likely found such a case, which is why
> I suggested filing a missed-optimization bug report. It may be
> possible to fix it; I don't know.
>
> Since the differences are subtle you should make sure you are using
> the correct -mtune option for the machine on which you are running
> your benchmarks.
>
> Ian