Ian,

I would like to know why the reordering happens only when I use the fastcall attribute (arguments passed in registers).

The benchmarks:

To reproduce the assembly block below, use "gcc structs.c -o structs -m32 -O1 -fschedule-insns2" (other options also cause this instruction interleaving, so I used the -O1 option).

(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>:     push   %ebp
   0x0804868d <+1>:     mov    $0x3,%edx
   0x08048692 <+6>:     mov    %esp,%ebp
   0x08048694 <+8>:     mov    $0x2,%ecx
   0x08048699 <+13>:    and    $0xfffffff0,%esp
   0x0804869c <+16>:    call   0x804845c <funcao>

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1
uzumaki@hb:~$ gdb -q structs
Reading symbols from /home/uzumaki/structs...(no debugging symbols found)...done.
(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>:     push   %ebp
   0x0804868d <+1>:     mov    %esp,%ebp
   0x0804868f <+3>:     and    $0xfffffff0,%esp
   0x08048692 <+6>:     mov    $0x3,%edx
   0x08048697 <+11>:    mov    $0x2,%ecx
   0x0804869c <+16>:    call   0x804845c <funcao>

The block above is with the fastcall attribute on the function, but without the additional scheduling pass.

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1 -fschedule-insns2
uzumaki@hb:~$ gdb -q structs
Reading symbols from /home/uzumaki/structs...(no debugging symbols found)...done.
(gdb) disas main
Dump of assembler code for function main:
   0x08048694 <+0>:     push   %ebp
   0x08048695 <+1>:     mov    %esp,%ebp
   0x08048697 <+3>:     and    $0xfffffff0,%esp
   0x0804869a <+6>:     sub    $0x10,%esp
   0x0804869d <+9>:     movl   $0x3,0x4(%esp)
   0x080486a5 <+17>:    movl   $0x2,(%esp)
   0x080486ac <+24>:    call   0x804845c <funcao>

This last one uses the additional scheduling pass but without the fastcall attribute (parameters passed on the stack).

The curious thing is that this interleaving happens only when the scheduling option and the fastcall attribute are used together.

So let's see whether this interleaving actually optimizes the pipeline (forgive me for the clumsy benchmark; I appreciate any suggestions for improvement).

Benchmark 1 - Assembly in interleaved order

#include <stdio.h>
#include <time.h>

double get_time(void)
{
    return (double) clock() / CLOCKS_PER_SEC;
}

int main(void)
{
    /* average must start at zero: it accumulates the per-run times */
    double start, end, average = 0.0;
    int loop1, loop2;

    for (loop1 = 0; loop1 < 30; loop1++) {
        start = get_time();
        for (loop2 = 0; loop2 < 1000000000; loop2++) {
            __asm__ ("push %ebp\n\t"
                     "mov $0x3, %edx\n\t"
                     "mov %esp, %ebp\n\t"
                     "mov $0x2, %ecx\n\t"
                     // AND on %esp replaced by a MOV to %ebp so the code keeps
                     // running without error (AND and MOV latencies are equivalent)
                     //"and $0xfffffff0, %esp\n\t"
                     "mov $0xfffffff0, %ebp\n\t"
                     "pop %ebp");
        }
        end = get_time();
        average += (end - start);
    }

    printf("%.52f\n", average / 30);
    return 0;
}

Benchmark 2 - Assembly in logical order

...
        for (loop2 = 0; loop2 < 1000000000; loop2++) {
            __asm__ ("push %ebp\n\t"
                     "mov %esp, %ebp\n\t"
                     // AND on %esp replaced by a MOV to %ebp so the code keeps
                     // running without error (AND and MOV latencies are equivalent)
                     //"and $0xfffffff0, %esp\n\t"
                     "mov $0xfffffff0, %ebp\n\t"
                     "mov $0x3, %edx\n\t"
                     "mov $0x2, %ecx\n\t"
                     "pop %ebp");
        }
...
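
One possible refinement of the harness above, offered only as a sketch: keep each of the 30 run times instead of just their sum, so the spread between runs can be compared with the gap between the two variants. The names wall_seconds, run_stats and summarize are hypothetical and not part of the original benchmark. clock_gettime with CLOCK_MONOTONIC is a POSIX call and measures wall-clock time, whereas clock() measures CPU time, so the two are not interchangeable.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* exposes clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper type: summary of per-run timings, in seconds. */
struct run_stats {
    double min;
    double max;
    double mean;
};

/* Monotonic wall-clock time in seconds (POSIX clock_gettime). */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

/* Compute minimum, maximum and mean of n samples; n must be at least 1. */
static struct run_stats summarize(const double *samples, size_t n)
{
    struct run_stats s;
    double sum = 0.0;
    size_t i;

    assert(n > 0);
    s.min = samples[0];
    s.max = samples[0];
    for (i = 0; i < n; i++) {
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
        sum += samples[i];
    }
    s.mean = sum / (double) n;
    return s;
}
```

In the benchmark, each run's (end - start) would be stored in an array of 30 doubles and passed to summarize once the outer loop finishes. If the min-to-max spread within one variant is larger than the gap between the two means, the difference is probably noise.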

Results obtained under equivalent conditions:

Benchmark 1 (interleaved order)
1.7033333333333333658998753890045918524265289306640625

Benchmark 2 (logical order)
1.6986666666666667691032444054144434630870819091796875

uzumaki@hb:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz
stepping : 7
microcode : 0x23
cpu MHz : 2201.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4390.23
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

The equivalent snippets were executed 30 billion times in each benchmark (30 runs of one billion iterations); the elapsed seconds of each run were recorded and then averaged.

In all tests performed, the code in logical order was faster, though only very slightly.

It seems to me that the -fschedule-insns2 option is indeed problematic; see:
https://groups.google.com/forum/?fromgroups#!topic/gnu.gcc.help/hZ5hArJ3VSU

Thus, with that option disabled, the code is generated in logical order and is still optimized (with -O3, the fastcall attribute, and the additional scheduling pass disabled).

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O3 -fno-schedule-insns2
uzumaki@hb:~$ gdb -q structs
Reading symbols from /home/uzumaki/structs...(no debugging symbols found)...done.
(gdb) disas main
Dump of assembler code for function main:
   0x08048370 <+0>:     push   %ebp
   0x08048371 <+1>:     mov    %esp,%ebp
   0x08048373 <+3>:     and    $0xfffffff0,%esp
   0x08048376 <+6>:     mov    $0x3,%edx
   0x0804837b <+11>:    mov    $0x2,%ecx
   0x08048380 <+16>:    call   0x8048470 <funcao>

Draw your own conclusions!

Awaiting feedback.

Geyslan Gregório Bem
hackingbits.com
@geyslangb
br.linkedin.com/in/geyslan

2013/6/19 Ian Lance Taylor <iant@xxxxxxxxxx>:
> On Wed, Jun 19, 2013 at 6:23 AM, Geyslan Gregório Bem <geyslan@xxxxxxxxx> wrote:
>>
>> I know that -fschedule-insns2 reorder the instructions to avoid
>> bubble, but I did a benchmark that resulted in a more fast code
>> without this reordering.
>>
>> What do you have to tell me?
>
> File a missed-optimization bug report?
>
> I'm not really sure what you are asking.
>
> Ian