Fwd: Function Parameters vs. Prologue (fastcall + O1 + -fschedule-insns2)

Geyslan Gregório Bem <geyslan@xxxxxxxxx> · Wed, 19 Jun 2013 11:41:09 -0300

Ian,

I would like to know what causes the reordering, which happens only
when I use the fastcall attribute (parameters passed in registers).

The benchmarks:

To reproduce the assembly block below, use "gcc structs.c -o structs
-m32 -O1 -fschedule-insns2" (other options also cause this instruction
interleaving, so I used -O1 to isolate it).

(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>: push   %ebp
   0x0804868d <+1>: mov    $0x3,%edx
   0x08048692 <+6>: mov    %esp,%ebp
   0x08048694 <+8>: mov    $0x2,%ecx
   0x08048699 <+13>: and    $0xfffffff0,%esp
   0x0804869c <+16>: call   0x804845c <funcao>

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1
uzumaki@hb:~$ gdb -q structs
Reading symbols from /home/uzumaki/structs...(no debugging symbols
found)...done.
(gdb) disas main
Dump of assembler code for function main:
   0x0804868c <+0>: push   %ebp
   0x0804868d <+1>: mov    %esp,%ebp
   0x0804868f <+3>: and    $0xfffffff0,%esp
   0x08048692 <+6>: mov    $0x3,%edx
   0x08048697 <+11>: mov    $0x2,%ecx
   0x0804869c <+16>: call   0x804845c <funcao>

The block above is with the fastcall attribute on the function, but
without the additional scheduling pass.

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O1 -fschedule-insns2
uzumaki@hb:~$ gdb -q structs
Reading symbols from /home/uzumaki/structs...(no debugging symbols
found)...done.
(gdb) disas main
Dump of assembler code for function main:
   0x08048694 <+0>: push   %ebp
   0x08048695 <+1>: mov    %esp,%ebp
   0x08048697 <+3>: and    $0xfffffff0,%esp
   0x0804869a <+6>: sub    $0x10,%esp
   0x0804869d <+9>: movl   $0x3,0x4(%esp)
   0x080486a5 <+17>: movl   $0x2,(%esp)
   0x080486ac <+24>: call   0x804845c <funcao>

And this last one uses the additional scheduling pass, but without the
fastcall attribute (parameters passed on the stack).

The curious thing is that this interleaving only happens when the
scheduling option and the fastcall attribute are used together.
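
A minimal sketch of the kind of function involved, for anyone who
wants to reproduce the dumps. This is not the actual structs.c: only
the name funcao and the constants 2 and 3 (seen in %ecx and %edx) come
from the dumps; the FASTCALL and NOINLINE macros, the parameter names,
and the function body are assumptions for illustration.

```c
/* fastcall only has an effect on 32-bit x86; elsewhere it expands to nothing. */
#if defined(__GNUC__) && defined(__i386__)
#define FASTCALL __attribute__((fastcall))
#else
#define FASTCALL
#endif

/* Keep the call visible in the disassembly instead of letting it be inlined. */
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/*
 * Hypothetical stand-in for funcao. With fastcall on 32-bit x86 the
 * first integer argument travels in %ecx and the second in %edx,
 * so funcao(2, 3) matches the "mov $0x2,%ecx" / "mov $0x3,%edx" pair.
 */
FASTCALL NOINLINE int funcao(int a, int b)
{
        return a * 10 + b;
}
```

A caller that simply does funcao(2, 3) from main, built with the
command lines shown above, is the sort of program these dumps would
come from.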

So let's see whether this interleaving actually optimizes the pipeline
(forgive me for the clumsy benchmark; I appreciate any suggestions for
improvement).

Benchmark 1 - Assembly in interleaved order

#include <stdio.h>
#include <time.h>

double get_time() {

        return (double) clock() / CLOCKS_PER_SEC;
}

int main() {

        double start, end, average = 0.0; /* accumulator must start at zero */
        int loop1, loop2;

        for (loop1 = 0; loop1 < 30; loop1++) {

                start = get_time();

                for (loop2 = 0; loop2 < 1000000000; loop2++) {
                        __asm__ ("push %ebp\n\t"
                                 "mov $0x3, %edx\n\t"
                                 "mov %esp, %ebp\n\t"
                                 "mov $0x2, %ecx\n\t"
                                 // %esp replaced by %ebp so that the code
                                 // continues without error (AND and MOV
                                 // latencies are equivalent)
                                 //"and $0xfffffff0, %esp\n\t"
                                 "mov $0xfffffff0, %ebp\n\t"

                                 "pop %ebp");
                }

                end = get_time();
                average += (end - start);

        }

        printf("%.52f\n", average / 30);

        return 0;

}

Benchmark 2 - Assembly in logical order

...
                for (loop2 = 0; loop2 < 1000000000; loop2++) {
                        __asm__ ("push %ebp\n\t"
                                 "mov %esp, %ebp\n\t"
                                 // %esp replaced by %ebp so that the code
                                 // continues without error (AND and MOV
                                 // latencies are equivalent)
                                 //"and $0xfffffff0, %esp\n\t"
                                 "mov $0xfffffff0, %ebp\n\t"

                                 "mov $0x3, %edx\n\t"
                                 "mov $0x2, %ecx\n\t"

                                 "pop %ebp");
                }
...

Results obtained in equivalent circumstances:

Benchmark 1 (interleaved order)
1.7033333333333333658998753890045918524265289306640625

Benchmark 2 (logical order)
1.6986666666666667691032444054144434630870819091796875
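
To put the two averages in perspective, their relative difference can
be computed as sketched below. The helper name relative_diff_percent
is hypothetical; the inputs are the two averages printed above. For
these numbers it comes out to roughly 0.27% in favour of the logical
order.

```c
#include <assert.h>

/* Relative difference of `measured` with respect to `baseline`, in percent. */
double relative_diff_percent(double measured, double baseline)
{
        assert(baseline != 0.0);
        return (measured - baseline) / baseline * 100.0;
}
```

For example, relative_diff_percent(1.7033333333333334,
1.6986666666666668) compares Benchmark 1 against Benchmark 2.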

uzumaki@hb:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz
stepping : 7
microcode : 0x23
cpu MHz : 2201.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2
ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts
dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4390.23
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

The equivalent snippets were executed 30 billion times in each
benchmark (30 runs of one billion iterations); each run was timed in
seconds and the 30 timings were averaged. In every test I performed,
the code in logical order was faster, though only very slightly (about
0.005 s per billion iterations, roughly 0.3%).
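
Since the gap is this small, one possible improvement to the benchmark
is to record the spread of the individual runs instead of only their
average. The sketch below is an assumption about how that could look,
not the code that produced the numbers above: struct run_stats,
time_runs, cpu_seconds and dummy_work are hypothetical names, and
dummy_work is only a placeholder for the inline assembly loops. It
uses the same clock() source as the benchmarks.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

struct run_stats {
        double min, max, mean; /* seconds per run */
};

/* CPU time in seconds, from the same clock() call the benchmarks use. */
static double cpu_seconds(void)
{
        return (double) clock() / CLOCKS_PER_SEC;
}

/* Times `runs` calls of work(iterations) and summarises the timings. */
struct run_stats time_runs(void (*work)(unsigned long),
                           unsigned long iterations, size_t runs)
{
        struct run_stats s = { 0.0, 0.0, 0.0 };
        double total = 0.0;
        size_t i;

        assert(work != NULL && runs > 0);

        for (i = 0; i < runs; i++) {
                double start = cpu_seconds();
                work(iterations);
                double elapsed = cpu_seconds() - start;

                if (i == 0 || elapsed < s.min)
                        s.min = elapsed;
                if (i == 0 || elapsed > s.max)
                        s.max = elapsed;
                total += elapsed;
        }

        s.mean = total / (double) runs;
        return s;
}

/* Placeholder workload; the instruction sequence under test would go here. */
void dummy_work(unsigned long iterations)
{
        volatile unsigned long sink = 0;
        unsigned long i;

        for (i = 0; i < iterations; i++)
                sink += i;
}
```

Calling time_runs once per variant, for example time_runs(dummy_work,
1000000000UL, 30), gives a min/max range for each. If the two ranges
overlap, a difference of a fraction of a percent between the means may
well be measurement noise.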

It seems to me that the -fschedule-insns2 option is indeed problematic; see:
https://groups.google.com/forum/?fromgroups#!topic/gnu.gcc.help/hZ5hArJ3VSU

Thus, with that option disabled, the code is generated in logical order
and still optimized (here with -O3, the fastcall attribute, and the
additional scheduling pass disabled).

uzumaki@hb:~$ gcc structs.c -o structs -m32 -O3 -fno-schedule-insns2
uzumaki@hb:~$ gdb -q structs
Reading symbols from /home/uzumaki/structs...(no debugging symbols
found)...done.
(gdb) disas main
Dump of assembler code for function main:
   0x08048370 <+0>: push   %ebp
   0x08048371 <+1>: mov    %esp,%ebp
   0x08048373 <+3>: and    $0xfffffff0,%esp
   0x08048376 <+6>: mov    $0x3,%edx
   0x0804837b <+11>: mov    $0x2,%ecx
   0x08048380 <+16>: call   0x8048470 <funcao>
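
If disabling the pass for the whole translation unit is too broad,
GCC's optimize function attribute may be a way to turn it off for a
single function. This is a sketch I have not checked against the dumps
above; NO_SCHED2, callee and caller_without_sched2 are hypothetical
names, and the GCC manual describes this attribute as intended for
debugging rather than production use. The macro is guarded because
other compilers do not implement the attribute.

```c
/* optimize("no-schedule-insns2") is treated like -fno-schedule-insns2
   for the one function it is attached to (GCC only). */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SCHED2 __attribute__((optimize("no-schedule-insns2")))
#else
#define NO_SCHED2
#endif

/* Hypothetical callee standing in for funcao. */
static int callee(int a, int b)
{
        return a + b;
}

/* Only this function is compiled without the second scheduling pass. */
NO_SCHED2 int caller_without_sched2(void)
{
        return callee(2, 3);
}
```

Comparing the disassembly of such a function with and without the
attribute would show whether the prologue is still interleaved with
the argument setup.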

Draw your conclusions! Awaiting feedback.

Geyslan Gregório Bem
hackingbits.com
@geyslangb
br.linkedin.com/in/geyslan

2013/6/19 Ian Lance Taylor <iant@xxxxxxxxxx>:
> On Wed, Jun 19, 2013 at 6:23 AM, Geyslan Gregório Bem <geyslan@xxxxxxxxx> wrote:
>>
>> I know that -fschedule-insns2 reorders the instructions to avoid
>> pipeline bubbles, but I ran a benchmark in which the code was faster
>> without this reordering.
>>
>> What do you have to tell me?
>
> File a missed-optimization bug report?
>
> I'm not really sure what you are asking.
>
> Ian