Re: Optimisation puzzle

"Lawrence Crowl" <crowl@xxxxxxxxxx> · Mon, 19 Mar 2007 17:23:44 -0700

On 3/17/07, Erik <sigra@xxxxxxx> wrote:
Lawrence Crowl skrev:
> On 3/16/07, Erik <sigra@xxxxxxx> wrote:
>> Ian Lance Taylor skrev:
>> > Erik <sigra@xxxxxxx> writes:
>> >
>> >
>> >>  From man:puts I see that it is declared "int puts(const char
>> >> *)". This means that puts does not promise to leave its argument
>> >> unchanged. Therefore the caller must push the argument anew before
>> >> each call. If it had been declared "int puts(const char * const)"
>> >> instead, the push should be moved outside the loop. Unfortunately
>> this
>> >> does not seem to work. I tried with the following program:
>> >> void q(const unsigned int);
>> >> void f() {for (unsigned int x = 0; x != 10; x++) q(77);}
>> >>
>> >> and built it with "gcc -std=c99 -Os -Wall -Wextra -Werror -S":
>> >> .L2:
>> >>         subl    $12, %esp
>> >>         incl    %ebx
>> >>         pushl   $77
>> >>         call    q
>> >>         addl    $16, %esp
>> >>         cmpl    $10, %ebx
>> >>         jne     .L2
>> >>
>> >> As you can see, "pushl   $77" is still inside the loop even though q
>> >> promises to not change its argument. This must be a bug.
>> >
>> > This is not a bug.  const on an automatic variable in C is more
>> > advisory than anything else.  You are not permitted to change a const
>> > object, but you can cast its address to a non-const pointer.
>
> The "lost optimization" in this case has nothing to do with const
> versus non-const.  The issue is that the call deallocates the
> parameters. The pushl is allocating the argument.
OK, as I understand it, the pushl $77 does 2 things:
1. Copy the value 77.
2. Allocate space for it (by changing the stack pointer).

And in C, allocating the space must be done in each iteration because
the called function deallocates it. But copying the value 77 in each
iteration could be optimized away.

Well actually the caller deallocates it, but on the other side of the
basic-block boundary.  Look above for the adjustments to the stack
pointer.

But what about Ada? Does it have the same calling convention?
See the Ada loop:
.L5:
        pushl   $77
.LCFI3:
        call    _ada_q
        popl    %eax
        decl    %ebx
        jns     .L5

I don't know the Ada ABI, but this code indicates that it does follow
the same convention.  It would be a bit surprising if it did not because
that would make leveraging the system more difficult.

It looks like in Ada, the caller deallocates the parameter (popl). If
this is so, it would mean that both the push and the pop should be moved
out of the loop, something like:
        pushl   $77
.L5:
.LCFI3:
        call    _ada_q
        decl    %ebx
        jns     .L5
        popl    %eax

This would cut the loop down to only 3 instructions. Not bad compared to
the 7 instructions in the original C loop. But the C version should
probably be counted as having 8 instructions to make the comparison
accurate, since it has a hidden popl inside q, which is not in Q of the
Ada version. Is this correct?

Well, close.  Theoretically, the optimization can be done, at least in
C++.  (I'm not entirely familiar with the C rules here.)  The problem
is that changing the details around calls takes extra care if you want
the tool (like a debugger) to be able to follow along.  In this case, I
think you would need to show that the savings is a significant fraction
of the total cost.  You might be able to do that for static code size,
but you'll have a harder time for the dynamic instruction count,
which is what most compilers optimize.

--
Lawrence Crowl