Re: Non-optimal code generated for H8

David Brown <david@xxxxxxxxxxxxxxx> · Wed, 30 Oct 2019 10:06:57 +0100

On 30/10/2019 01:34, Segher Boessenkool wrote:
> On Tue, Oct 29, 2019 at 02:19:25PM -0600, Jeff Law wrote:
>> On 10/29/19 2:03 PM, Mikael Tillenius wrote:
>>> I am using a cross compiler for Renesas H8S. In a few places it
>>> generates really bad code. Given the following program:
>>>
>>> struct s {
>>>     char a, b;
>>>     char c[11];
>>> } x[2];
>>>
>>> void test(int n)
>>> {
>>>     struct s *sp = &x[n];
>>>
>>>     sp->a = 1;
>>>     sp->b = 1;
>>> }
> 
>> As we leave gimple the code looks like:
>>
>>   MEM <struct s[2]> [(struct s *)&x][n_1(D)].a = 1;
>>   MEM <struct s[2]> [(struct s *)&x][n_1(D)].b = 1;
>>
>> One might argue that DOM or FRE should have created a common
>> subexpression for the address arithmetic here.  Even so it's not bad.
>>
>> CSE doesn't do its job though.  There's clearly a REG_EQUAL note which
>> should have allowed it to at least clean up the redundant multiplication
>> for the address calculation.
> 
> And on other targets it does do its job fine, say riscv32, or m68k -O1
> (the -O1 to prevent the two stores from being optimised into one).
> 
> I haven't managed to find another target where multiplication by 13 is
> done with a libcall though.  Maybe I should look harder.
> 

I checked on the 8-bit AVR, which is (I think) the smallest and simplest
device targeted by gcc, and where only some devices have multiplication
instructions.  If you optimise for size (-Os), it uses a library call
__mulhi3 for the multiplication:

test:
        ldi r22,lo8(13)
        ldi r23,0
        rcall __mulhi3
        subi r24,lo8(-(x))
        sbci r25,hi8(-(x))
        ldi r18,lo8(1)
        mov r26,r24
        mov r27,r25
        st X,r18
        mov r30,r26
        mov r31,r27
        std Z+1,r18
        ret
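
For reference, the constant 13 loaded into r22:r23 is sizeof(struct s)
(two chars plus an 11-char array, with no padding expected), and the
subi/sbci pair adds the address of x by subtracting its negation.  A
rough C equivalent of the address arithmetic in that listing, as a
sketch only (test_explicit is a hypothetical name, not something the
compiler or the OP's code contains):

```c
#include <assert.h>
#include <stddef.h>

struct s {
    char a, b;
    char c[11];
};

struct s x[2];

/* Assumption: only char members, so no padding and an element size of 13. */
_Static_assert(sizeof(struct s) == 13, "struct s is expected to be 13 bytes");

/* Hypothetical hand-written equivalent of test(): one multiplication by
   the element size, then two byte stores relative to the same address. */
void test_explicit(int n)
{
    char *p = (char *)x + (size_t)n * sizeof(struct s);  /* x + 13*n */

    p[offsetof(struct s, a)] = 1;  /* store to [sp]   */
    p[offsetof(struct s, b)] = 1;  /* store to [sp+1] */
}
```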

With -O1 or -O2, it uses shifts and adds for the multiply:

test:
        mov r30,r24
        mov r31,r25
        lsl r30
        rol r31
        add r30,r24
        adc r31,r25
        lsl r30
        rol r31
        lsl r30
        rol r31
        add r24,r30
        adc r25,r31
        mov r30,r24
        mov r31,r25
        subi r30,lo8(-(x))
        sbci r31,hi8(-(x))
        ldi r24,lo8(1)
        st Z,r24
        std Z+1,r24
        ret
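
Reading the lsl/rol and add/adc pairs in that listing, the multiply by
13 appears to be decomposed as ((n*2 + n) * 4) + n.  A small C sketch
of the same sequence, assuming 16-bit unsigned arithmetic to match the
register pairs (mul13_shift_add is a hypothetical name used only for
illustration):

```c
#include <assert.h>
#include <stdint.h>

/* 13*n computed as ((n*2 + n) * 4) + n, one step per instruction pair. */
uint16_t mul13_shift_add(uint16_t n)
{
    uint16_t t = n;            /* mov r30,r24 / mov r31,r25          */

    t = (uint16_t)(t << 1);    /* lsl/rol:            t = 2n         */
    t = (uint16_t)(t + n);     /* add/adc:            t = 3n         */
    t = (uint16_t)(t << 2);    /* lsl/rol twice:      t = 12n        */
    return (uint16_t)(t + n);  /* add/adc into r24:   result = 13n   */
}
```

The result wraps modulo 65536, as the 16-bit register pair would.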

(I used an old version of avr-gcc here, version 5.4.0, just because it
is on the extremely useful <https://godbolt.org> online compiler site.
Things may have changed for later versions, but usually the AVR port is
fairly stable.)

One thing to note here is that with -O1, the compiler calculates "sp"
directly in the "Z" pointer register, then stores 1 into [sp] and
[sp+1].  With -Os and the library-call multiplication, sp is calculated
in a non-pointer register pair (r24:r25) and is then copied first to X
and then to Z for the two stores, which is sub-optimal.  If that is
still the case for modern gcc, it could be filed as a missed
optimisation bug for the AVR backend.

But this also brings up another idea.  Is the OP using "-Os"
optimisation?  My experience (especially with AVR, msp430, and ARM
Cortex-M targets) is that "-Os" optimisation is often quite poor
compared to "-O2".  It can result in very significantly slower code for
a saving of a couple of bytes, and in some cases the code can be
significantly /bigger/ than with -O2.  I don't know whether this is a
backend issue, or a general problem with "-Os", but I no longer use or
recommend "-Os" even for tiny embedded systems.

So perhaps simply changing from "-Os" to "-O2" will fix the OP's problems.