On Wed, 2007-01-10 at 15:30 -0800, Ian Lance Taylor wrote:
> > Lines 41..45 all deal with trying to figure out if the low-order 2 bits
> > of effective_addr2 are zero. All I can say is, WTF? I can get around
> > this one by casting effective_addr2 to U32 and then testl/jne is
> > emitted, but I shouldn't have to do this??
>
> This is most likely fixed by the lower-subreg patch I have been
> working on. Latest version here:
>
> http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01858.html

I pulled the source from subversion today. There is no improvement in
the emitted assembler or in my benchmarks. However, with your subreg
patch above applied against today's source, my benchmarks show a 10%
improvement!!

The emitted code looks (mostly) sane: no reloading of registers with a
value that is already in a register. It seems that once a certain level
of complexity is reached, the quality of the emitted assembler goes
down, and your subreg patch is enough to keep that complexity below the
threshold.

The particular routine, after preprocessing, looks like:

void (__attribute__ (( regparm(2) )) z900_load) (BYTE inst[], REGS *regs)
{
    int r1;
    int b2;
    U64 effective_addr2;

    U32 temp = fetch_fw(inst);
    r1 = (temp >> 20) & 0xf;
    b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if(b2)
        effective_addr2 += (regs)->gr[((b2))].D;   // U64
    b2 = (temp >> 12) & 0xf;
    if(b2)
        effective_addr2 += regs->gr[b2].D;         // U64
    effective_addr2 &= regs->psw.amask.D;          // U64

    regs->ip += 4;
    regs->psw.ilc = 4;

    regs->gr[r1].F.L.F = z900_vfetch4(effective_addr2, b2, regs);  // U32
}

fetch_fw is inlined and looks like:

static __inline__ U32 fetch_fw(volatile void *ptr)
{
    return (__extension__ ({
        register unsigned int __v, __x = (*(U32 *)ptr);
        if (__builtin_constant_p (__x))
            __v = ((((__x) & 0xff000000) >> 24)
                 | (((__x) & 0x00ff0000) >>  8)
                 | (((__x) & 0x0000ff00) <<  8)
                 | (((__x) & 0x000000ff) << 24));
        else
            __asm__ ("bswap %0" : "=r" (__v) : "0" (__x));
        __v;
    }));
}

The value is not a constant here, so the bswap form is generated.
z900_vfetch4 is also inlined, but it is too complicated to include
here; the first thing it does is check whether effective_addr2 is on a
4-byte boundary.

So the i686 assembler looks like:

z900_load:
        pushl   %ebp
        movl    %edx, %ebp
        pushl   %edi
        xorl    %edi, %edi
        pushl   %esi
        subl    $80, %esp
        movl    (%eax), %eax
        movl    %eax, 36(%esp)
#APP
        bswap %eax
#NO_APP
        movl    %eax, %ecx
        movl    %eax, 36(%esp)
        movl    36(%esp), %eax
        shrl    $16, %ecx
        movl    %eax, %esi
        movl    %ecx, %eax
        andl    $4095, %esi
        andl    $15, %eax
        je      .L5434
        addl    80(%edx,%eax,8), %esi
        adcl    84(%edx,%eax,8), %edi
.L5434:
        movl    36(%esp), %eax
        shrl    $12, %eax
        andl    $15, %eax
        movl    %eax, 52(%esp)
        je      .L5436
        addl    80(%ebp,%eax,8), %esi
        adcl    84(%ebp,%eax,8), %edi
.L5436:
        movl    32(%ebp), %edx
        addl    $4, 44(%ebp)
        andl    %esi, %edx
        movl    %edx, 56(%esp)
        movl    36(%ebp), %eax
        andl    %edi, %eax
        movl    %eax, 60(%esp)
        movb    $4, 42(%ebp)
        movl    56(%esp), %edi
        testl   $3, %edi
        jne     .L5476

which is a vast improvement over the original assembler I reported
(where %edx, the second parameter, was constantly being reloaded).

I am still disappointed in the code surrounding fetch_fw:

        movl    %eax, 36(%esp)
        bswap %eax
        movl    %eax, %ecx
        movl    %eax, 36(%esp)
        movl    36(%esp), %eax

That is: the unswapped value of %eax is stored to 36(%esp), %eax is
byte-swapped, the swapped value is stored back to the same slot (which
makes the first store dead, since the slot is never read in between),
and then %eax is reloaded from 36(%esp) even though it already holds
that value.

Greg Smith
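
P.S. For anyone who wants to look at the quoted testl/jne issue in
isolation, here is a minimal sketch of the alignment test. This is not
the emulator source; the typedef and function name are made up for the
example:

    /* Sketch only: does a 64-bit effective address sit on a 4-byte
     * boundary?  Only the low word is relevant to bits 0..1, so on
     * i686 this should compile to a single testl $3 against the low
     * half, without needing a hand-written cast down to 32 bits.
     */
    typedef unsigned long long U64;

    int is_fw_aligned(U64 effective_addr2)
    {
        return (effective_addr2 & 3) == 0;
    }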
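
P.P.S. And a cut-down fetch_fw, in case someone wants to chase the
store/reload around bswap without building the whole tree. This is only
a sketch (the volatile qualifier and the constant path are dropped, and
the caller is invented); it may take more register pressure than this
before the spill to the stack slot shows up:

    typedef unsigned int U32;

    /* Runtime path of fetch_fw only: load a word and byte-swap it
     * in a register.
     */
    static __inline__ U32 fetch_fw(void *ptr)
    {
        register unsigned int __v, __x = *(U32 *)ptr;
        __asm__ ("bswap %0" : "=r" (__v) : "0" (__x));
        return __v;
    }

    /* Mimic z900_load: keep the swapped word live across several
     * field extractions, which appears to be what forces the spill.
     */
    U32 extract_fields(U32 *p, U32 *out)
    {
        U32 temp = fetch_fw(p);
        out[0] = (temp >> 20) & 0xf;
        out[1] = (temp >> 16) & 0xf;
        out[2] = (temp >> 12) & 0xf;
        return temp & 0xfff;
    }

Ideally the swapped value would be stored once (if it must be spilled
at all) and never reloaded while it is still sitting in a register.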