On Wed, 2007-01-10 at 15:30 -0800, Ian Lance Taylor wrote:
> > Lines 41..45 all deal with trying to figure out if the low-order 2 bits
> > of effective_addr2 are zero. All I can say is, WTF? I can get around
> > this one by casting effective_addr2 to U32 and then testl/jne is
> > emitted, but I shouldn't have to do this??
>
> This is most likely fixed by the lower-subreg patch I have been
> working on. Latest version here:
>
> http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01858.html

I pulled the source from subversion today. There is no improvement in
the emitted assembler or in my benchmarks. However, with your subreg
patch above applied against today's source, my benchmarks show a 10%
improvement!!

The emitted code looks (mostly) sane: no reloading of registers with a
value that is already in a register. It seems that once a certain level
of complexity is reached, the quality of the emitted assembler goes
down, and your subreg patch is enough to keep that complexity below the
threshold.

The particular routine, after preprocessing, looks like:

void (__attribute__ (( regparm(2) )) z900_load) (BYTE inst[], REGS *regs)
{
    int r1;
    int b2;
    U64 effective_addr2;

    U32 temp = fetch_fw(inst);
    r1 = (temp >> 20) & 0xf;
    b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if(b2)
        effective_addr2 += (regs)->gr[((b2))].D;   // U64
    b2 = (temp >> 12) & 0xf;
    if(b2)
        effective_addr2 += regs->gr[b2].D;         // U64
    effective_addr2 &= regs->psw.amask.D;          // U64

    regs->ip += 4;
    regs->psw.ilc = 4;

    regs->gr[r1].F.L.F = z900_vfetch4(effective_addr2, b2, regs);  // U32
}

fetch_fw is inlined and looks like:

static __inline__ U32 fetch_fw(volatile void *ptr)
{
    return (__extension__ ({
        register unsigned int __v, __x = (*(U32 *)ptr);
        if (__builtin_constant_p (__x))
            __v = ((((__x) & 0xff000000) >> 24)
                 | (((__x) & 0x00ff0000) >>  8)
                 | (((__x) & 0x0000ff00) <<  8)
                 | (((__x) & 0x000000ff) << 24));
        else
            __asm__ ("bswap %0" : "=r" (__v) : "0" (__x));
        __v;
    }));
}

The value is not a constant here, so the bswap form is generated.
z900_vfetch4 is also inlined, but it is too complicated to include
here; the first thing it does is check whether effective_addr2 is on a
4-byte boundary.

So the i686 assembler looks like:

z900_load:
        pushl   %ebp
        movl    %edx, %ebp
        pushl   %edi
        xorl    %edi, %edi
        pushl   %esi
        subl    $80, %esp
        movl    (%eax), %eax
        movl    %eax, 36(%esp)
#APP
        bswap %eax
#NO_APP
        movl    %eax, %ecx
        movl    %eax, 36(%esp)
        movl    36(%esp), %eax
        shrl    $16, %ecx
        movl    %eax, %esi
        movl    %ecx, %eax
        andl    $4095, %esi
        andl    $15, %eax
        je      .L5434
        addl    80(%edx,%eax,8), %esi
        adcl    84(%edx,%eax,8), %edi
.L5434:
        movl    36(%esp), %eax
        shrl    $12, %eax
        andl    $15, %eax
        movl    %eax, 52(%esp)
        je      .L5436
        addl    80(%ebp,%eax,8), %esi
        adcl    84(%ebp,%eax,8), %edi
.L5436:
        movl    32(%ebp), %edx
        addl    $4, 44(%ebp)
        andl    %esi, %edx
        movl    %edx, 56(%esp)
        movl    36(%ebp), %eax
        andl    %edi, %eax
        movl    %eax, 60(%esp)
        movb    $4, 42(%ebp)
        movl    56(%esp), %edi
        testl   $3, %edi
        jne     .L5476

which is a vast improvement over the original assembler I reported
(where %edx, the second parameter, was constantly being reloaded).

I am still disappointed in the code surrounding fetch_fw:

        movl    %eax, 36(%esp)
        bswap %eax
        movl    %eax, %ecx
        movl    %eax, 36(%esp)
        movl    36(%esp), %eax

That is: the unswapped value of %eax is stored to 36(%esp), %eax is
byte-swapped, the swapped value is stored back to the same slot (which
makes the first store dead, since the slot is never read in between),
and then %eax is reloaded from 36(%esp) even though it already holds
that value.

Greg Smith
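
P.S. For anyone who wants to look at the quoted testl/jne issue in
isolation, here is a minimal sketch of the alignment test. This is not
the emulator source; the typedef and function name are made up for the
example:

    /* Sketch only: does a 64-bit effective address sit on a 4-byte
     * boundary?  Only the low word is relevant to bits 0..1, so on
     * i686 this should compile to a single testl $3 against the low
     * half, without needing a hand-written cast down to 32 bits.
     */
    typedef unsigned long long U64;

    int is_fw_aligned(U64 effective_addr2)
    {
        return (effective_addr2 & 3) == 0;
    }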
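
P.P.S. And a cut-down fetch_fw, in case someone wants to chase the
store/reload around bswap without building the whole tree. This is only
a sketch (the volatile qualifier and the constant path are dropped, and
the caller is invented); it may take more register pressure than this
before the spill to the stack slot shows up:

    typedef unsigned int U32;

    /* Runtime path of fetch_fw only: load a word and byte-swap it
     * in a register.
     */
    static __inline__ U32 fetch_fw(void *ptr)
    {
        register unsigned int __v, __x = *(U32 *)ptr;
        __asm__ ("bswap %0" : "=r" (__v) : "0" (__x));
        return __v;
    }

    /* Mimic z900_load: keep the swapped word live across several
     * field extractions, which appears to be what forces the spill.
     */
    U32 extract_fields(U32 *p, U32 *out)
    {
        U32 temp = fetch_fw(p);
        out[0] = (temp >> 20) & 0xf;
        out[1] = (temp >> 16) & 0xf;
        out[2] = (temp >> 12) & 0xf;
        return temp & 0xfff;
    }

Ideally the swapped value would be stored once (if it must be spilled
at all) and never reloaded while it is still sitting in a register.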