restrict leaving byte copies unoptimized

Dan Dickerman <dan09@xxxxxxxxxxxxxxxxxx> · Wed, 17 Mar 2010 15:20:59 -0700

I'm trying to clean up some code which does not currently take advantage of
the possible optimizations of the __restrict__ qualifier. Using gcc with the
Xcode-bundled tools on OS X for PPC (gcc versions 4.0.1 and 4.2.1), I can see
the effect of strict aliasing if I use (4-byte) words to copy from one
place to another, but not when using byte pointers.

I've attached an oversimplified test case to demonstrate this issue. I can't
help but think I'm missing something obvious, but it basically boils down to
this: the setup starts with a pair of (unsigned char*) and declares restricted
pointers of the right type, which for words is:

#define SRC w_src
#define DST w_dst
        uint32       * __restrict__ w_dst = d;
        uint32 const * __restrict__ w_src = s;

It then copies through these pointers with a series of:
#define COPY_WORD(dst,src,i)    \
        *(dst+((i)/4)) = *(src+((i)/4));

This macro is called in an unrolled loop with i=0, 4, 8, 12...
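
For anyone reading without the attachment, the word case can be sketched as a
self-contained function. The name copy16_words, the fixed 16-byte size, the
explicit casts (which avoid the incompatible pointer type warnings), and
uint32_t from <stdint.h> standing in for the uint32 typedef are all assumptions
for illustration, not the contents of the test case:

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the macro above: i is a byte offset, divided down to a word index. */
#define COPY_WORD(dst, src, i) \
        *((dst) + ((i) / 4)) = *((src) + ((i) / 4));

/* Hypothetical stand-in for the test case: copies 16 bytes as four words.
 * Assumes d and s are 4-byte aligned, refer to uint32_t storage, and do not overlap. */
static void copy16_words(unsigned char *d, const unsigned char *s)
{
        assert((uintptr_t)d % 4 == 0 && (uintptr_t)s % 4 == 0);

        uint32_t       * __restrict__ w_dst = (uint32_t *)d;
        const uint32_t * __restrict__ w_src = (const uint32_t *)s;

        /* Unrolled by hand with i = 0, 4, 8, 12. */
        COPY_WORD(w_dst, w_src, 0)
        COPY_WORD(w_dst, w_src, 4)
        COPY_WORD(w_dst, w_src, 8)
        COPY_WORD(w_dst, w_src, 12)
}
```

The sketch only pins down the C-level shape of the copy; the instruction
schedule gcc picks for it depends on the compiler version and flags.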

With general optimizations on (-O3 -fgcse-sm) this produces code which gets 
flagged (by Shark) as producing stalls, as the register loads need to wait for 
memory before proceeding with the corresponding stores:
	...
        lwz r0,4(r2)
        stw r0,4(r9)
        lwz r11,8(r2)
        stw r11,8(r9)
        lwz r0,12(r2)
        stw r0,12(r9)
        lwz r11,16(r2)
        stw r11,16(r9)
	...

If the option -fstrict-aliasing is added to the list to take advantage of the 
__restrict__ qualifier, it re-orders the instructions as I expect, so the 
loads fill the registers before proceeding with the stores...
	...
        lwz r3,32(r2)
        lwz r29,36(r2)
        lwz r28,40(r2)
        lwz r27,44(r2)
        lwz r26,48(r2)
        lwz r25,52(r2)
        lwz r24,56(r2)
        lwz r23,60(r2)
        lwz r22,64(r2)
        stwx r11,r12,r21
        stw r10,4(r9)
        add r12,r12,r19
        stw r0,8(r9)
        stw r8,12(r9)
        stw r7,16(r9)
        stw r6,20(r9)
	...
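
To spell out why the compiler needs a no-overlap promise (which is what
__restrict__ gives it) before it may hoist the loads above the stores, here is
a small sketch with the two orderings written out by hand. The function names
and the four-word size are made up for illustration. Neither function uses
__restrict__, so both are well-defined even for overlapping buffers:

```c
#include <assert.h>
#include <stdint.h>

/* Interleaved order, like the first listing: load, store, load, store... */
static void copy4_interleaved(uint32_t *d, const uint32_t *s)
{
        assert(d && s);
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
        d[3] = s[3];
}

/* Loads-first order, like the second listing: all loads, then all stores. */
static void copy4_loads_first(uint32_t *d, const uint32_t *s)
{
        assert(d && s);
        uint32_t t0 = s[0], t1 = s[1], t2 = s[2], t3 = s[3];
        d[0] = t0;
        d[1] = t1;
        d[2] = t2;
        d[3] = t3;
}
```

With disjoint buffers the two give identical results. With d == s + 1 they do
not: on a buffer holding {1,2,3,4,5} the interleaved version smears the first
word forward and leaves {1,1,1,1,1}, while the loads-first version leaves
{1,1,2,3,4}. Unless it can prove the buffers are disjoint, the compiler has to
keep the interleaved order.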

When the code is changed to use byte pointers (which would seem the simpler
case, avoiding the incompatible pointer type warnings, for starters), the
loads/stores get padded with nops and -fstrict-aliasing appears to have no
effect. The C-level changes look like:

#define SRC b_src
#define DST b_dst
        uchar       * __restrict__ b_dst = d;
        uchar const * __restrict__ b_src = s;

#define COPY_WORD(dst,src,i)    \
        *(dst+(i  )) = *(src+(i  )); \
        *(dst+(i+1)) = *(src+(i+1)); \
        *(dst+(i+2)) = *(src+(i+2)); \
        *(dst+(i+3)) = *(src+(i+3));
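
Again as a self-contained sketch for readers without the attachment: the byte
case wrapped in a function. The name copy16_bytes, the fixed 16-byte size, and
the uchar typedef are assumptions for illustration; the macro arguments are
parenthesized here, which does not change what the macro does for these calls:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uchar;   /* assumed typedef; the original only shows the name */

/* Byte-wise version of COPY_WORD: four single-byte copies per "word". */
#define COPY_WORD(dst, src, i) \
        *((dst) + ((i)    )) = *((src) + ((i)    )); \
        *((dst) + ((i) + 1)) = *((src) + ((i) + 1)); \
        *((dst) + ((i) + 2)) = *((src) + ((i) + 2)); \
        *((dst) + ((i) + 3)) = *((src) + ((i) + 3));

/* Hypothetical stand-in for the test case: copies 16 bytes one byte at a time.
 * Assumes d and s do not overlap; no alignment requirement and no casts needed. */
static void copy16_bytes(uchar *d, const uchar *s)
{
        assert(d != NULL && s != NULL);

        uchar       * __restrict__ b_dst = d;
        const uchar * __restrict__ b_src = s;

        /* Unrolled by hand with i = 0, 4, 8, 12. */
        COPY_WORD(b_dst, b_src, 0)
        COPY_WORD(b_dst, b_src, 4)
        COPY_WORD(b_dst, b_src, 8)
        COPY_WORD(b_dst, b_src, 12)
}
```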

The compiled code, optimized as above, is:
	...
        lbz r0,1(r2)
        stb r0,1(r9)
        nop
        nop
        lbz r11,2(r2)
        stb r11,2(r9)
        nop
        nop
        lbz r0,3(r2)
        stb r0,3(r9)
        nop
        nop
        lbz r11,4(r2)
        stb r11,4(r9)
        nop
        nop
	...

Can someone suggest where the nops are coming from and how I can encourage gcc 
to re-order the loads/stores as is done for words to streamline this code a 
bit? As a related note, is Shark making a valid complaint about the stalls 
this code produces, or can I sleep comfortably knowing that the cache will 
simply make all these memory references into single-cycle instructions anyway?
--
  Dan Dickerman
Attachment: restrictMe.tgz (GNU Zip compressed data)