restrict leaving byte copies unoptimized

I'm trying to clean up some code that does not currently take advantage of the optimizations made possible by the __restrict__ qualifier. Using gcc from the Xcode-bundled tools on OS X for PPC (gcc versions 4.0.1 and 4.2.1), I can see the effect of strict aliasing if I use (4-byte) words to copy from one place to another, but not when using byte pointers.

I've attached an over-simplified test case to demonstrate this issue. I can't help but think I'm missing something obvious, but it basically boils down to this: the setup starts with a pair of (unsigned char*) pointers, d and s, and declares restricted pointers of the right type, which for words is:

#define SRC w_src
#define DST w_dst
        uint32       * __restrict__ w_dst = d;
        uint32 const * __restrict__ w_src = s;

It then copies through these pointers with a series of:
#define COPY_WORD(dst,src,i)    \
        *(dst+((i)/4)) = *(src+((i)/4));

This macro is called in an unrolled loop with i=0, 4, 8, 12...
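
For readers without the attachment, the word-copy setup described above can be sketched roughly as follows. This is a minimal reconstruction, not the attached test case: the function name copy16_words, the uint32 typedef, the explicit casts, the 16-byte length, and the alignment check are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed typedef; the post does not show how uint32 is defined. */
typedef uint32_t uint32;

/* Same shape as the macro in the post, with the arguments parenthesized. */
#define COPY_WORD(dst, src, i) \
        *((dst) + ((i) / 4)) = *((src) + ((i) / 4));

/* Hypothetical wrapper: copies 16 bytes as four 32-bit words.
 * d and s must be 4-byte aligned and must not overlap. */
static void copy16_words(unsigned char *d, const unsigned char *s)
{
        /* Casts added here; without them gcc warns about
         * incompatible pointer types. */
        uint32       * __restrict__ w_dst = (uint32 *)d;
        uint32 const * __restrict__ w_src = (uint32 const *)s;

        assert(((uintptr_t)d % sizeof(uint32)) == 0);
        assert(((uintptr_t)s % sizeof(uint32)) == 0);

        /* Manually unrolled, i = 0, 4, 8, 12 */
        COPY_WORD(w_dst, w_src, 0)
        COPY_WORD(w_dst, w_src, 4)
        COPY_WORD(w_dst, w_src, 8)
        COPY_WORD(w_dst, w_src, 12)
}
```

In this sketch the buffers passed in should really be uint32 storage (for example a uint32 array viewed through an unsigned char pointer), so that the word accesses are both aligned and valid under the aliasing rules.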

With general optimizations on (-O3 -fgcse-sm), this produces code that Shark flags as producing stalls, since each store has to wait for the load just before it to return from memory:
	...
        lwz r0,4(r2)
        stw r0,4(r9)
        lwz r11,8(r2)
        stw r11,8(r9)
        lwz r0,12(r2)
        stw r0,12(r9)
        lwz r11,16(r2)
        stw r11,16(r9)
	...


If the option -fstrict-aliasing is added to the list to take advantage of the __restrict__ qualifier, gcc re-orders the instructions as I expect, so the loads fill the registers before the stores proceed:
	...
        lwz r3,32(r2)
        lwz r29,36(r2)
        lwz r28,40(r2)
        lwz r27,44(r2)
        lwz r26,48(r2)
        lwz r25,52(r2)
        lwz r24,56(r2)
        lwz r23,60(r2)
        lwz r22,64(r2)
        stwx r11,r12,r21
        stw r10,4(r9)
        add r12,r12,r19
        stw r0,8(r9)
        stw r8,12(r9)
        stw r7,16(r9)
        stw r6,20(r9)
	...

When the code is changed to use byte pointers (which would seem the simpler case, since it avoids the incompatible pointer type warnings, for starters), the loads/stores get padded with nops and -fstrict-aliasing appears to have no effect. The C-level changes look like:

#define SRC b_src
#define DST b_dst
        uchar       * __restrict__ b_dst = d;
        uchar const * __restrict__ b_src = s;

#define COPY_WORD(dst,src,i)    \
        *(dst+((i)  )) = *(src+((i)  )); \
        *(dst+((i)+1)) = *(src+((i)+1)); \
        *(dst+((i)+2)) = *(src+((i)+2)); \
        *(dst+((i)+3)) = *(src+((i)+3));
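
Put together, the byte-pointer variant might look like the sketch below. As before, this is an illustrative reconstruction rather than the attached test case: the function name copy16_bytes, the uchar typedef, the 16-byte length, and the NULL checks are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed typedef; the post does not show how uchar is defined. */
typedef unsigned char uchar;

/* Byte-wise version of the macro: four single-byte copies per "word". */
#define COPY_WORD(dst, src, i) \
        *((dst) + ((i)    )) = *((src) + ((i)    )); \
        *((dst) + ((i) + 1)) = *((src) + ((i) + 1)); \
        *((dst) + ((i) + 2)) = *((src) + ((i) + 2)); \
        *((dst) + ((i) + 3)) = *((src) + ((i) + 3));

/* Hypothetical wrapper: copies 16 bytes one byte at a time.
 * d and s must not overlap. No casts are needed in this case. */
static void copy16_bytes(unsigned char *d, const unsigned char *s)
{
        uchar       * __restrict__ b_dst = d;
        uchar const * __restrict__ b_src = s;

        assert(d != NULL && s != NULL);

        /* Manually unrolled, i = 0, 4, 8, 12 */
        COPY_WORD(b_dst, b_src, 0)
        COPY_WORD(b_dst, b_src, 4)
        COPY_WORD(b_dst, b_src, 8)
        COPY_WORD(b_dst, b_src, 12)
}
```

Only the pointer types and the macro body differ from the word case, which is what makes the difference in generated code surprising.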


The compiled code, optimized as above, is:
	...
        lbz r0,1(r2)
        stb r0,1(r9)
        nop
        nop
        lbz r11,2(r2)
        stb r11,2(r9)
        nop
        nop
        lbz r0,3(r2)
        stb r0,3(r9)
        nop
        nop
        lbz r11,4(r2)
        stb r11,4(r9)
        nop
        nop
	...

Can someone suggest where the nops are coming from and how I can encourage gcc to re-order the loads/stores as is done for words to streamline this code a bit? As a related note, is Shark making a valid complaint about the stalls this code produces, or can I sleep comfortably knowing that the cache will simply make all these memory references into single-cycle instructions anyway?
--
  Dan Dickerman

Attachment: restrictMe.tgz
Description: GNU Zip compressed data

