I've attached an over-simplified test case to demonstrate this issue. I can't help but think I'm missing something obvious, but it basically boils down to this: the setup starts with a pair of (unsigned char*) and declares restricted pointers of the right type, which for words is:

    #define SRC w_src
    #define DST w_dst
    uint32 * __restrict__ w_dst = d;
    uint32 const * __restrict__ w_src = s;

It then copies through these pointers with a series of:

    #define COPY_WORD(dst,src,i) \
        *(dst+((i)/4)) = *(src+((i)/4));

This macro is called in an unrolled loop with i=0, 4, 8, 12... With general optimizations on (-O3 -fgcse-sm) this produces code which gets flagged (by Shark) as producing stalls, as the register loads need to wait for memory before proceeding with the corresponding stores:

    ...
    lwz r0,4(r2)
    stw r0,4(r9)
    lwz r11,8(r2)
    stw r11,8(r9)
    lwz r0,12(r2)
    stw r0,12(r9)
    lwz r11,16(r2)
    stw r11,16(r9)
    ...

If the option -fstrict-aliasing is added to the list to take advantage of the __restrict__ qualifier, it re-orders the instructions as I expect, so the loads fill the registers before proceeding with the stores:

    ...
    lwz r3,32(r2)
    lwz r29,36(r2)
    lwz r28,40(r2)
    lwz r27,44(r2)
    lwz r26,48(r2)
    lwz r25,52(r2)
    lwz r24,56(r2)
    lwz r23,60(r2)
    lwz r22,64(r2)
    stwx r11,r12,r21
    stw r10,4(r9)
    add r12,r12,r19
    stw r0,8(r9)
    stw r8,12(r9)
    stw r7,16(r9)
    stw r6,20(r9)
    ...

When the code is changed to use byte pointers (which would seem the simpler case, avoiding the incompatible pointer type warnings, for starters), the loads/stores get padded with nops and -fstrict-aliasing appears to have no effect. The C level changes look like:

    #define SRC b_src
    #define DST b_dst
    uchar * __restrict__ b_dst = d;
    uchar const * __restrict__ b_src = s;

    #define COPY_WORD(dst,src,i) \
        *(dst+(i  )) = *(src+(i  )); \
        *(dst+(i+1)) = *(src+(i+1)); \
        *(dst+(i+2)) = *(src+(i+2)); \
        *(dst+(i+3)) = *(src+(i+3));

The compiled code, optimized as above is:

    ...
    lbz r0,1(r2)
    stb r0,1(r9)
    nop
    nop
    lbz r11,2(r2)
    stb r11,2(r9)
    nop
    nop
    lbz r0,3(r2)
    stb r0,3(r9)
    nop
    nop
    lbz r11,4(r2)
    stb r11,4(r9)
    nop
    nop
    ...

Can someone suggest where the nops are coming from and how I can encourage gcc to re-order the loads/stores as is done for words to streamline this code a bit?

As a related note, is Shark making a valid complaint about the stalls this code produces, or can I sleep comfortably knowing that the cache will simply make all these memory references into single-cycle instructions anyway?
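
For anyone who wants to reproduce this without unpacking the attachment, here is a minimal, self-contained sketch of the two variants side by side. It is not the attached file: the function names (copy16_words, copy16_bytes, check_copy), the macro names COPY_WORD_W and COPY_WORD_B, the 16-byte block size, the explicit casts, and the typedefs for uint32 and uchar are all assumptions made for this sketch. The word variant also assumes the buffers are really arrays of 32-bit words, so the word accesses are aligned and do not violate the aliasing rules.

```c
#include <stdint.h>
#include <string.h>

/* Assumed typedefs; the real test case may define these differently. */
typedef uint32_t      uint32;
typedef unsigned char uchar;

/* Word variant: i is a byte offset, converted to a word index. */
#define COPY_WORD_W(dst,src,i) \
    *((dst)+((i)/4)) = *((src)+((i)/4));

/* Byte variant: four separate byte copies per 32-bit word. */
#define COPY_WORD_B(dst,src,i) \
    *((dst)+((i)  )) = *((src)+((i)  )); \
    *((dst)+((i)+1)) = *((src)+((i)+1)); \
    *((dst)+((i)+2)) = *((src)+((i)+2)); \
    *((dst)+((i)+3)) = *((src)+((i)+3));

/* Copies 16 bytes through restricted word pointers (unrolled by hand). */
void copy16_words(uchar *d, const uchar *s)
{
    uint32 * __restrict__ w_dst = (uint32 *)d;
    uint32 const * __restrict__ w_src = (uint32 const *)s;

    COPY_WORD_W(w_dst, w_src, 0)
    COPY_WORD_W(w_dst, w_src, 4)
    COPY_WORD_W(w_dst, w_src, 8)
    COPY_WORD_W(w_dst, w_src, 12)
}

/* Copies the same 16 bytes through restricted byte pointers. */
void copy16_bytes(uchar *d, const uchar *s)
{
    uchar * __restrict__ b_dst = d;
    uchar const * __restrict__ b_src = s;

    COPY_WORD_B(b_dst, b_src, 0)
    COPY_WORD_B(b_dst, b_src, 4)
    COPY_WORD_B(b_dst, b_src, 8)
    COPY_WORD_B(b_dst, b_src, 12)
}

/* Runs one of the copy routines on word-aligned buffers;
   returns 1 if the destination matches the source afterwards. */
int check_copy(void (*copy)(uchar *, const uchar *))
{
    uint32 src[4] = { 0x01020304u, 0x05060708u, 0x090a0b0cu, 0x0d0e0f10u };
    uint32 dst[4] = { 0, 0, 0, 0 };

    copy((uchar *)dst, (const uchar *)src);
    return memcmp(dst, src, sizeof src) == 0;
}
```

Compiling this to assembly with -S, once with -O3 -fgcse-sm and once with -fstrict-aliasing added, should make it possible to compare the generated loads and stores for the two functions, though the exact output will depend on the gcc version and target.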
-- Dan Dickerman
Attachment:
restrictMe.tgz
Description: GNU Zip compressed data