Hi Richard -

Take this with a grain of salt, as I am also not so hot with the SSE
stuff, but have you tried the prefetch instructions provided in
xmmintrin.h?

  Brian

On Sun, 27 Feb 2005 11:24:09 +1100, Richard Beare <Richard.Beare@xxxxxxxx> wrote:
> Hi everyone,
>
> I've done a few more experiments using various pieces of advice that
> have come back from the list. They shed some light on the problems I've
> been having.
>
> Before I summarize the results I should mention that my initial
> motivation for this was to improve arithmetic operations on images. Any
> changes I propose would need to be compatible with our imaging libraries
> in which an image is simply a 1D array.
>
> The errors pointed out to me were:
>
> 1) Use -march=pentium4 instead of -mcpu=pentium4
>
> 2) use
> typedef float myvec __attribute__ ((vector_size (16)));
>
> instead of
>
> typedef int myvec __attribute__ ((mode(V4SF)));
>
> The latter is not compatible with newer versions of gcc (>3.4?).
>
> These changes certainly improved the performance of the test I posted to
> the list.
>
> However when I went back to test my image arithmetic code with these
> changes I found no difference.
>
> I then did some more tests which are summarized in the attached graph -
> These demonstrate, I think, that I was experiencing a cache problem with
> my image code. The images I was experimenting with were 1600x1300, so
> way too large to fit in cache.
>
> I now need to do some thinking, and more advice would be appreciated.
> I'm going to experiment with oprofile to see what it tells me, but
> haven't done so yet.
>
> I had always thought that accessing array elements in raster order
> should be cache neutral, but it doesn't seem to be the case. I'm not
> sure what governs the size of the data being loaded into the cache.
>
> Can anything be done about it without changing underlying data
> structures in my code?
>
> As an aside, can anyone recommend example macros for unrolling loops?
>
> Thanks very much.
>
> Brian Budge wrote:
> > In the example above, it's not only register allocation, but also
> > scheduling. The data needs to be loaded from memory, and how that
> > happens can affect performance quite a bit.
> >
> > And yeah, I can't understand how 8.1 could get decent performance
> > without instruction scheduling... but maybe I'm stuck in my own little
> > RISC processing world (the (toy) compilers I have written have been
> > for SPARC and MIPS), and I just don't understand enough about how the
> > pentium works.
> >
> > Brian
> >
> >
> > On Fri, 25 Feb 2005 14:24:27 -0500, Daniel Berlin <dberlin@xxxxxxxxxxx> wrote:
> >
> >>On Fri, 2005-02-25 at 12:18 +0100, Brian Budge wrote:
> >>
> >>>Hmmm, I doubt that. It seems very important that your data be in
> >>>registers when you want to do arithmetic on it.
> >>
> >>That's register allocation, not scheduling :)
> >>
> >>
> >>>I can see that if your data was already in registers, maybe a
> >>>"randomized" instruction ordering would perform okay, but loading the
> >>>data properly is time consuming. At least these are the things I've
> >>>observed.
> >>>
> >>
> >>stevenb was the source of this information for me, so maybe he can
> >>confirm it (Steven, i mentioned to brian that icc 8.1 doesn't do
> >>scheduling for the pentium4 anymore, and he doubts it :P)
> >>
> >>
>
> --
> Richard Beare, CSIRO Mathematical & Information Sciences
> Locked Bag 17, North Ryde, NSW 1670, Australia
> Phone: +61-2-93253221 (GMT+~10hrs) Fax: +61-2-93253200
>
> Richard.Beare@xxxxxxxx
>
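
P.S. To make the prefetch suggestion at the top of this message concrete,
here is a rough sketch of what I had in mind. It is untested on real
image data and is only an illustration, not a tuned implementation. The
names add_images, PREFETCH and PREFETCH_AHEAD are made up for this
example, and both the 64-byte cache line and the prefetch distance are
assumptions that would need measuring on the actual machine. The only
library call used is _mm_prefetch with the _MM_HINT_T0 hint from
xmmintrin.h; on targets without SSE the macro compiles to nothing.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
/* _mm_prefetch is only a hint to the CPU; it does not change results. */
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH(p) ((void)0)
#endif

/* Assumed values, to be tuned by measurement. */
#define FLOATS_PER_LINE 16   /* 64-byte cache line / 4-byte float */
#define PREFETCH_AHEAD  64   /* how many floats ahead to request  */

/* dst[i] = a[i] + b[i] over images stored as flat 1D arrays,
 * walked in raster order. */
void add_images(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* Once per assumed cache line, ask for data we will need soon.
         * The bounds check keeps the pointer inside the arrays. */
        if (i % FLOATS_PER_LINE == 0 && i + PREFETCH_AHEAD < n) {
            PREFETCH(a + i + PREFETCH_AHEAD);
            PREFETCH(b + i + PREFETCH_AHEAD);
        }
        dst[i] = a[i] + b[i];
    }
}
```

This keeps the image as a plain 1D array, so it should not require any
change to the underlying data structures. Whether it helps at all depends
on how well the hardware prefetcher already handles a sequential walk, so
it is worth timing both with and without the PREFETCH lines.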