Re: Advice about using SIMD extensions

Brian Budge <brian.budge@xxxxxxxxx> · Mon, 28 Feb 2005 11:46:33 +0100

Hi Richard -

Take this with a grain of salt, as I am also not so hot with the SSE
stuff, but have you tried the prefetch instructions provided in
xmmintrin.h?
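
For what it's worth, here is a minimal sketch of what that could look like
on a 1D float image. The function name add_images, the PREFETCH wrapper and
the PREFETCH_AHEAD distance are made up for illustration, and the 64-byte
cache line is an assumption. Only _mm_prefetch and _MM_HINT_T0 come from
xmmintrin.h; without SSE it falls back to gcc's __builtin_prefetch.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH(p) __builtin_prefetch(p)   /* gcc builtin fallback */
#endif

/* Hypothetical tuning knob: how many floats ahead of the current
   position to request. The best value has to be measured. */
#define PREFETCH_AHEAD 64

/* dst[i] = a[i] + b[i] over a 1D image buffer of n floats. */
void add_images(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        /* One hint per 16 floats (64 bytes, the assumed line size). */
        if ((i & 15) == 0 && i + PREFETCH_AHEAD < n) {
            PREFETCH(a + i + PREFETCH_AHEAD);
            PREFETCH(b + i + PREFETCH_AHEAD);
        }
        dst[i] = a[i] + b[i];
    }
}
```

Prefetches are only hints, so the result is the same with or without them.
Whether they help on a purely sequential pass like this is something to
measure rather than assume.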

  Brian

On Sun, 27 Feb 2005 11:24:09 +1100, Richard Beare
<Richard.Beare@xxxxxxxx> wrote:
> Hi everyone,
> 
> I've done a few more experiments using various pieces of advice that
> have come back from the list. They shed some light on the problems I've
> been having.
> 
> Before I summarize the results I should mention that my initial
> motivation for this was to improve arithmetic operations on images. Any
> changes I propose would need to be compatible with our imaging libraries
> in which an image is simply a 1D array.
> 
> The errors pointed out to me were:
> 
> 1)  Use -march=pentium4 instead of -mcpu=pentium4
> 
> 2) use
> typedef float myvec __attribute__ ((vector_size (16)));
> 
> instead of
> 
> typedef int myvec __attribute__ ((mode(V4SF)));
> 
> The latter is not compatible with newer versions of gcc (>3.4?).
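
A rough sketch of how the vector_size typedef might be applied to a plain
1D float image, using gcc's vector extension. The function name
add_images_v4 is invented here, and memcpy is used for the loads and stores
only so that the sketch makes no assumption about buffer alignment. On
32-bit x86 this would presumably want -msse or -march=pentium4.

```c
#include <stddef.h>
#include <string.h>

/* Four packed floats, declared as in the typedef quoted above. */
typedef float myvec __attribute__ ((vector_size (16)));

/* dst[i] = a[i] + b[i], four floats per step, scalar loop for the tail. */
void add_images_v4(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        myvec va, vb, vr;
        memcpy(&va, a + i, sizeof va);   /* alignment-agnostic load */
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;                    /* element-wise vector add */
        memcpy(dst + i, &vr, sizeof vr); /* alignment-agnostic store */
    }
    for (; i < n; i++)                   /* leftover 0..3 elements */
        dst[i] = a[i] + b[i];
}
```

If the image buffers are known to be 16-byte aligned, the memcpy calls
could be replaced by casts to myvec pointers, but that is an extra
assumption about the imaging library.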
> 
> These changes certainly improved the performance of the test I posted to
> the list.
> 
> However when I went back to test my image arithmetic code with these
> changes I found no difference.
> 
> I then did some more tests, which are summarized in the attached graph.
> These demonstrate, I think, that I was experiencing a cache problem with
> my image code. The images I was experimenting with were 1600x1300, so
> way too large to fit in cache.
> 
> I now need to do some thinking, and more advice would be appreciated.
> I'm going to experiment with oprofile to see what it tells me, but
> haven't done so yet.
> 
> I had always thought that accessing array elements in raster order
> should be cache neutral, but it doesn't seem to be the case. I'm not
> sure what governs the size of the data being loaded into the cache.
> 
> Can anything be done about it without changing underlying data
> structures in my code?
> 
> As an aside, can anyone recommend example macros for unrolling loops?
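
On the unrolling question, one possible shape for such a macro is sketched
below. UNROLL4_FOR and scale_image are names invented for this example, not
something from an existing library, and the statement passed in must not
contain a top-level comma. gcc's -funroll-loops flag is worth trying before
doing this by hand.

```c
#include <stddef.h>

/* Hypothetical helper: execute STMT for index I over [0, N),
   four iterations per pass, then a plain loop for the remainder. */
#define UNROLL4_FOR(I, N, STMT)                             \
    do {                                                    \
        size_t I = 0;                                       \
        const size_t unroll_n_ = (N);                       \
        for (; I + 4 <= unroll_n_; ) {                      \
            STMT; I++; STMT; I++; STMT; I++; STMT; I++;     \
        }                                                   \
        for (; I < unroll_n_; I++) { STMT; }                \
    } while (0)

/* Example use: multiply every pixel of a 1D image by gain. */
void scale_image(float *img, size_t n, float gain)
{
    UNROLL4_FOR(i, n, img[i] *= gain);
}
```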
> 
> Thanks very much.
> 
> Brian Budge wrote:
> > In the example above, it's not only register allocation, but also
> > scheduling.  The data needs to be loaded from memory, and how that
> > happens can affect performance quite a bit.
> >
> > And yeah, I can't understand how 8.1 could get decent performance
> > without instruction scheduling... but maybe I'm stuck in my own little
> > RISC processing world (the (toy) compilers I have written have been
> > for SPARC and MIPS), and I just don't understand enough about how the
> > pentium works.
> >
> >   Brian
> >
> >
> > On Fri, 25 Feb 2005 14:24:27 -0500, Daniel Berlin <dberlin@xxxxxxxxxxx> wrote:
> >
> >>On Fri, 2005-02-25 at 12:18 +0100, Brian Budge wrote:
> >>
> >>>Hmmm, I doubt that.  It seems very important that your data be in
> >>>registers when you want to do arithmetic on it.
> >>
> >>That's register allocation, not scheduling :)
> >>
> >>
> >>>I can see that if your data was already in registers, maybe a
> >>>"randomized" instruction ordering would perform okay, but loading the
> >>>data properly is time consuming.  At least these are the things I've
> >>>observed.
> >>>
> >>
> >>stevenb was the source of this information for me, so maybe he can
> >>confirm it (Steven, i mentioned to brian that icc 8.1 doesn't do
> >>scheduling for the pentium4 anymore, and he doubts it :P)
> >>
> >>
> 
> --
> Richard Beare, CSIRO Mathematical & Information Sciences
> Locked Bag 17, North Ryde, NSW 1670, Australia
> Phone: +61-2-93253221 (GMT+~10hrs)  Fax: +61-2-93253200
> 
> Richard.Beare@xxxxxxxx
> 
> 
>