Re: Advice about using SIMD extensions

Richard Beare <Richard.Beare@xxxxxxxx> · Sun, 27 Feb 2005 11:24:09 +1100

Hi everyone,

I've done a few more experiments using various pieces of advice that 
have come back from the list. They shed some light on the problems I've 
been having.

Before I summarize the results I should mention that my initial 
motivation for this was to improve arithmetic operations on images. Any 
changes I propose would need to be compatible with our imaging libraries 
in which an image is simply a 1D array.

The errors pointed out to me were:

1) Use -march=pentium4 instead of -mcpu=pentium4

2) Use
typedef float myvec __attribute__ ((vector_size (16)));

instead of

typedef int myvec __attribute__ ((mode(V4SF)));

The latter (the mode(V4SF) form) is not compatible with newer versions
of gcc (>3.4?).

These changes certainly improved the performance of the test I posted to 
the list.
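
For reference, a minimal sketch of how the vector_size typedef from
point 2 can be used on a plain 1D float array. The helper name
add_arrays and the requirement that n be a multiple of 4 are
assumptions made for this illustration; this is not the test that was
posted to the list. memcpy is used for the loads and stores so the
sketch does not depend on the arrays being 16-byte aligned.

```c
#include <assert.h>
#include <string.h>

/* Four packed floats in one 16-byte vector (GCC vector extension). */
typedef float myvec __attribute__ ((vector_size (16)));

/* Hypothetical helper: out[i] = a[i] + b[i], four elements per step.
   Assumption: n is a multiple of 4 (no scalar tail loop here). */
static void add_arrays(const float *a, const float *b, float *out, int n)
{
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        myvec va, vb, vr;
        /* memcpy avoids any alignment requirement on a, b and out. */
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;              /* element-wise add on the vector type */
        memcpy(out + i, &vr, sizeof vr);
    }
}
```

Calling add_arrays(a, b, out, 8) on two 8-element arrays fills out with
the element-wise sums. Whether the compiler emits SSE instructions for
the addition depends on the target flags, such as -march=pentium4 from
point 1.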

However when I went back to test my image arithmetic code with these 
changes I found no difference.

I then did some more tests, which are summarized in the attached graph.
These demonstrate, I think, that I was experiencing a cache problem with
my image code. The images I was experimenting with were 1600x1300, so
far too large to fit in cache.
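
To put a number on that, here is a small sketch of the working-set
arithmetic. The pixel size is not stated above, so bytes_per_pixel is a
parameter, and the function names and the cache size passed in are
assumptions for illustration only.

```c
#include <stddef.h>

/* Bytes occupied by one image stored as a flat 1D array. */
static size_t image_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Nonzero if `count` images of `bytes` each fit in a cache of
   `cache_bytes`. The cache size is supplied by the caller; it is an
   assumption, not something measured here. */
static int fits_in_cache(size_t bytes, size_t count, size_t cache_bytes)
{
    return bytes * count <= cache_bytes;
}
```

A 1600x1300 image is 2,080,000 pixels, so image_bytes(1600, 1300, 1)
gives 2,080,000 bytes and image_bytes(1600, 1300, 4) gives 8,320,000
bytes. Against an assumed 512 KB cache (524,288 bytes), not even a
single one-byte-per-pixel image fits, and an arithmetic operation
touches two inputs and one output.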

I now need to do some thinking, and more advice would be appreciated. 
I'm going to experiment with oprofile to see what it tells me, but 
haven't done so yet.

I had always thought that accessing array elements in raster order 
should be cache neutral, but that doesn't seem to be the case. I'm not 
sure what governs the size of the data being loaded into the cache.

Can anything be done about it without changing underlying data 
structures in my code?
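
One option that keeps the flat 1D array is to process it in chunks and
apply every arithmetic step to a chunk before moving on, so the chunk
is reused while it is still likely to be in cache. The sketch below is
only an illustration under assumptions: the function name, the
two-step operation, and the CHUNK value are all hypothetical, and the
approach only helps when several passes are made over the same data. A
single pass over arrays this large remains limited by memory bandwidth.

```c
#include <stddef.h>

/* Elements per chunk. A tunable assumption, not a measured value. */
enum { CHUNK = 4096 };

/* out[i] = (a[i] + b[i]) * scale, computed chunk by chunk over
   ordinary 1D arrays. */
static void add_then_scale(const float *a, const float *b, float *out,
                           size_t n, float scale)
{
    for (size_t start = 0; start < n; start += CHUNK) {
        size_t end = (start + CHUNK < n) ? start + CHUNK : n;

        /* Step 1, restricted to the current chunk. */
        for (size_t i = start; i < end; i++)
            out[i] = a[i] + b[i];

        /* Step 2 revisits the same chunk straight away instead of
           making a second pass over the whole image. */
        for (size_t i = start; i < end; i++)
            out[i] *= scale;
    }
}
```

The result is identical to running the two steps as separate
whole-image passes; only the order in which memory is visited changes.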

As an aside, can anyone recommend example macros for unrolling loops?
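
As a starting point for the unrolling question, here is a minimal
sketch of a 4x unrolling macro. The macro name UNROLL4 and the example
function sum_array are hypothetical, made up for this illustration. The
body argument must not contain a top-level comma, since the
preprocessor would split it into separate arguments. gcc can also
unroll loops itself with -funroll-loops, which may make a hand-written
macro unnecessary.

```c
#include <stddef.h>

/* Run `body` for every index from the current value of i up to n - 1,
   four iterations per trip through the main loop, then finish the
   remaining 0 to 3 iterations one at a time. `i` must be a modifiable
   integer variable that `body` uses as its index. */
#define UNROLL4(i, n, body)                 \
    do {                                    \
        while ((i) + 4 <= (n)) {            \
            body; (i)++;                    \
            body; (i)++;                    \
            body; (i)++;                    \
            body; (i)++;                    \
        }                                   \
        for (; (i) < (n); (i)++) {          \
            body;                           \
        }                                   \
    } while (0)

/* Example use: sum the elements of an int array. */
static long sum_array(const int *v, size_t n)
{
    long total = 0;
    size_t i = 0;
    UNROLL4(i, n, total += v[i]);
    return total;
}
```

Whether manual unrolling helps at all depends on the compiler and the
loop; it is worth timing both versions before keeping the macro.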

Thanks very much.

Brian Budge wrote:
In the example above, it's not only register allocation, but also
scheduling.  The data needs to be loaded from memory, and how that
happens can affect performance quite a bit.

And yeah, I can't understand how 8.1 could get decent performance
without instruction scheduling... but maybe I'm stuck in my own little
RISC processing world (the (toy) compilers I have written have been
for SPARC and MIPS), and I just don't understand enough about how the
pentium works.

  Brian

On Fri, 25 Feb 2005 14:24:27 -0500, Daniel Berlin <dberlin@xxxxxxxxxxx> wrote:

On Fri, 2005-02-25 at 12:18 +0100, Brian Budge wrote:

Hmmm, I doubt that.  It seems very important that your data be in
registers when you want to do arithmetic on it.

That's register allocation, not scheduling :)

I can see that if your data was already in registers, maybe a
"randomized" instruction ordering would perform okay, but loading the
data properly is time consuming.  At least these are the things I've
observed.

stevenb was the source of this information for me, so maybe he can
confirm it (Steven, i mentioned to brian that icc 8.1 doesn't do
scheduling for the pentium4 anymore, and he doubts it :P)

--
Richard Beare, CSIRO Mathematical & Information Sciences
Locked Bag 17, North Ryde, NSW 1670, Australia
Phone: +61-2-93253221 (GMT+~10hrs)  Fax: +61-2-93253200

Richard.Beare@xxxxxxxx
Attachment: relative_speed.pdf (Adobe PDF document)