Re: Advice about using SIMD extensions

Hi everyone,

I've done a few more experiments using various pieces of advice that have come back from the list. They shed some light on the problems I've been having.

Before I summarize the results I should mention that my initial motivation for this was to improve arithmetic operations on images. Any changes I propose would need to be compatible with our imaging libraries in which an image is simply a 1D array.

The errors pointed out to me were:

1)  Use -march=pentium4 instead of -mcpu=pentium4

2) Use
typedef float myvec __attribute__ ((vector_size (16)));

instead of

typedef int myvec __attribute__ ((mode(V4SF)));

The latter is not compatible with newer versions of gcc (>3.4?).

These changes certainly improved the performance of the test I posted to the list.
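
For reference, here is a minimal sketch of how the vector_size typedef can be applied to image addition on a plain 1D array. It is not the test code posted to the list: the function name add_images and the use of memcpy for loads and stores (so that no 16-byte alignment of the arrays is assumed) are illustrative choices. It assumes gcc with SSE enabled, e.g. via -march=pentium4.

```c
#include <stddef.h>
#include <string.h>

/* Four packed floats (16 bytes), using the typedef form from point 2. */
typedef float myvec __attribute__ ((vector_size (16)));

/* Hypothetical helper: dst[i] = a[i] + b[i] for n float pixels.
   The images are plain 1D arrays, as in the imaging libraries. */
void add_images(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Vector part: four pixels per iteration. memcpy is used so the
       code does not assume the arrays are 16-byte aligned. */
    for (; i + 4 <= n; i += 4) {
        myvec va, vb, vr;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;               /* element-wise add on the vector type */
        memcpy(dst + i, &vr, sizeof vr);
    }

    /* Scalar tail for the remaining n % 4 pixels. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

If the arrays are known to be 16-byte aligned, the memcpy calls could be replaced by direct loads through a myvec pointer, but that is an extra assumption about the allocator.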

However when I went back to test my image arithmetic code with these changes I found no difference.

I then did some more tests, which are summarized in the attached graph. I think they demonstrate that I was experiencing a cache problem with my image code: the images I was experimenting with were 1600x1300, so way too large to fit in cache.
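
To put rough numbers on "too large to fit in cache", here is a small sketch of the footprint arithmetic. The pixel size (4-byte float) and the cache size (512 KiB) are assumptions chosen for illustration, not measurements, and the function names are hypothetical.

```c
#include <stddef.h>

/* Bytes occupied by one width x height image with the given pixel size. */
size_t image_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* How many whole cache-sized chunks the working set spans (rounded down). */
size_t cache_multiples(size_t working_set_bytes, size_t cache_bytes)
{
    return working_set_bytes / cache_bytes;
}
```

Under those assumptions, image_bytes(1600, 1300, 4) is 8,320,000 bytes, about 15 times the assumed cache. A binary operation touches two sources and one destination, so the working set is roughly three times that, and every pixel has to come from main memory on each pass.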

I now need to do some thinking, and more advice would be appreciated. I'm going to experiment with oprofile to see what it tells me, but haven't done so yet.

I had always thought that accessing array elements in raster order should be cache neutral, but that doesn't seem to be the case. I'm not sure what governs the size of the data being loaded into the cache.

Can anything be done about it without changing underlying data structures in my code?


As an aside, can anyone recommend example macros for unrolling loops?

Thanks very much.

Brian Budge wrote:
In the example above, it's not only register allocation, but also
scheduling.  The data needs to be loaded from memory, and how that
happens can affect performance quite a bit.

And yeah, I can't understand how 8.1 could get decent performance
without instruction scheduling... but maybe I'm stuck in my own little
RISC processing world (the (toy) compilers I have written have been
for SPARC and MIPS), and I just don't understand enough about how the
pentium works.

  Brian


On Fri, 25 Feb 2005 14:24:27 -0500, Daniel Berlin <dberlin@xxxxxxxxxxx> wrote:

On Fri, 2005-02-25 at 12:18 +0100, Brian Budge wrote:

Hmmm, I doubt that.  It seems very important that your data be in
registers when you want to do arithmetic on it.

That's register allocation, not scheduling :)


I can see that if your data was already in registers, maybe a
"randomized" instruction ordering would perform okay, but loading the
data properly is time consuming.  At least these are the things I've
observed.


stevenb was the source of this information for me, so maybe he can confirm it (Steven, i mentioned to brian that icc 8.1 doesn't do scheduling for the pentium4 anymore, and he doubts it :P)




--
Richard Beare, CSIRO Mathematical & Information Sciences
Locked Bag 17, North Ryde, NSW 1670, Australia
Phone: +61-2-93253221 (GMT+~10hrs)  Fax: +61-2-93253200

Richard.Beare@xxxxxxxx

Attachment: relative_speed.pdf
Description: Adobe PDF document

