Hi everyone,
I've done a few more experiments using various pieces of advice that have come back from the list. They shed some light on the problems I've been having.
Before I summarize the results I should mention that my initial motivation for this was to improve arithmetic operations on images. Any changes I propose would need to be compatible with our imaging libraries in which an image is simply a 1D array.
The errors pointed out to me were:
1) Use -march=pentium4 instead of -mcpu=pentium4
2) Use typedef float myvec __attribute__ ((vector_size (16)));
instead of
typedef int myvec __attribute__ ((mode(V4SF)));
The mode(V4SF) form is not compatible with newer versions of gcc (>3.4?).
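As an illustration of the vector_size form applied to an image stored as a flat 1D array, here is a minimal sketch. The names v4sf and image_add are hypothetical and not from our libraries, and float pixels are an assumption. The memcpy calls are there only so the sketch does not depend on the buffers being 16-byte aligned; whether gcc turns them into efficient SSE loads depends on the compiler version and flags (e.g. -O2 -march=pentium4).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed floats (16 bytes). With SSE enabled, GCC can keep a value
 * of this type in a single XMM register. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Hypothetical helper: dst[i] = a[i] + b[i] over a flat buffer of n pixels. */
void image_add(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && a != NULL && b != NULL);

    /* Main loop: four pixels per iteration. */
    for (; i + 4 <= n; i += 4) {
        v4sf va, vb, vr;
        memcpy(&va, a + i, sizeof va);   /* load without assuming alignment */
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;                    /* element-wise vector add */
        memcpy(dst + i, &vr, sizeof vr);
    }

    /* Scalar tail for the last n % 4 pixels. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Calling image_add(dst, a, b, 1600 * 1300) would process a whole image in one pass, with the data layout left as it is.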
These changes certainly improved the performance of the test I posted to the list.
However, when I went back to test my image arithmetic code with these changes, I found no difference.
I then did some more tests, which are summarized in the attached graph. These demonstrate, I think, that I was experiencing a cache problem with my image code. The images I was experimenting with were 1600x1300, so way too large to fit in cache.
I now need to do some thinking, and more advice would be appreciated. I'm going to experiment with oprofile to see what it tells me, but haven't done so yet.
I had always thought that accessing array elements in raster order should be cache neutral, but it doesn't seem to be the case. I'm not sure what governs the size of the data being loaded into the cache.
Can anything be done about it without changing underlying data structures in my code?
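One option that keeps the 1D array layout is to process the image in cache-sized chunks whenever several arithmetic passes are chained, so each chunk is reused while it is still in cache instead of streaming the whole image from memory once per pass. The sketch below assumes float pixels and a two-step operation (add, then scale). The function name and the CHUNK value are hypothetical, and the best chunk size would have to be measured on the target machine.

```c
#include <stddef.h>

/* Assumption: 4096 floats = 16 KB per buffer. Tune this by measurement. */
enum { CHUNK = 4096 };

/* Hypothetical example: out = (a + b) * scale, one chunk at a time. */
void add_then_scale(float *out, const float *a, const float *b,
                    float scale, size_t n)
{
    size_t start, end, i;

    for (start = 0; start < n; start += CHUNK) {
        /* Clamp the last chunk to the end of the image. */
        end = (n - start < CHUNK) ? n : start + CHUNK;

        /* Pass 1 on this chunk: add. */
        for (i = start; i < end; i++)
            out[i] = a[i] + b[i];

        /* Pass 2 on the same chunk: scale, while out[] is likely cached. */
        for (i = start; i < end; i++)
            out[i] *= scale;
    }
}
```

For a single pass such as one addition there is nothing to reuse, so chunking alone would not be expected to help; in that case the run time is probably bounded by memory bandwidth rather than by the arithmetic.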
As an aside, can anyone recommend example macros for unrolling loops?
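For reference, one common shape for such a macro is sketched below. UNROLL4, SUB_AT and image_sub are hypothetical names made up for this example. The macro takes an index variable, a count, and the name of a function-like macro that performs the work for one index. gcc's -funroll-loops option is the alternative if hand unrolling turns out not to be needed.

```c
#include <stddef.h>

/* Run BODY(k) for k = 0 .. n-1, four iterations per trip through the
 * main loop, followed by a tail loop for the remaining n % 4 items.
 * BODY must be the name of a function-like macro taking one index. */
#define UNROLL4(i, n, BODY)                          \
    do {                                             \
        for ((i) = 0; (i) + 4 <= (n); (i) += 4) {    \
            BODY((i));                               \
            BODY((i) + 1);                           \
            BODY((i) + 2);                           \
            BODY((i) + 3);                           \
        }                                            \
        for (; (i) < (n); (i)++) {                   \
            BODY((i));                               \
        }                                            \
    } while (0)

/* Hypothetical use: dst[k] = a[k] - b[k] over a flat buffer of n pixels. */
void image_sub(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;
#define SUB_AT(k) dst[(k)] = a[(k)] - b[(k)]
    UNROLL4(i, n, SUB_AT);
#undef SUB_AT
}
```

Passing the per-pixel operation as a macro name keeps the unrolled copies textually identical, so changing the unroll factor only means editing UNROLL4.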
Thanks very much.
Brian Budge wrote:
In the example above, it's not only register allocation, but also scheduling. The data needs to be loaded from memory, and how that happens can affect performance quite a bit.
And yeah, I can't understand how 8.1 could get decent performance without instruction scheduling... but maybe I'm stuck in my own little RISC processing world (the (toy) compilers I have written have been for SPARC and MIPS), and I just don't understand enough about how the pentium works.
Brian
On Fri, 25 Feb 2005 14:24:27 -0500, Daniel Berlin <dberlin@xxxxxxxxxxx> wrote:
On Fri, 2005-02-25 at 12:18 +0100, Brian Budge wrote:
Hmmm, I doubt that. It seems very important that your data be in registers when you want to do arithmetic on it.
That's register allocation, not scheduling :)
I can see that if your data was already in registers, maybe a "randomized" instruction ordering would perform okay, but loading the data properly is time consuming. At least these are the things I've observed.
stevenb was the source of this information for me, so maybe he can confirm it (Steven, i mentioned to brian that icc 8.1 doesn't do scheduling for the pentium4 anymore, and he doubts it :P)
--
Richard Beare, CSIRO Mathematical & Information Sciences
Locked Bag 17, North Ryde, NSW 1670, Australia
Phone: +61-2-93253221 (GMT+~10hrs)
Fax: +61-2-93253200
Richard.Beare@xxxxxxxx
Attachment: relative_speed.pdf (Adobe PDF document)