> minor nit because I see that very often. :)
> from the gcc man page:
>
>   -march=cpu-type
>       Generate instructions for the machine type cpu-type.
>       The choices for cpu-type are the same as for -mcpu.
>       Moreover, specifying -march=cpu-type implies
>       -mcpu=cpu-type.
>
> The extra -mcpu is unnecessary :).

Good to know! I've always been paranoid about those two, so I've been specifying both in my builds.

> And there's a -mmmx switch, but I don't know if it is that useful for DSP,
> since the mmx instructions are integer only.

One cost of MMX that doesn't exist for XMM (SSE) is an expensive "FP mode switch". The MMX registers alias the x87 floating-point register stack, so code that mixes MMX and x87 math has to execute EMMS before it can use floating point again. SSE has its own XMM registers and avoids this. There is apparently an option to have GCC generate both x87 and SSE code (-mfpmath=sse,387), but the documentation doesn't inspire confidence.

> Good thing you didn't recommend -O3. I've seen instances where it was
> much slower than -O2.

Yeah, it's definitely a riskier option. I've also seen it produce bloated binaries that run more slowly. With some code, I've even seen -O1 beat -O2 [Dan Bernstein's FFT library, to name just one].

The Intel compiler is a lot better in this regard. In fact, I've yet to find code where the Intel compiler's binaries don't humiliate GCC's. And yes, even on AMD hardware, the Intel-compiled code is faster.

The table below isn't 100% accurate, since the pipelining and instruction scheduling of the Pentium III is a bit of an oddball compared to both its successor and its predecessor. Regardless, I've found that code generally runs best on these AMD processors when tuned for the matching Intel CPU.

  AMD Athlon XP (Stepping 6 and later)    ~== Pentium III (SSE)
  AMD Athlon (Stepping 4 and earlier)     ~== Pentium II (MMX)
  AMD Duron Applebred (Stepping 8+)       ~== Pentium III (SSE)
  AMD Duron Morgan or older (Step <= 6)   ~== Pentium II (MMX)

Only Opterons have SSE2. No AMD CPUs support SSE3 currently, but SSE3's additions are petty improvements by comparison.
SSE2 is a big deal: it offers both floating-point and integer vector operations. I've used SSE2 to dramatically speed up everything from cryptographic code to pattern matching for fuzzy spam-signature detection.

Nineteen times out of twenty, when someone gives me a binary that runs faster on an Athlon XP than on an equivalently rated Pentium 4, and I recompile it with the optimal tuning for each CPU (Intel and AMD), the Pentium 4 blows the Athlon away. No questions asked, go make me a sammich, bitch.

Hard data? How does a 20-30% performance improvement in the OpenSSL crypto cores sound? And that's WITHOUT any hand-rolled SSE2 assembler.

The Athlon XP does seem to handle really bad code (stuff compiled for a 386/486, lots of unaligned accesses, etc.) much more gracefully than the Pentium 4. I've seen a commercial game server published by Interplay whose code was so bad that it ran slower on a 3.2GHz P4 than on an XP2400+ [running the exact same OS and configuration].

As always, YMMV; I could be a Dark Agent of Sauron trying to lure you away from more expensive CPUs.

=MB=
--
A focus on Quality.