> i imagine you'd have a blast at somewhere like this:
>
> http://forums.gentoo.org/viewtopic.php?t=5717
>
> (there's bad and good advice in there)

Yeah... I'm not sure I see real improvements from -O3 and some of the options they advise. In fact, I've seen some reduced performance, probably due to over-aggressive loop unrolling and whatnot. There's a subtle interplay between loop-unroll counts, L1 cache spilling, etc, etc, which is not a trivial problem. It's made worse by the fact that you don't know how many loop iterations are "normal" for fully dynamic loops; in the general case that's the Halting Problem.

When I code in assembler, I find that unrolling to the point where the data access/processing stages are chunking one L1 cache line per iteration is probably ideal, since you can sprinkle touch-prefetches of the successor data cache line partway through, so its first 4-8 bytes are already in the L1 transit area by the time your next iteration begins. I did this on the PowerPC 604 for an MD5 checksum routine, with stunning performance results. With today's hyper-pipelined CPUs, stalls are very expensive. (There's a rough modern-C sketch of the idea at the end of this post.)

You also tend to be roadkill for subtle/bizarre bugs in the code optimiser when you crank it up to maximum levels like that. I instinctively distrust that zone, myself. There will always be some code which runs fastest at -O1; Dan Bernstein's djbfft library is one clear example.

When I was writing C for a compiler I knew was stupid (the pre-GCC-3 compilers, or the old Metrowerks compilers on the Macintosh), I tended to "guide" its code generation by expressing the functionality in a way that got the compiler to produce the code I wanted. The result often looks "noisy" or "simple and inelegant": hoisting invariant stuff into locally declared temporaries so the stack frame stays small or, better still, so the variables fit nicely into registers. But it ends up compiling to faster-running code than the "elegantly written" C did. (Again, a small sketch is below.)

I don't disagree with much of DJB's rants on the subject, but there are genuine cases where fully hinting the compiler is harder than it's worth. In the case of Intel's compiler, it's bound to find a vectorising opportunity more readily than you will, unless you have plenty of free time on your hands.

Compilers *ARE* improving... there was a time, though, when I couldn't rely on them to reduce an unsigned integer multiplication or division by a constant power of two to a bit shift. These cycles do add up, people, and when you're doing them millions of times they add up to real seconds and minutes of your life.

Naturally, this is all academic for a program you're only going to use once or twice and forget about.

=MB=

--
A focus on Quality.
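
Since it's easier to show than describe: here's roughly what I mean by unrolling to one L1 line per iteration, sketched in modern C with GCC's __builtin_prefetch standing in for the PowerPC cache-touch instruction (dcbt). The 32-byte line size, the trivial word-sum loop, and the names are purely illustrative; this is not the original 604 assembler or the MD5 routine.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative sketch only. Assumes a 32-byte L1 line (eight 32-bit
     * words) and uses GCC's __builtin_prefetch in place of a dcbt-style
     * cache touch. */
    uint32_t sum_words(const uint32_t *buf, size_t nwords)
    {
        uint32_t acc = 0;
        size_t i = 0;

        /* One L1 line per iteration; touch the *next* line early so it
         * is already in transit when the following iteration starts. */
        for (; i + 8 <= nwords; i += 8) {
            __builtin_prefetch(&buf[i + 8], 0, 3);  /* read, keep in cache */
            acc += buf[i]     + buf[i + 1] + buf[i + 2] + buf[i + 3];
            acc += buf[i + 4] + buf[i + 5] + buf[i + 6] + buf[i + 7];
        }

        /* leftover words */
        for (; i < nwords; i++)
            acc += buf[i];

        return acc;
    }

The arithmetic doesn't matter; the point is that the unroll factor matches the line size and the prefetch distance is exactly one line ahead.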
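
And here's the kind of "noisy" hand-hoisting I mean when guiding a dumb compiler. The struct and the scaling loop are made-up examples; only the shape of the code matters.

    #include <stddef.h>

    struct params { float scale; /* ... other fields ... */ };

    /* "Elegant" version: trusts the optimiser to notice that p->scale
     * is loop-invariant and hoist the load itself. */
    void scale_all(const struct params *p, float *v, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            v[i] *= p->scale;
    }

    /* "Noisy" version: the invariant is hoisted by hand into a local,
     * so even a stupid compiler keeps it in a register across iterations
     * instead of reloading it (it generally can't prove that stores
     * through v don't clobber p->scale). */
    void scale_all_hinted(const struct params *p, float *v, size_t n)
    {
        const float s = p->scale;
        for (size_t i = 0; i < n; i++)
            v[i] *= s;
    }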
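
As for the power-of-two point: for unsigned integers the shift really is an exact replacement, which is why it grated when a compiler wouldn't do it for me. Trivial, but for the record:

    #include <assert.h>

    int main(void)
    {
        unsigned int x = 1234567u;

        /* For unsigned operands these are exact equivalences, so a
         * compiler is always free to emit the shift instead. */
        assert(x * 8u == (x << 3));
        assert(x / 8u == (x >> 3));

        return 0;
    }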