On 12/15/06, Ron <rjpeace@xxxxxxxxxxxxx> wrote:
I'm looking more closely into exactly what the various gcc -O optimizations do on Kx's as well. 64-bit vs. 32-bit gives x86-compatible code access to ~2x as many registers, and MMX or SSE instructions give you access to not only more registers, but wider ones as well. As one wit has noted, "all optimization is an exercise in caching." (Terje Mathisen, one of the better assembler coders on the planet.) It seems unusual that code generation options which give access to more registers would ever result in slower code...
The slowdown is probably due to the unroll-loops switch (-funroll-loops), which can actually hurt performance because of the larger code footprint (poorer instruction cache locality). The extra registers are not all that important because of pipelining and other hardware tricks. Pretty much all the old assembly strategies, such as forcing local variables into registers, are basically obsolete...especially with regard to integer math.

As I said before, modern CPUs are essentially RISC engines with a CISC preprocessing engine laid on top. Things are much more complicated than they were in the old days, when you could count instructions as part of the assembly optimization process.

I suspect that there is little or no difference between -march=i686 and the various specific architectures. Did anybody think to look at the binaries and see how much they actually differ? I bet that code compiled with -march=opteron will run just fine on a Pentium 2 if compiled for 32 bit.

merlin
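
For anyone who wants to check how much the binaries really differ, here is a minimal sketch in Python. It assumes you have already built the same source twice with different -march settings; the function names and any file names you pass in are hypothetical, not something from this thread. It only compares bytes position by position, so a single inserted instruction shifts everything after it and inflates the count. Diffing the output of objdump -d would tell you more, but this gives a quick first impression.

```python
import sys


def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions at which two byte strings differ.

    Bytes past the end of the shorter input all count as differences.
    This is a naive positional comparison, not an alignment-aware diff.
    """
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def compare_files(path_a: str, path_b: str) -> tuple[int, int, int]:
    """Return (size of A, size of B, number of differing bytes)."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    return len(a), len(b), count_differing_bytes(a, b)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: pass the paths of two builds of the same
    # program, each compiled with a different -march setting.
    size_a, size_b, diff = compare_files(sys.argv[1], sys.argv[2])
    print(f"{sys.argv[1]}: {size_a} bytes")
    print(f"{sys.argv[2]}: {size_b} bytes")
    print(f"differing bytes: {diff}")
```

The file sizes alone are also worth a look: if the unrolled build is noticeably larger, that is consistent with the code footprint explanation above.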