* On Tue Dec 20 11:34:35 +0100 2011, Jonathan Wakely wrote:
> > I have been reducing the program to see what the smallest code is that still
> > shows this behaviour. Latest version is below.
> >
> > $ gcc -msse -mfpmath=sse -O3 -march=native test.c
>
> What is "native" for your system, i686?
> (also, what does gcc -dumpmachine show?)

i486-linux-gnu

> i686 doesn't support SSE, you need at least pentium3.
>
> Remove the -msse and see if you get a warning telling you SSE
> instructions are disabled.

True

> Try -march=pentium3 -mfpmath=sse instead (without -msse)
>
> If you don't have at least a pentium3, you're stuck with the 387 FP
> registers, and have to use horrible code.

> That looks as though you're still not using SSE registers.

The inner loop boils down to this (-msse -mfpmath=sse -O3 -march=native)

 8048370:  66 0f 28 c1    movapd %xmm1,%xmm0
 8048374:  83 e8 01       sub    $0x1,%eax
 8048377:  f2 0f 59 c2    mulsd  %xmm2,%xmm0
 804837b:  66 0f 28 c8    movapd %xmm0,%xmm1
 804837f:  f2 0f 59 ca    mulsd  %xmm2,%xmm1
 8048383:  75 eb          jne    8048370 <main+0x40>

or this (-march=pentium3 -mfpmath=sse -O3)

 8048360:  dd d9          fstp   %st(1)
 8048362:  83 e8 01       sub    $0x1,%eax
 8048365:  d8 c9          fmul   %st(1),%st
 8048367:  d9 c0          fld    %st(0)
 8048369:  d8 ca          fmul   %st(2),%st
 804836b:  75 f3          jne    8048360 <main+0x30>

The first runs about twice as fast as the latter, but I still see a huge
difference in run time depending on the 'f' in the original code.

-- 
:wq
^X^Cy^K^X^C^C^C^C
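
For reference, a minimal C sketch of a loop with the same shape as the two
listings above: a down-counting loop with two dependent multiplications by
the same factor per iteration. This is NOT the original test.c, which is
not reproduced in this message. The names repeated_mul, time_repeated_mul
and LOOP_DEMO_MAIN are hypothetical, and the factor 'f' is taken from the
command line only as an assumption about how one might vary it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the reduced loop (the real test.c is not shown).
 * Each iteration multiplies x by f twice, which mirrors the pair of
 * mulsd / fmul instructions in the disassembly. */
double repeated_mul(double x, double f, long iterations)
{
    while (iterations-- > 0) {
        x *= f;
        x *= f;
    }
    return x;
}

/* Runs repeated_mul once, stores its value in *result, and returns the
 * CPU time used in seconds as measured by clock(). */
double time_repeated_mul(double x, double f, long iterations, double *result)
{
    clock_t start = clock();
    *result = repeated_mul(x, f, iterations);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

#ifdef LOOP_DEMO_MAIN
/* Optional driver: ./a.out <f>   (f is an arbitrary factor to experiment with) */
int main(int argc, char **argv)
{
    double f, result, seconds;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <f>\n", argv[0]);
        return 1;
    }
    f = strtod(argv[1], NULL);
    seconds = time_repeated_mul(1.0, f, 100000000L, &result);
    printf("f=%g result=%g cpu=%.3fs\n", f, result, seconds);
    return 0;
}
#endif
```

Building it with -DLOOP_DEMO_MAIN and each of the two flag sets quoted above,
then running it with a few different values of f, should give a rough way to
compare run times. The generated code may not match the listings exactly,
since that depends on the compiler version and on what the original program
does around the loop.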