We discovered a strange phenomenon in some code for solving tridiagonal systems. The following code highlights the problem:

    #include <stdlib.h>

    int main(int argc, char** argv)
    {
      int i, j;
      float temp;
      const int N = 1024;
      const int Nruns = 100000;
      float * x = (float * ) malloc(N*sizeof(float));

      for( j = 0; j < Nruns; ++j )
      {
        for( i = 0; i < N; ++i ) x[i] = i+1;

        // temp = x[N-1];                      // Variant 2
        for( i = N-2; i >= 0; --i )
        {
          // temp = x[i] = 1.0f + 2.0f*temp;   // Variant 2
          x[i] = 1.0f + 2.0f*x[i+1];           // Variant 1
        }
      }
    }

Look at the timings for Variant 1:

    gcc -O2 test.c
    time ./a.out
    real    0m14.347s
    user    0m14.301s
    sys     0m0.002s

After commenting out Variant 1 and activating both lines of Variant 2:

    gcc -O2 test.c
    time ./a.out
    real    0m11.541s
    user    0m11.466s
    sys     0m0.008s

Variant 1 again, this time with SSE enabled:

    gcc -O2 -msse -mfpmath=sse test.c
    time ./a.out
    real    0m0.676s
    user    0m0.672s
    sys     0m0.002s

Variant 2 with SSE enabled:

    gcc -O2 -msse -mfpmath=sse test.c
    time ./a.out
    real    0m0.426s
    user    0m0.424s
    sys     0m0.001s

Two questions:

1. Why does introducing the temporary variable give a performance increase of 20-30%?
2. Why is the SSE version so much faster in this case? I thought that GCC versions after 4 use SSE by default for FP math?