Strange performance behaviour

Harald Grossauer <harald.grossauer@xxxxxxxxxx> · Wed, 06 Feb 2008 14:36:42 +0100

We discovered a strange phenomenon in some code for solving tridiagonal
systems.

The following code highlights the problem:

#include <stdlib.h>
int main(int argc, char** argv) {
	int i, j;
	float temp;
	const int N = 1024;
	const int Nruns = 100000;
	float * x = (float * ) malloc(N*sizeof(float));
	for( j = 0; j < Nruns; ++j ) {
		for( i = 0; i < N; ++i ) x[i] = i+1;
		// temp = x[N-1]; // Variant 2
		for( i = N-2; i >= 0; --i ) {
			// temp = x[i] = 1.0f + 2.0f*temp; // Variant 2
			x[i] = 1.0f + 2.0f*x[i+1]; // Variant 1
		}
	}
	free(x);
	return 0;
}
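
To make the comparison easier to reproduce, the two variants can also be written as separate functions in one file. This is only a sketch: the names `sweep_variant1` and `sweep_variant2` are made up for illustration, and the loop bodies are copied from the listing above.

```c
/* Hypothetical helper names; the loops match the listing above.
   Both functions assume n >= 1. */

/* Variant 1: each step reads the previous result back from the array. */
void sweep_variant1(float *x, int n) {
	int i;
	for (i = 0; i < n; ++i) x[i] = i + 1;
	for (i = n - 2; i >= 0; --i) {
		x[i] = 1.0f + 2.0f * x[i + 1];
	}
}

/* Variant 2: the previous result is carried along in a local temporary. */
void sweep_variant2(float *x, int n) {
	int i;
	float temp;
	for (i = 0; i < n; ++i) x[i] = i + 1;
	temp = x[n - 1];
	for (i = n - 2; i >= 0; --i) {
		temp = x[i] = 1.0f + 2.0f * temp;
	}
}
```

For small n, where every value is exactly representable as a float, both functions leave the same values in x. Calling each one Nruns times from main and timing the two loops separately (for example with clock() from <time.h>) would give the numbers for both variants from a single binary.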

Look at the timings:

gcc -O2 test.c
time ./a.out

real    0m14.347s
user    0m14.301s
sys     0m0.002s

After commenting out Variant 1 and activating both lines from Variant 2:

gcc -O2 test.c
time ./a.out

real    0m11.541s
user    0m11.466s
sys     0m0.008s

Variant 1 again:

gcc -O2 -msse -mfpmath=sse test.c
time ./a.out

real    0m0.676s
user    0m0.672s
sys     0m0.002s

Variant 2:

gcc -O2 -msse -mfpmath=sse test.c
time ./a.out

real    0m0.426s
user    0m0.424s
sys     0m0.001s

Why does introducing the temporary variable cut the run time by roughly
20% without the SSE flags and by about 37% with them?
Why is the SSE version more than 20 times faster in this case?

I thought that GCC 4 and later use SSE by default for FP math?
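
A small check along the following lines might show which floating-point unit a given set of flags actually selects. It is a sketch that relies on GCC's predefined macro `__SSE_MATH__` and on the C99 macro `FLT_EVAL_METHOD` from <float.h>; the function names `fp_math_unit` and `fp_eval_method` are made up for illustration.

```c
#include <float.h>

/* Hypothetical helper: reports which unit the compiler uses for float
   math, based on a GCC predefined macro (an assumption about the compiler). */
const char *fp_math_unit(void) {
#if defined(__SSE_MATH__)
	return "sse";
#else
	return "x87 or other";
#endif
}

/* C99: 0 means float expressions are evaluated in float precision,
   2 means long double precision, -1 means indeterminate.
   Returns -1 as well when the macro is not available (pre-C99 mode). */
int fp_eval_method(void) {
#ifdef FLT_EVAL_METHOD
	return FLT_EVAL_METHOD;
#else
	return -1;
#endif
}
```

Printing both values from main, once with plain -O2 and once with -O2 -msse -mfpmath=sse, would show whether the two builds really differ in the unit they use.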