I was curious, so I tried running your benchmark. It was too fast for
meaningful results, so I increased the counts in the calls to
simple_playout_benchmark::run and I noticed some negative and generally
unstable values for "clock cycles per playout".
So your code:
uint64 get_cc_time () volatile {
  uint64 ret;
  __asm__ __volatile__("rdtsc" : "=A" (ret) : :);
  return ret;
}
gives me values that aren't even monotonic.
I'm on a 64-bit dual core AMD system. My best guess is that the program
switches cores part way through the loop. But I really don't know enough
about either rdtsc or __asm__ __volatile__ to know whether there might
be other reasons.
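One other possible reason, offered as a guess rather than a diagnosis: the GCC documentation notes that on x86-64 the "A" constraint means either rax or rdx, not the edx:eax pair it denotes in 32-bit builds. If that applies here, a 64-bit build of the function above may return only part of the counter. Below is a minimal sketch of a variant that names both registers explicitly, assuming GCC or Clang on x86 or x86-64. The name get_cc_time_fixed is hypothetical and not part of the benchmark.

```cpp
#include <cstdint>

// Hypothetical replacement for get_cc_time(); not the benchmark's own code.
// Assumes GCC or Clang inline-asm syntax on an x86 or x86-64 target.
static inline std::uint64_t get_cc_time_fixed() {
    std::uint32_t lo, hi;
    // rdtsc puts the low 32 bits of the counter in EAX and the high 32 bits
    // in EDX. Naming both registers works in 32-bit and 64-bit builds,
    // whereas "=A" on x86-64 lets the compiler pick just one of RAX or RDX.
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

Even with this change, counters read on different cores are not guaranteed to agree on every machine, so it would not rule out the core-switching explanation.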
Are you running on a single core system? Or otherwise controlling for
such effects?
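For what it's worth, one way to control for core switching on Linux is to pin the process to a single core before timing anything. The following is only a sketch under that assumption (Linux with glibc). The helper name pin_to_first_allowed_cpu is made up for illustration and does not appear in the benchmark.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros; g++ normally defines it already
#endif
#include <sched.h>   // sched_getaffinity, sched_setaffinity, cpu_set_t

// Hypothetical helper: pin the calling thread to the first CPU it is
// currently allowed to run on. Returns that CPU index, or -1 on failure.
static int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0) return -1;
            return cpu;
        }
    }
    return -1;  // no allowed CPU found
}
```

Calling this once at the start of main, before the first timestamp is taken, should keep all rdtsc reads on the same core. Running the unmodified binary under taskset -c 0 should have a similar effect without touching the code.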
In other projects, I've found that OProfile is very effective at
tracking down the direct cause of performance differences. Have you
tried that? In much of what I do, the direct cause of a performance
difference is just a hint at the indirect true cause. But in an example
as simple as you've provided, the direct cause is the cause.
Are you building for 32-bit or 64-bit?
In 32-bit, gcc is really bad at dealing with the architecture's shortage
of registers. A tiny change anywhere can change gcc's register choices
leading into the critical loop and either cause or avoid a register
spill. That alone could cause a 10% difference.
Łukasz Lew wrote:
> I extracted only the benchmark part:
> http://www.mimuw.edu.pl/~lew/libego_benchmark.tgz