Re: Performance problem

John Fine <johnsfine@xxxxxxxxxxx> · Sun, 21 Sep 2008 19:39:03 -0400

I was curious, so I tried running your benchmark.  It ran too fast for 
meaningful results, so I increased the counts in the calls to 
simple_playout_benchmark::run.  I then noticed some negative and generally 
unstable values for "clock cycles per playout".

So your code:

 uint64 get_cc_time () volatile {
   uint64 ret;
   __asm__ __volatile__("rdtsc" : "=A" (ret) : :);
   return ret;
 }

gives me values that aren't even monotonic.

I'm on a 64-bit dual core AMD system.  My best guess is that the program 
switches cores part way through the loop. But I really don't know enough 
about either rdtsc or __asm__ __volatile__ to know whether there might 
be other reasons.
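
One candidate for such another reason, offered as a sketch rather than a 
diagnosis: the GCC documentation describes the "A" constraint as the 
edx:eax pair only when the value is a double word.  In a 64-bit build a 
uint64 fits in a single register, so "=A" selects either rax or rdx and 
half of the counter can be lost.  Reading the two halves through separate 
"=a" and "=d" outputs behaves the same in 32-bit and 64-bit builds.  The 
name read_tsc and the steady_clock fallback for non-x86 targets are 
hypothetical, not taken from libego.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from libego): read the time stamp counter.
// rdtsc leaves the low 32 bits in eax and the high 32 bits in edx, so
// each half gets its own output operand instead of relying on "=A".
inline std::uint64_t read_tsc() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  std::uint32_t lo, hi;
  __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
  return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
  // Assumption: on other targets fall back to a steady clock so that
  // the sketch still builds; this is not a cycle count.
  return static_cast<std::uint64_t>(
      std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```

Taking two readings around the loop and subtracting them gives the 
elapsed count.  This does not by itself rule out differences between the 
counters of two cores.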

Are you running on a single core system?  Or otherwise controlling for 
such effects?
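
One way to control for core switches on Linux is to pin the benchmark to 
a single CPU before timing anything.  The following is a minimal sketch 
assuming Linux with glibc; the function name pin_to_one_cpu is 
hypothetical.  It picks the first CPU the process is already allowed to 
use instead of assuming that CPU 0 is available.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros; g++ normally defines it
#endif
#include <sched.h>   // sched_getaffinity, sched_setaffinity, cpu_set_t

// Hypothetical helper: pin the calling thread to the first CPU in its
// current affinity mask.  Returns that CPU's index, or -1 on failure.
inline int pin_to_one_cpu() {
  cpu_set_t allowed;
  CPU_ZERO(&allowed);
  if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;

  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
    if (CPU_ISSET(cpu, &allowed)) {
      cpu_set_t one;
      CPU_ZERO(&one);
      CPU_SET(cpu, &one);
      // pid 0 means "the calling thread".
      return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
    }
  }
  return -1;  // empty affinity mask; should not happen in practice
}
```

Calling this once at the start of main keeps every later rdtsc reading 
on the same core.  The taskset utility can apply the same restriction 
from the shell without changing the program.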

In other projects, I've found that OProfile is very effective in 
tracking down the direct cause of performance differences.  Have you 
tried that?  In much of what I do, the direct cause of a performance 
difference is just a hint at the indirect true cause.  But in an example 
as simple as you've provided, the direct cause is the cause.

Are you building for 32-bit or 64-bit?

In 32-bit, gcc is really bad at dealing with the architecture's shortage 
of registers.  A tiny change anywhere can change gcc's register choices 
leading into the critical loop and either cause or avoid a register 
spill.  That alone could cause a 10% difference.

Łukasz Lew wrote:
I extracted only the benchmark part:
http://www.mimuw.edu.pl/~lew/libego_benchmark.tgz