Re: Performance problem

John Fine <johnsfine@xxxxxxxxxxx> · Wed, 24 Sep 2008 11:21:01 -0400

Łukasz Lew wrote:
> I fixed the problem (I think) with rdtsc on 64-bit architectures.
> http://www.mimuw.edu.pl/~lew/libego_benchmark.tgz

Seems to work.  Why was it previously correct for 32-bit?  Did the
32-bit compiler already combine the correct two registers?
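On the 32-bit question: one common cause, offered here as an assumption because the original rdtsc code is not shown in this thread, is GCC's "A" constraint. On 32-bit x86 it names the EDX:EAX pair for a 64-bit value, which is exactly where RDTSC puts its result. On x86_64 it means either RAX or RDX, so the high half is lost. A minimal sketch of a read that works in both modes follows; read_tsc and combine_halves are hypothetical names, not taken from libego.

```cpp
#include <chrono>
#include <cstdint>

// Join the two 32-bit halves that RDTSC returns into one 64-bit tick count.
inline std::uint64_t combine_halves(std::uint32_t lo, std::uint32_t hi) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Read the time-stamp counter (GCC-style inline asm assumed).
// RDTSC writes the low 32 bits to EAX and the high 32 bits to EDX
// in both 32-bit and 64-bit mode, so each half gets its own output.
inline std::uint64_t read_tsc() {
#if defined(__i386__) || defined(__x86_64__)
    std::uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_halves(lo, hi);
#else
    // Assumption: on non-x86 targets fall back to a steady clock so the
    // sketch still compiles; the unit is then clock ticks, not CPU cycles.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```

Using explicit "=a" and "=d" outputs keeps the result independent of how the compiler maps a double-word value onto registers.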
> You may be very right about the register allocation.
> I tuned my code on 4.2, and small "irrelevant" changes changed the
> performance badly; the asm output revealed, among other things,
> different registers.

That doesn't really prove much.  Without some very good output from
opannotate, I don't know how to tell the real reason for the performance
difference.

> I use OProfile a lot, and tried to pinpoint the difference, but the
> asm output is too different, while the C++ annotation is too weak
> because of heavy inlining.

I'm trying to understand and/or fix the use of opannotate for some much
harder problems, so I was curious enough to try it on your program.  I
compiled your program for x86_64 with gcc 4.4.  Even if I got good
results, that wouldn't tell you anything about 32-bit gcc 4.3.

But I got surprisingly bad results.  I haven't previously seen such bad 
results from opannotate without using heavily templated code.  But I 
also haven't used a gcc 4.4 compiled program with opannotate before.

In --source mode nearly all the total time was missing (not associated 
with any source line).  In mixed source and assembly view, I think all 
the time was shown, but I don't think the assembly code corresponded 
very accurately with the source code and the time was in some very 
surprising lumps.  I usually can interpret such lumps (usually the 
instruction after an L2 cache miss or the instruction after a 
mispredicted branch).  But that didn't seem to fit the execution time 
lumps in your code.

The few points in your source code that had most of the total execution
time were inlined multiple times, with different register usage each
time.  No single inlined copy of any such routine had as much as 4% of
the total execution time.  That tends to wreck the theory that a minor
change somewhere has caused a big difference by changing register
allocation.  The inlined copies of the same function already differ from
each other in register allocation, so a minor change would not be
expected to shift register allocation in a correlated way across all of
them.
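When samples are spread thinly across many inlined copies like this, one way to get a single total per routine is to force the routine out of line for a diagnostic build. The sketch below assumes GCC's noinline attribute; hot_step and run are hypothetical stand-ins, not functions from the program under discussion. The out-of-line call changes the very performance being measured, so the profile only shows where time goes, not how fast the normal build is.

```cpp
// Hypothetical stand-in for a small routine that normally gets inlined at
// many call sites. The attribute keeps one out-of-line copy, so a profiler
// can attribute every sample to a single symbol.
__attribute__((noinline)) static unsigned hot_step(unsigned x) {
    return x * 1664525u + 1013904223u;  // arbitrary cheap work
}

// Hypothetical driver: applies hot_step n times, starting from 1.
unsigned run(unsigned n) {
    unsigned x = 1u;
    for (unsigned i = 0; i < n; ++i) {
        x = hot_step(x);
    }
    return x;
}
```

Comparing such a build between two compiler versions can show whether the same routine dominates in both, before returning to the fully inlined code.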