Łukasz Lew wrote:
I fixed the problem (I think) with rdtsc on 64-bit architectures.
http://www.mimuw.edu.pl/~lew/libego_benchmark.tgz
Seems to work. Why was it previously correct for 32-bit? Did the
32-bit compiler already combine the correct two registers?
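One plausible answer, assuming the earlier code read the counter through GCC's "=A" inline-asm constraint (the benchmark source is not shown here, so this is a guess): on 32-bit x86, "=A" names the EDX:EAX register pair for a 64-bit value, which is exactly where rdtsc leaves its result. On x86_64 the same constraint means RAX or RDX alone, so the high half is lost. The sketch below reads the two halves separately and joins them by hand, which works in both modes. The names combine_halves and read_cycle_counter are hypothetical and not taken from libego; on non-x86 targets the code falls back to std::chrono::steady_clock, whose ticks are not CPU cycles.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Join the two 32-bit halves that rdtsc leaves in EAX (low) and EDX (high).
constexpr std::uint64_t combine_halves(std::uint32_t lo, std::uint32_t hi) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Hypothetical helper (not from libego): returns the time-stamp counter on
// x86/x86_64 with GCC or Clang, and steady_clock ticks everywhere else.
inline std::uint64_t read_cycle_counter() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    std::uint32_t lo, hi;
    // "=a" and "=d" pin the outputs to EAX and EDX explicitly, so the
    // result does not depend on how "=A" is interpreted in 64-bit mode.
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_halves(lo, hi);
#else
    // Fallback assumption: clock ticks stand in for cycles on other targets.
    const auto ticks =
        std::chrono::steady_clock::now().time_since_epoch().count();
    assert(ticks >= 0);
    return static_cast<std::uint64_t>(ticks);
#endif
}
```

Taking the difference of two read_cycle_counter() calls around the benchmarked region gives an elapsed count. Note that rdtsc is not a serializing instruction, so very short regions may be measured imprecisely.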
You may well be right about the register allocation.
I tuned my code on gcc 4.2, and small "irrelevant" changes degraded
performance badly;
the asm output revealed, among other things, different register usage.
That doesn't really prove much. Without some very good output from
opannotate, I don't know how to tell the real reason for the performance
difference.
I use OProfile a lot, and tried to pinpoint the difference, but the asm
output is too different,
while the C++ annotation is too weak because of heavy inlining.
I'm trying to understand and/or fix the use of opannotate for some much
harder problems, so I was curious enough to try it on your program. I
compiled your program for x86_64 with gcc 4.4. Even if I got good results,
that wouldn't tell you anything about 32-bit gcc 4.3.
But I got surprisingly bad results. I haven't previously seen such bad
results from opannotate without using heavily templated code. But I
also haven't used a gcc 4.4 compiled program with opannotate before.
In --source mode nearly all the total time was missing (not associated
with any source line). In mixed source and assembly view, I think all
the time was shown, but I don't think the assembly code corresponded
very accurately with the source code and the time was in some very
surprising lumps. I usually can interpret such lumps (usually the
instruction after an L2 cache miss or the instruction after a
mispredicted branch). But that didn't seem to fit the execution time
lumps in your code.
The few points in your source code that had most of the total execution
time were inlined multiple times with different register usage each
time. No single inlined copy of any such routine had as much as 4% of the
total execution time. That tends to wreck the theory that a minor
change somewhere has caused a big difference by changing register
allocation. There wouldn't be that sort of correlation in the way it
changes register allocation across a bunch of different inlinings of the
same function that already differ from each other in register allocation.