Re: Performance problem

"Łukasz Lew" <lukasz.lew@xxxxxxxxx> · Thu, 25 Sep 2008 00:34:13 +0200



2008/9/24 John Fine <johnsfine@xxxxxxxxxxx>:
> Łukasz Lew wrote:
>>
>> I fixed the problem (I think) with rdtsc on 64bit architectures.
>> http://www.mimuw.edu.pl/~lew/libego_benchmark.tgz
>>
>
> Seems to work.  Why was it previously correct for 32 bit?  Did the 32 bit
> compiler already combine the correct two registers?

I have no idea.
But it seems not to compile on 32-bit.

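For reference, here is a minimal sketch of an rdtsc read that behaves the same on 32-bit and 64-bit x86 with GCC. It is not the code from libego_benchmark.tgz, and the names `combine_halves` and `read_tsc` are hypothetical. One possible explanation for the 32-bit question above (an assumption, since the original code is not shown here) is the GCC `"=A"` constraint: on i386 it denotes the EDX:EAX pair, so a 64-bit result is assembled automatically, while on x86-64 it selects a single register, so only part of the counter is returned. Naming EAX and EDX explicitly avoids depending on that.

```cpp
#include <cstdint>

// rdtsc returns the counter split into two 32-bit halves:
// the high half in EDX and the low half in EAX.
constexpr std::uint64_t combine_halves(std::uint32_t hi, std::uint32_t lo) {
  return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Compile-time sanity check of the combination.
static_assert(combine_halves(0x1u, 0x2u) == 0x100000002ULL,
              "high half must land in the upper 32 bits");

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
// Hypothetical helper: reads the time stamp counter on x86 targets.
inline std::uint64_t read_tsc() {
  std::uint32_t lo, hi;
  // "=a" and "=d" name EAX and EDX explicitly, so the same
  // statement is valid for both 32-bit and 64-bit builds.
  __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
  return combine_halves(hi, lo);
}
#endif
```

Calling `read_tsc()` before and after the measured region and subtracting gives elapsed ticks; the guard keeps the file compiling on non-x86 targets, where the helper is simply absent.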
>>>
>>> You may be very right about the register allocation.
>>> I tuned my code on 4.2 and small "irrelevant" changes changed the
>>> performance badly
>>> and asm output revealed among other things different registers.
>>>
>
> That doesn't really prove much.  Without some very good output from
> Opannotate, I don't know how to tell the real reason for the performance
> difference.

Indeed, but opannotate on the assembler output doesn't give much here.
The 10% difference is spread irregularly: some parts are slower, some are faster.
But the asm of both versions corresponds very well, except for
different registers and offsets.

>>
>>> I use Oprofile a lot, and tried to pinpoint the difference but asm
>>> output is too different
>>> while c++ annotation  is too weak because of heavy inlining.
>>>
>
> I'm trying to understand and/or fix the use of Opannotate for some much
> harder problems, so I was curious enough to try it on your program.  I
> compiled your program x86_64 with gcc 4.4.  Even if I got good results, that
> wouldn't tell you anything about 32 bit gcc 4.3.

Can you send me the log from my benchmark?
And your processor model?
If you can do the same for g++4.3, that would be very useful for me.

>
> But I got surprisingly bad results.  I haven't previously seen such bad
> results from opannotate without using heavily templated code.  But I also
> haven't used a gcc 4.4 compiled program with opannotate before.
>
> In --source mode nearly all the total time was missing (not associated with
> any source line).

I have the same problem with g++-4.3.
My guess is that this is due to heavy inlining.
BTW, you would be surprised how much slower it gets if you turn off
the always_inline gcc attribute.

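As an illustration of the attribute mentioned above, here is a minimal sketch of how forced inlining can be switched on and off for such an experiment. The macro names `FORCE_INLINE` and `NO_FORCE_INLINE` and both functions are hypothetical, not taken from libego; `__attribute__((always_inline))` itself is the GCC attribute.

```cpp
// Building with -DNO_FORCE_INLINE falls back to a plain `inline` hint,
// which leaves the inlining decision to the compiler.
#if defined(__GNUC__) && !defined(NO_FORCE_INLINE)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

// Hypothetical small hot function that should be inlined at every call site.
FORCE_INLINE int add_one(int x) { return x + 1; }

// Hypothetical caller: sums the incremented elements of v[0..n).
int sum_incremented(const int* v, int n) {
  int total = 0;
  for (int i = 0; i < n; ++i) {
    total += add_one(v[i]);  // inlined here when FORCE_INLINE is active
  }
  return total;
}
```

Compiling the same source once with and once without `-DNO_FORCE_INLINE` gives two binaries whose timings and profiles can be compared directly.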
> In mixed source and assembly view, I think all the time

Is it possible to get a mixed view?

> was shown, but I don't think the assembly code corresponded very accurately
> with the source code and the time was in some very surprising lumps.  I
> usually can interpret such lumps (usually the instruction after an L2 cache
> miss or the instruction after a mispredicted branch).  But that didn't seem
> to fit the execution time lumps in your code.

L1 misses hurt my code's performance as well.

>
> The few points in your source code that had most of the total execution time
> were inlined multiple times with different register usage each time.  No one
> inline copy of any such routine had as much as 4% of the total execution
> time.  That tends to wreck the theory that a minor change somewhere has
> caused a big difference by changing register allocation.

Can you be more specific?
How do you know which part was inlined where?

>  There wouldn't be
> that sort of correlation in the way it changes register allocation across a
> bunch of different inlinings of the same function that already differ from
> each other in register allocation.

But do you observe the 10% performance difference that I see on my machine?

This is getting promising, thanks for your help.
Lukasz

PS
Is there any alternative to OProfile?
If not, then why is it so undeveloped?