2008/9/24 John Fine <johnsfine@xxxxxxxxxxx>:
> Łukasz Lew wrote:
>>
>> I fixed the problem (I think) with rdtsc on 64bit architectures.
>> http://www.mimuw.edu.pl/~lew/libego_benchmark.tgz
>>
>
> Seems to work. Why was it previously correct for 32 bit? Did the 32 bit
> compiler already combine the correct two registers?

I have no idea.
But it seems not to compile on 32bit.

>>>
>>> You may be very right about the register allocation.
>>> I tuned my code on 4.2 and small "irrelevant" changes changed the
>>> performance badly
>>> and asm output revealed among other things different registers.
>>
>
> That doesn't really prove much. Without some very good output from
> Opannotate, I don't know how to tell the real reason for the performance
> difference.

indeed, but opannotate on assembler doesn't give much here.
the 10% difference is spread irregularly. some parts are slower, some are faster.
but the asm of both versions corresponds to each other very well except for
different registers and offsets.

>>> I use Oprofile a lot, and tried to pinpoint the difference but asm
>>> output is too different
>>> while c++ annotation is too weak because of heavy inlining.
>>
>
> I'm trying to understand and/or fix the use of Opannotate for some much
> harder problems, so I was curious enough to try it on your program. I
> compiled your program x86_64 with gcc 4.4. Even if I got good results, that
> wouldn't tell you anything about 32 bit gcc 4.3.

Can you send me the log from my benchmark?
And your processor model?
If you can do the same for g++4.3, that would be very useful for me.

>
> But I got surprisingly bad results. I haven't previously seen such bad
> results from opannotate without using heavily templated code. But I also
> haven't used a gcc 4.4 compiled program with opannotate before.
>
> In --source mode nearly all the total time was missing (not associated with
> any source line).

I have the same problem with g++-4.3.
My guess is that this is due to heavy inlining.
btw. you would be surprised how much slower it gets if you turn off
the always_inline gcc attribute.

> In mixed source and assembly view, I think all the time

Is it possible to get a mixed view?

> was shown, but I don't think the assembly code corresponded very accurately
> with the source code and the time was in some very surprising lumps. I
> usually can interpret such lumps (usually the instruction after an L2 cache
> miss or the instruction after a mispredicted branch). But that didn't seem
> to fit the execution time lumps in your code.

L1 misses hit my code performance as well.

>
> The few points in your source code that had most of the total execution time
> were inlined multiple times with different register usage each time. No one
> inline copy of any such routine had as much as 4% of the total execution
> time. That tends to wreck the theory that a minor change somewhere has
> caused a big difference by changing register allocation.

Can you be more specific?
How do you know which part was inlined where?

> There wouldn't be
> that sort of correlation in the way it changes register allocation across a
> bunch of different inlinings of the same function that already differ from
> each other in register allocation.

but do you observe the 10% difference in performance that I have on my machine?

This is getting promising, thanks for your help.
Lukasz

PS
Is there any alternative to OProfile?
If not, why is it so undeveloped?
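
PPS
On the rdtsc question above, here is a minimal sketch of one common cause. It is not necessarily what happened in libego. With gcc inline asm, the "=A" constraint means the edx:eax pair on 32 bit, so a 64 bit result is combined automatically. On x86_64 it means rax or rdx alone, so the high half gets lost. Reading the two halves explicitly should behave the same on both. The names combine_halves and read_cycle_counter are made up for illustration, and the non-x86 fallback to a steady clock is my assumption, not part of the benchmark.

```cpp
#include <chrono>
#include <cstdint>

// Join the two 32-bit halves that rdtsc leaves in EDX (high) and EAX (low).
// Hypothetical helper name, used only in this sketch.
constexpr std::uint64_t combine_halves(std::uint32_t lo, std::uint32_t hi) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Compile-time sanity check of the bit layout.
static_assert(combine_halves(1u, 2u) == 0x200000001ULL,
              "high half must land in bits 32..63");

// Hypothetical wrapper: reads the time stamp counter on x86 with gcc/clang.
inline std::uint64_t read_cycle_counter() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    std::uint32_t lo, hi;
    // "=a" binds EAX and "=d" binds EDX, on both 32 bit and 64 bit targets.
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_halves(lo, hi);
#else
    // Assumption: on other targets fall back to a steady clock tick count,
    // which is not a cycle count but still usable for relative timing.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```

Timing a region is then the difference of two read_cycle_counter() calls. The counter is not serialized here, so very short regions may be measured imprecisely.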