Re: Performance problem

John Fine <johnsfine@xxxxxxxxxxx> · Wed, 24 Sep 2008 22:33:53 -0400

Łukasz Lew wrote:
Can you send me the log from my benchmark?
And your processor model?

The 64-bit build compiled with gcc 4.4 is a little faster than with 4.1.2.
Logs and /proc/cpuinfo emailed just to you.  I don't think most on 
gcc-help want to see all that.
If you can do the same for g++4.3, that would be very useful for me.

I don't have 4.3 installed here.  Maybe I'll get a chance elsewhere.
Is it possible to get a mixed view?
In Opannotate, specify both --source and --assembly and it gives you mixed.

Mixed is really ugly.  Why run OProfile at all if you're not compiling 
with optimization?  But how can you expect a mixed assembly and source 
view to make any sense after optimization?

The sane view, so far as I can tell, isn't available.  Maybe I'll figure 
out how to add it.  It should be an assembly view with an extra column 
on each line (probably after the stats and before the address) giving 
the source line number.
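
To make that layout concrete, here is a minimal sketch of how one row of such a view could be rendered. It is not opannotate code; the AsmRow struct, its field names, and format_row are all hypothetical, and the column widths are only a guess at something readable.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// One row of the proposed view. All names here are hypothetical.
struct AsmRow {
    long samples;          // sample count for this instruction
    double percent;        // share of all samples
    unsigned long address; // instruction address
    int source_line;       // source line the debug info maps this address to
    std::string insn;      // disassembled instruction text
};

// Render the stats, then the extra source-line column, then the address
// and the instruction, in that order.
std::string format_row(const AsmRow& r) {
    char buf[256];
    std::snprintf(buf, sizeof buf, "%6ld %7.4f : %5d : %lx:       %s",
                  r.samples, r.percent, r.source_line, r.address,
                  r.insn.c_str());
    return std::string(buf);
}
```

Calling format_row on the rep movsq row from the excerpt further down, with a made-up source line number, would put that number between the stats and the address, so hot instructions can be matched to source lines without interleaving the source text.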

Obviously Opannotate calls something that has a vague idea of the source 
line for each asm line (or mixed mode wouldn't be possible).  Just as 
obviously, Opannotate isn't consistent in the way it uses that data, or 
source view wouldn't miss almost everything despite this being such a 
simple program.

For example, a chunk of the mixed mode output looks like this:

              :  void load (const Board* save_board) {
              :    memcpy(this, save_board, sizeof(Board));
   27  0.0054 :  400d33:       mov    $0x602980,%edi
    2 4.0e-04 :  400d38:       mov    0x20(%rsp),%rsi
              :  400d3d:       mov    $0x199,%ecx
16639  3.3551 :  400d42:       rep movsq %ds:(%rsi),%es:(%rdi)
  328  0.0661 :  400d45:       jmpq   400f60 <_ZN24simple_playout_benchmark3runEPK5Boardj+0x300>
              :  400d4a:       nopw   0x0(%rax,%rax,1)

But in source view, the load method and its memcpy line are shown with 
zero execution time.  No source line has a value anywhere near 16639, 
and the total for all source lines is a tiny fraction of the correct total.
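
For comparison, here is a minimal sketch of the aggregation I would expect source view to do: add each asm line's sample count to the source line printed above it in mixed mode. It assumes the line layout seen in the excerpt above (stats, then " :", then either source text or "address: instruction"); that layout is inferred from this one excerpt, and the function names are hypothetical. A source line that itself starts with hex digits followed by a colon would be misread as asm.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// True if the text looks like "400d33:  mov ...": hex digits, then ':'.
static bool is_asm_line(const std::string& body) {
    std::size_t i = 0;
    while (i < body.size() &&
           std::isxdigit(static_cast<unsigned char>(body[i]))) ++i;
    return i > 0 && i < body.size() && body[i] == ':';
}

// Sum the sample count of every asm line onto the most recent source
// line seen above it. Source lines with no samples are listed with 0.
std::map<std::string, long> samples_per_source_line(const std::string& mixed) {
    std::map<std::string, long> totals;
    std::istringstream in(mixed);
    std::string line, current_source;
    while (std::getline(in, line)) {
        std::size_t sep = line.find(" :");
        if (sep == std::string::npos) continue;  // e.g. a wrapped symbol name
        std::string stats = trim(line.substr(0, sep));
        std::string body = trim(line.substr(sep + 2));
        if (!is_asm_line(body)) {
            current_source = body;
            totals[current_source] += 0;  // make sure the line is listed
            continue;
        }
        if (stats.empty() || current_source.empty()) continue;
        std::istringstream fields(stats);
        long count = 0;
        if (fields >> count) totals[current_source] += count;
    }
    return totals;
}
```

Fed the excerpt above, this attributes 27 + 2 + 16639 + 328 = 16996 samples to the memcpy line, which is the number source view ought to show there instead of zero.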

So Opannotate CAN associate addresses 400d33 through 400d4a with your 
source line containing the memcpy.  Unlike most of the rest of what I 
see in mixed view, that association is even correct.  But in source view 
it is still discarded.
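
The arithmetic backs up that the association is correct.  The asm loads a destination into %edi, a source into %rsi and a count of 0x199 into %ecx, and rep movsq copies that many 8-byte quadwords.  Here is a small sketch of the check; the Board below is a hypothetical stand-in that only has the right size, under the assumption that this one rep movsq is the whole copy, which would make sizeof(Board) 3272 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Bytes moved by rep movsq for a given count register value:
// one quadword is 8 bytes.
constexpr std::size_t rep_movsq_bytes(std::size_t quadwords) {
    return quadwords * 8;
}

// Hypothetical stand-in for the real Board; only its size matters here.
struct Board {
    unsigned char bytes[rep_movsq_bytes(0x199)];

    // Same shape as the load method in the quoted source.
    void load(const Board* save_board) {
        std::memcpy(this, save_board, sizeof(Board));
    }
};

// 0x199 is 409, and 409 quadwords are 3272 bytes.
static_assert(rep_movsq_bytes(0x199) == 3272, "409 * 8 == 3272");
static_assert(sizeof(Board) == 3272, "stand-in matches the copied size");
```

So a whole-object memcpy of a roughly 3 KB struct compiling to exactly this sequence is what you would expect, and the samples landing on the rep movsq belong to that memcpy line.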

Maybe source view intentionally discards anything inlined. But what a 
stupid thing to do.

I hope to find time to dig into the opannotate source code and figure 
some of this out.

Can you be more specific?
How do you know which part was inlined where?

For example, in mixed mode I see a couple lines of asm (no execution 
time) identified as being your source line
rep (ii, playout_cnt) {
followed directly by the "load" routine I quoted above, and with no 
return at the end of load.  Since the asm code at load is obviously 
correct and your source code calls load right after that rep thing, it 
is pretty obvious load was inlined at that point.

But do you observe the 10% difference in performance that I have on my machine?

No.

PS
Is there any alternative to OProfile?
If not, then why is it so undeveloped?

I sure would like to know.

More intrusive methods of profiling really don't fit the situations 
where I want profiling.  Raw sampling with minimal disruption, which 
VTune can do on Windows and OProfile on Linux, is exactly what I need.  
Both tools seem able to capture the data I want captured, but both 
present it through such a horrible combination of bugs and bad design 
as to make the results nearly useless.  (Then there is VTune for Linux, 
which I've also tried, but I never found any way to get any useful 
output from it at all.)

So if you find something better, please tell me.