Re: template classes faster than derived classes?

Nava Whiteford wrote:

In this case the templated version doesn't seem to have the same huge
advantage. Templated 20.73s against 21.1s for the classed version. I would guess
real, but not huge.

Do these numbers seem reasonable?


I don't know exactly what the compiler optimized out, in particular whether it changed the divide by 2.2 into a multiply by the reciprocal.
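To make that distinction concrete, here is a minimal sketch of the two forms. The function names are mine, not from the original test. As far as I know, GCC will not turn the first form into the second on its own unless you pass something like -ffast-math (which implies -freciprocal-math), because 1/2.2 is not exactly representable and the two results can differ in the last bit.

```cpp
// Hypothetical helpers for illustration; not taken from the original benchmark.

// Straight division: normally compiles to a floating point divide instruction.
double divide_by_2_2(double x) {
    return x / 2.2;
}

// Reciprocal form: the constant 1.0 / 2.2 can be folded at compile time,
// leaving a single multiply at run time, which is typically much cheaper
// than a divide.
double multiply_by_reciprocal(double x) {
    return x * (1.0 / 2.2);
}
```

The two functions agree to within rounding error, but not necessarily bit for bit, which is why the compiler is cautious about substituting one for the other.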

I assume it optimized away the call for the template version. Apparently it no longer optimizes away the whole loop for the templated version, and apparently it optimizes away neither the vtable lookup nor the call for the non-templated version.
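For reference, this is roughly the shape of code I have in mind when I talk about the two versions. It is a guess at the structure, not your actual test: only get_i() and the divide by 2.2 come from this thread, and the class names, the member value and the accumulating loop are my assumptions.

```cpp
// Hypothetical reconstruction of the two variants being compared.
// Only get_i() and the divide by 2.2 come from the thread; the rest is assumed.

struct Base {
    virtual ~Base() = default;
    virtual int get_i() const = 0;
};

struct Derived : Base {
    int i = 7;
    int get_i() const override { return i; }
};

// Same interface as Derived, but with no virtual functions.
struct Plain {
    int i = 7;
    int get_i() const { return i; }
};

// "Classed" version: unless the compiler can prove the dynamic type,
// each iteration does a vtable lookup and an indirect call.
double run_virtual(const Base* c, long n) {
    double sum = 0.0;
    for (long k = 0; k < n; ++k)
        sum += c->get_i() / 2.2;
    return sum;
}

// Templated version: the static type is known, so get_i() can be inlined.
template <class T>
double run_templated(const T* c, long n) {
    double sum = 0.0;
    for (long k = 0; k < n; ++k)
        sum += c->get_i() / 2.2;
    return sum;
}
```

Accumulating into sum and returning it is one way to keep the compiler from discarding the loop entirely, provided the caller actually uses the result.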

Because the branch target is the same every time through the loop, there is no branch misprediction on the call. Similarly, there are no cache misses on the push and pop of the return address, etc. That makes the difference between an inlined call and a virtual call a lot smaller in this test than it would be in average use. But that alone would not make it as small as you measured. There is a bigger factor.

CPUs overlap a lot of operations. In particular, they overlap things like a floating point divide with all the flow-of-control work involved in that virtual call.

I'm not certain, but I think the optimized code combined with the ability of the CPU to execute ahead may mean the floating point divide (or maybe even the reciprocal multiply) is still pending as the CPU goes ahead into the next iteration's c->get_i().

So if c->get_i() is super fast (inlined) it may finish and then the CPU must wait for the divide before going further. If c->get_i() is much slower it still may be only a trivial amount slower than the divide, so the overlapped time is about the same.

If there were no such overlap, I'd expect a bigger difference between inline and virtual. If a divide were overlapped, I'd expect a virtual call with no branch misprediction to be entirely covered by the overlap, so no difference in total execution time. So your result seems to fit a multiply overlapped with the virtual call. But I'm far from sure. I'd need to see the generated asm code to make even a better guess.
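The three cases above can be put into a very crude model. This is only a sketch of the reasoning, not a description of how any real CPU schedules work, and the function names are mine.

```cpp
#include <algorithm>

// Toy cost model, in cycles per loop iteration. Any numbers fed into it
// are invented for illustration; they are not measurements of a real CPU.

// No overlap: the floating point operation and the call cost simply add.
int cycles_no_overlap(int fp_op, int call) {
    return fp_op + call;
}

// Full overlap: the shorter of the two is hidden behind the longer one.
int cycles_overlapped(int fp_op, int call) {
    return std::max(fp_op, call);
}
```

Take purely illustrative costs of 20 for a divide, 4 for a multiply, 1 for an inlined get_i() and 6 for a well-predicted virtual call. With no overlap the model gives 21 against 26. With an overlapped divide it gives 20 against 20, so no difference. With an overlapped multiply it gives 4 against 6, so a difference that exists but is smaller than the full cost of the call. Real hardware is much messier than this, which is why I would want to see the asm.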

As for the main question: As with most performance questions, simple tests lead to consistently distorted answers. Performance is a complex question.

