Re: template classes faster than derived classes?

"John S. Fine" <johnsfine@xxxxxxxxxxx> · Tue, 24 Nov 2009 16:22:51 -0500

Nava Whiteford wrote:

> In this case the templated version doesn't seem to have the same huge
> advantage. Templated 20.73s against 21.1s for the classed version. I would guess
> real, but not huge.
>
> Do these numbers seem reasonable?

I don't know exactly what the compiler optimized out, especially whether 
it changed the divide by 2.2 into a multiply by the reciprocal.
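
For what it's worth, a compiler normally isn't free to make that change on its own: 1.0 / 2.2 is not exactly representable, so x / 2.2 and x * (1.0 / 2.2) can differ in the last bit for some inputs. GCC, for example, only does the substitution under -freciprocal-math (implied by -ffast-math), or when the divisor is a power of two and the reciprocal is exact. A minimal sketch of a check, assuming nothing about your actual test; count_mismatches is a name I made up for illustration:

```cpp
// Count the integers x in [1, n] for which x / d and x * (1 / d)
// produce different doubles. Hypothetical helper, illustration only.
int count_mismatches(double d, int n) {
    // volatile keeps the compiler from folding or rewriting the arithmetic
    // and forces each result to be rounded to a real double.
    volatile double r = 1.0 / d;
    int mismatches = 0;
    for (int i = 1; i <= n; ++i) {
        volatile double x = i;
        volatile double q = x / d;   // true divide
        volatile double p = x * r;   // multiply by the rounded reciprocal
        if (q != p) ++mismatches;
    }
    return mismatches;
}
```

Calling count_mismatches(2.0, n) returns 0 because 0.5 is exact; calling it with 2.2 shows how often the two forms disagree on your machine.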

I assume it optimized away the call for the template version.  
Apparently it no longer optimizes away the whole loop for the templated 
version, and apparently it optimizes away neither the vtable lookup nor 
the call for the non-templated version.

Because the branch is the same every time through the loop, there is no 
branch misprediction on the call.  Similarly no cache misses on the push 
and pop of the return address etc.  That makes the difference between 
inlined and virtual call a lot smaller in this test than it would be in 
average use.  But not as small as you measured.  There is a bigger factor.

CPUs overlap a lot of operations.  They especially overlap things like 
floating point divide with all the flow of control things involved in 
that virtual call.

I'm not certain, but I think the optimized code combined with the ability 
of the CPU to execute ahead may mean the floating point divide (or maybe 
even the reciprocal multiply) is still pending as the CPU goes ahead 
into the next iteration of c->get_i().

So if c->get_i() is super fast (inlined) it may finish and then the CPU 
must wait for the divide before going further.  If c->get_i() is much 
slower it still may be only a trivial amount slower than the divide, so 
the overlapped time is about the same.

If there were no such overlap, I'd expect a bigger difference between 
inline and virtual.  If a divide were overlapped, I'd expect a virtual 
call with no branch misprediction to be entirely covered by the 
overlap, so no difference in total execution time.  So your result seems 
to fit a multiply (rather than a divide) overlapped with the virtual 
call.  But I'm far from sure.  I'd need to see the generated asm code to 
make a better guess.

As for the main question:  Like most performance questions, simple tests 
lead to consistently distorted answers.  Performance is a complex question.