Nava Whiteford wrote:
In this case the templated version doesn't seem to have the same huge
advantage. Templated 20.73s against 21.1s for the classed version. I would guess
real, but not huge.
Do these numbers seem reasonable?
I don't know exactly what the compiler optimized out, especially whether
it changed the divide by 2.2 into a multiply by the reciprocal.
I assume it optimized away the call for the template version.
Apparently it no longer optimizes away the whole loop for the templated
version. Apparently it optimizes away neither the vtable lookup nor the
call for the non-templated version.
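
I haven't seen your exact code, so here is a minimal sketch of the kind of
test I am assuming. Only get_i() and the divide by 2.2 come from this
thread; the class names, the function names and the seconds() timing
helper are all hypothetical.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical stand-ins for the classes under test.
struct Base {
    virtual ~Base() = default;
    virtual int get_i() const = 0;
};

struct Derived : Base {
    int i = 7;
    int get_i() const override { return i; }
};

struct Plain {
    int i = 7;
    int get_i() const { return i; }  // non-virtual, easy to inline
};

// "Classed" version: the call goes through the vtable unless the
// compiler manages to devirtualize it.
double sum_virtual(const Base* c, std::size_t n) {
    double total = 0.0;
    for (std::size_t k = 0; k < n; ++k)
        total += c->get_i() / 2.2;
    return total;
}

// Templated version: the concrete type is known at compile time, so
// get_i() can be inlined.
template <typename T>
double sum_templated(const T* c, std::size_t n) {
    double total = 0.0;
    for (std::size_t k = 0; k < n; ++k)
        total += c->get_i() / 2.2;
    return total;
}

// Wall-clock time of one call to f, in seconds.
template <typename F>
double seconds(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

When timing this, print or otherwise use the returned totals. If the
result is unused, the compiler is free to drop the whole loop, which is
presumably what happened to the templated version in your earlier run.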
Because the branch is the same every time through the loop, there is no
branch misprediction on the call. Similarly, there are no cache misses on
the push and pop of the return address, etc. That makes the difference
between inlined and virtual call a lot smaller in this test than it would
be in average use. But not as small as you measured. There is a bigger
factor.
CPUs overlap a lot of operations. They especially overlap things like
floating point divide with all the flow of control things involved in
that virtual call.
I'm not certain, but I think the optimized code combined with the ability
of the CPU to execute ahead may mean the floating point divide (or maybe
even the reciprocal multiply) is still pending as the CPU goes ahead
into the next iteration of c->get_i().
So if c->get_i() is super fast (inlined) it may finish and then the CPU
must wait for the divide before going further. If c->get_i() is much
slower it still may be only a trivial amount slower than the divide, so
the overlapped time is about the same.
If there were no such overlap, I'd expect a bigger difference between
inline and virtual. If a divide were overlapped, I'd expect a virtual
call with no branch misprediction to be entirely covered by the
overlap, so no difference in total execution time. So your result seems
to fit a multiply overlapped with the virtual call. But I'm far from
sure. I'd need to see the generated asm code to make a better guess.
As for the main question: Like most performance questions, simple tests
lead to consistently distorted answers. Performance is a complex question.
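
As one illustration of the branch misprediction point above, a test can
be made somewhat less artificial by varying the call target from one
iteration to the next. The following is only a sketch: Item, ItemA,
ItemB, make_items() and sum_items() are hypothetical names, and the
50/50 random mix is an arbitrary choice, not a model of any real
workload.

```cpp
#include <cstddef>
#include <memory>
#include <random>
#include <vector>

// Hypothetical class hierarchy with two different get_i() targets.
struct Item {
    virtual ~Item() = default;
    virtual int get_i() const = 0;
};

struct ItemA : Item {
    int get_i() const override { return 1; }
};

struct ItemB : Item {
    int get_i() const override { return 2; }
};

// Build n objects. With mixed == false every call in the loop below has
// the same target; with mixed == true the target changes pseudo-randomly.
std::vector<std::unique_ptr<Item>> make_items(std::size_t n, bool mixed,
                                              unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution coin(0.5);
    std::vector<std::unique_ptr<Item>> items;
    items.reserve(n);
    for (std::size_t k = 0; k < n; ++k) {
        if (mixed && coin(rng))
            items.push_back(std::make_unique<ItemB>());
        else
            items.push_back(std::make_unique<ItemA>());
    }
    return items;
}

// Same work per element as the original test: one virtual call and one
// divide by 2.2.
double sum_items(const std::vector<std::unique_ptr<Item>>& items) {
    double total = 0.0;
    for (const auto& c : items)
        total += c->get_i() / 2.2;
    return total;
}
```

Timing sum_items() over a large uniform vector and over a large mixed
one would show how much of the virtual call cost the single-target loop
was hiding. It also adds a pointer load per element, so it is not a
like-for-like comparison with the original loop.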