Re: exit timing analysis v1 - comments&discussions welcome

Hollis Blanchard wrote:
On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
Wondering about that 30.5% for postprocessing and kvmppc_check_and_deliver_interrupts, I quickly checked it in detail - part d is now divided into 4 subparts. I also looked at the return-to-guest path to see whether the expected part (restoring the TLB) really is the main time eater there. The result clearly shows that it is.

more detailed breakdown:
a)  10.94% - exit, saving guest state (booke_interrupt.S)
b)   8.12% - reaching kvmppc_handle_exit
c)   7.59% - syscall exit is checked and an interrupt is queued via kvmppc_queue_exception
d1)  3.33% - some checks common to all exits
d2)  8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
d3) 17.20% - can_deliver / clear & deliver exception in kvmppc_check_and_deliver_interrupts
d4)  4.47% - updating kvm_stat statistics
e)   6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
f1) 29.18% - restoring guest TLB
f2)  4.69% - restoring guest state ([s]regs)

These fractions are % of our ~12µs syscall exit.
=> restoring the TLB on each reentry = ~4µs constant overhead
=> next, look a bit into IRQ delivery and other constant costs like kvm_stat updating

...
Now I'll go for the TLB restore in f1.

Hang on... does d3 make sense to you? It doesn't to me, and if there's a
bug there it will be easier to fix than rewriting the TLB code. :)
I haven't given up on improving that part either :-)
I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
I get the same results. 1% ~ 80 cycles.
Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
that time is going. We're assuming the find_first_bit() loop usually
executes once, for syscall. Does it actually execute more than that? I
don't expect any of kvmppc_can_deliver_interrupt(),
kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
take lots of time.
You can see below that I already had a more detailed breakdown in my old mail:
[...]
d2)  8.84% -  8.56% -  9.28% -  8.31% finding first bit in kvmppc_check_and_deliver_interrupts
d3)  6.53% -  5.25% -  6.63% -  5.10% can_deliver in kvmppc_check_and_deliver_interrupts
d4) 13.66% - 15.37% - 14.12% - 14.92% clear & deliver exception in kvmppc_check_and_deliver_interrupts
[...]
Could it be cache effects? exception_priority[] and priority_exception[]
are 16 bytes each, and our L1 cacheline is 32 bytes, so they should both
fit into one... except they're not aligned.
I would be so happy if I had hardware performance counters for things like cache misses :-)
Also, it looks like we use the generic find_first_bit(). That may be
more expensive than we'd like. However, since
vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
bitfield), we should be able to use ffs() instead, which has an
optimized PowerPC implementation. That might help a lot.
Good idea. I'll check this along with some other small improvements I have in mind.

We might even be able to replace find_next_bit() too, by shifting a mask
over each loop, but I don't think we'll have to, since I expect the
common case to be we can deliver the first pending exception. (Worth
checking? :)
I'm not sure. It's surely worth checking how often that second find_next_bit() is called.
If that number turns out to be very small, it's not worth it.

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
