Re: exit timing analysis v1 - comments & discussions welcome

On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
> Wondering about that 30.5% for postprocessing and
> kvmppc_check_and_deliver_interrupts, I quickly checked that in detail;
> part d is now divided into 4 subparts.
> I also looked at the return-to-guest path to see whether the expected
> part (restoring the TLB) really is the main time eater there. The
> result clearly shows that it is.
> 
> more detailed breakdown:
> a)  10.94% - exit, saving guest state (booke_interrupt.S)
> b)   8.12% - reaching kvmppc_handle_exit
> c)   7.59% - syscall exit is checked and an interrupt is queued using
>              kvmppc_queue_exception
> d1)  3.33% - some checks common to all exits
> d2)  8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
> d3) 17.20% - can_deliver/clear & deliver exception in
>              kvmppc_check_and_deliver_interrupts
> d4)  4.47% - updating kvm_stat statistics
> e)   6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
> f1) 29.18% - restoring guest TLB
> f2)  4.69% - restoring guest state ([s]regs)
> 
> These fractions are % of our ~12µs syscall exit.
> => restoring the TLB on each reentry = ~4µs of constant overhead
> => worth looking a bit into irq delivery and other constant costs
> like kvm_stat updating
> 
...
> 
> Now I'll go for the TLB replacement in f1.

Hang on... does d3 make sense to you? It doesn't to me, and if there's a
bug there it will be easier to fix than rewriting the TLB code. :)

I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
of 12µs is 2064ns, and 2064 / 1.5 ≈ 1376, so call it roughly 1400
cycles. (Check my math.)

Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
that time is going. We're assuming the find_first_bit() loop usually
executes once, for syscall. Does it actually execute more than that? I
wouldn't expect any of kvmppc_can_deliver_interrupt(),
kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
take lots of time.

Could it be cache effects? exception_priority[] and priority_exception[]
are 16 bytes each, and our L1 cacheline is 32 bytes, so they should both
fit into one... except they're not aligned.
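
If alignment turns out to be the culprit, one cheap experiment would be
to force the pair onto a single line. A minimal sketch, assuming the
tables are plain byte arrays (the 16-byte size suggests so) and that
nothing else depends on their placement; accesses would then become
prio_tables.exception_priority[i] and so on:

	/* hypothetical grouping: 16 + 16 = 32 bytes, so one aligned
	 * struct puts both tables in exactly one 32-byte L1 line */
	static struct {
		unsigned char exception_priority[16];
		unsigned char priority_exception[16];
	} prio_tables ____cacheline_aligned;

(____cacheline_aligned comes from <linux/cache.h> and aligns to the L1
line size.)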

Also, it looks like we use the generic find_first_bit(). That may be
more expensive than we'd like. However, since
vcpu->arch.pending_exceptions is a single long (not an arbitrarily
sized bitfield), we should be able to use ffs() instead, which has an
optimized PowerPC implementation. That might help a lot.
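
A minimal sketch of that substitution, assuming the current code calls
find_first_bit() on the single pending_exceptions word. I'm using
__ffs() here, the 0-based kernel flavor, so the returned index keeps
the same meaning; everything around it is guessed, not the actual loop:

	unsigned long pending = vcpu->arch.pending_exceptions;
	unsigned int priority;

	if (pending) {
		/* __ffs() is undefined for 0, hence the check; on
		 * PowerPC it inlines to a cntlzw-based sequence
		 * instead of the generic out-of-line word loop */
		priority = __ffs(pending);
		...
	}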

We might even be able to replace find_next_bit() too, by shifting a
mask each time around the loop (sketched below), but I don't think
we'll have to, since I expect the common case to be that we can deliver
the first pending exception. (Worth checking? :)
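
For completeness, here's roughly what the whole loop could look like
with the mask trick. The function names are taken from this thread,
but the calling conventions and the stop-after-first-delivery behavior
are my assumptions, so treat it as a sketch to check against the real
code:

	unsigned long pending = vcpu->arch.pending_exceptions;

	while (pending) {
		unsigned int priority = __ffs(pending);

		if (kvmppc_can_deliver_interrupt(vcpu, priority)) {
			kvmppc_booke_deliver_interrupt(vcpu, priority);
			kvmppc_booke_clear_exception(vcpu, priority);
			break;	/* assumption: one delivery per exit */
		}

		/* can't deliver this one right now; drop it from the
		 * local copy so __ffs() finds the next priority */
		pending &= ~(1UL << priority);
	}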

-- 
Hollis Blanchard
IBM Linux Technology Center
