2017-03-14 19:39+0100, Christoffer Dall:
> On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
>> 2017-03-14 09:26+0100, Christoffer Dall:
>> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
>> >> 2017-03-08 02:57-0800, Christoffer Dall:
>> >> > Hi Paolo,
>> >> >
>> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
>> >> > before enabling interrupts when coming back from the guest.
>> >> >
>> >> > Unfortunately, this appears to mess up my view of CPU usage using
>> >> > something like htop on the host, because it appears all time is spent
>> >> > inside the kernel.
>> >> >
>> >> > From my analysis, I think this is because we never handle any interrupts
>> >> > before enabling interrupts, where the x86 code does its
>> >> > handle_external_intr, and the result on ARM is that we never increment
>> >> > jiffies before doing the vtime accounting.
>> >>
>> >> (Hm, the counting might be broken on nohz_full then.)
>> >>
>> >
>> > Don't you still have a scheduler tick even with nohz_full and something
>> > that will eventually update jiffies then?
>>
>> Probably, I don't understand jiffies accounting too well and didn't see
>> anything that would bump the jiffies in or before guest_exit_irqoff().
>
> As far as I understand, from my very very short time of looking at the
> timer code, jiffies are updated on every tick, which can be caused by a
> number of events, including *any* interrupt handler (coming from idle
> state), soft timers, timer interrupts, and possibly other things.

Yes, I was thinking that entering/exiting user mode should trigger it as
well, in order to correctly account for time spent there, but couldn't
find it ...

The case I was wondering about is if the kernel spent e.g. 10 jiffies in
guest mode and then exited on mmio -- no interrupt in the host, and
guest_exit_irqoff() would flip accounting over to system time.
Can those 10 jiffies get accounted to system (not guest) time?

Accounting 1 jiffy wrong is normal as we expect that the distribution of
guest/kernel time in the jiffy is going to be approximated over a longer
sampling period, but if we could account multiple jiffies wrong, this
expectation is very hard to defend.

>> >> > So my current idea is to increment jiffies according to the clocksource
>> >> > before calling guest_exit_irqoff, but this would require some main
>> >> > clocksource infrastructure changes.
>> >>
>> >> This seems similar to calling the function from the timer interrupt.
>> >> The timer interrupt would be delivered after that and only wasted time,
>> >> so it might actually be slower than just delivering it before ...
>> >
>> > That's assuming that the timer interrupt hits at every exit.  I don't
>> > think that's the case, but I should measure it.
>>
>> There cannot be less vm exits and I think there are far more vm exits,
>> but if there was no interrupt, then jiffies shouldn't raise and we would
>> get the same result as with plain guest_exit_irqoff().
>>
>
> That's true if you're guaranteed to take the timer interrupts that
> happen while running the guest before hitting guest_exit_irqoff(), so
> that you eventually count *some* time for the guest.  In the arm64 case,
> if we just do guest_exit_irqoff(), we *never* account any time to the
> guest.

Yes.  I assumed that if jiffies should be incremented, then you also get
a timer tick, so checking whether jiffies should be incremented inside
guest_exit_irqoff() was only slowing down the common case, where jiffies
remained the same.  (Checking if jiffies should be incremented is quite
slow by itself.)

>> > I assume there's a good reason why we call guest_enter() and
>> > guest_exit() in the hot path on every KVM architecture?
>>
>> I consider myself biased when it comes to jiffies, so no judgement. :)
>>
>> From what I see, the mode switch is used only for statistics.
>> The original series is
>>
>>   5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
>>
>> It didn't introduce the overhead with interrupt window and it didn't
>> count host kernel irq time as user time, so it was better at that time.
>
> Yes, but it was based on cputime_to... functions, which I understand use
> ktime, which on systems running KVM will most often read the clocksource
> directly from the hardware, and that was then optimized later to just
> use jiffies to avoid the clocksource read because jiffies is already in
> memory and adjusted to the granularity we need, so in some sense an
> improvement, only it doesn't work if you don't update jiffies when
> needed.

True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
granularity was better if there were other sources of accounting than
just timer ticks.

And I noticed another funny feature: the original intention was that if
the guest uses hardware acceleration, then the whole timeslice gets
accounted to guest/user time, because kvm_guest_exit() was not supposed
to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105

This somehow got mangled when merging and later there was a fix that
introduced the current behavior, which might be slightly more accurate:
0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")
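
To make the failure mode discussed in this thread concrete, here is a toy
userspace model in C.  It is not kernel code: every toy_* name is made up
for illustration, and it assumes that accounting at a mode switch simply
charges the jiffies delta since the previous switch to the mode being
left.  Under that assumption it shows why handling the pending timer
interrupt before the guest-exit accounting matters.

```c
#include <assert.h>

/* Toy model only; none of these names exist in the kernel. */
enum toy_mode { TOY_SYSTEM = 0, TOY_GUEST = 1 };

struct toy_cpu {
    unsigned long jiffies;   /* advanced only when a tick is handled */
    unsigned long pending;   /* ticks that fired while host IRQs were masked */
    unsigned long snapshot;  /* value of jiffies at the last mode switch */
    unsigned long acct[2];   /* ticks charged to system / guest */
    enum toy_mode mode;
};

/* Deliver the masked timer interrupts: this is what bumps jiffies. */
static void toy_handle_pending_ticks(struct toy_cpu *c)
{
    c->jiffies += c->pending;
    c->pending = 0;
}

/* Charge the jiffies delta to the mode we are leaving (assumed scheme). */
static void toy_switch_mode(struct toy_cpu *c, enum toy_mode next)
{
    assert(c->jiffies >= c->snapshot);
    c->acct[c->mode] += c->jiffies - c->snapshot;
    c->snapshot = c->jiffies;
    c->mode = next;
}

/*
 * Run the guest for 'ticks' timer periods with host IRQs masked.
 * irq_before_exit != 0 models handling the interrupt before the exit
 * accounting; 0 models doing the exit accounting with stale jiffies.
 */
static void toy_run_guest(struct toy_cpu *c, unsigned long ticks,
                          int irq_before_exit)
{
    toy_switch_mode(c, TOY_GUEST);      /* guest enter */
    c->pending += ticks;                /* timer fires, not yet handled */
    if (irq_before_exit)
        toy_handle_pending_ticks(c);
    toy_switch_mode(c, TOY_SYSTEM);     /* guest exit accounting */
    toy_handle_pending_ticks(c);        /* IRQs enabled again */
    toy_switch_mode(c, TOY_SYSTEM);     /* next accounting event flushes */
}
```

Starting from a zero-initialized struct toy_cpu, toy_run_guest(&c, 10, 0)
charges all 10 ticks to system and none to guest, while
toy_run_guest(&c, 10, 1) charges all 10 to guest.  The real code paths are
more involved, so treat this only as a picture of the ordering problem.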