Hi Radim,

[Stephen, I am cc'ing you because we are managing to confuse ourselves
here and some of my ramblings are basically misrepresented bits of
information I got from you.]

On Wed, Mar 15, 2017 at 04:57:24PM +0100, Radim Krčmář wrote:
> 2017-03-15 09:43+0100, Christoffer Dall:
> > On Tue, Mar 14, 2017 at 09:27:22PM +0100, Radim Krčmář wrote:
> >> 2017-03-14 19:39+0100, Christoffer Dall:
> >> > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
> >> >> 2017-03-14 09:26+0100, Christoffer Dall:
> >> >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> >> >> >> 2017-03-08 02:57-0800, Christoffer Dall:
> >> >> >> > Hi Paolo,
> >> >> >> >
> >> >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> >> >> >> > before enabling interrupts when coming back from the guest.
> >> >> >> >
> >> >> >> > Unfortunately, this appears to mess up my view of CPU usage using
> >> >> >> > something like htop on the host, because it appears all time is spent
> >> >> >> > inside the kernel.
> >> >> >> >
> >> >> >> > From my analysis, I think this is because we never handle any interrupts
> >> >> >> > before enabling interrupts, where the x86 code does its
> >> >> >> > handle_external_intr, and the result on ARM is that we never increment
> >> >> >> > jiffies before doing the vtime accounting.
> >> >> >>
> >> >> >> (Hm, the counting might be broken on nohz_full then.)
> >> >> >>
> >> >> >
> >> >> > Don't you still have a scheduler tick even with nohz_full and something
> >> >> > that will eventually update jiffies then?
> >> >>
> >> >> Probably, I don't understand jiffies accounting too well and didn't see
> >> >> anything that would bump the jiffies in or before guest_exit_irqoff().
> >> >
> >> > As far as I understand, from my very very short time of looking at the
> >> > timer code, jiffies are updated on every tick, which can be caused by a
> >> > number of events, including *any* interrupt handler (coming from idle
> >> > state), soft timers, timer interrupts, and possibly other things.
> >>
> >> Yes, I was thinking that entering/exiting user mode should trigger it as
> >> well, in order to correctly account for time spent there, but couldn't
> >> find it ...
> >
> > Isn't it based on taking the timer interrupt when running in userspace
> > as opposed to when running in the kernel?
>
> The timer interrupt usually accounts it, but there is no timer interrupt
> when tickless, so there has to be accounting somewhere on the border ...

I think there are timer interrupts; they just don't fire at a regular
interval, but only when a timer is due, such as the time slice expiring
as decided by the scheduler.

> __context_tracking_enter/exit(CONTEXT_USER) calls
> vtime_user_enter/exit() to do accounting in that case, but expects that
> jiffies were updated by someone else.

I'm assuming user/kernel accounting works today with tickless, so surely
this is handled somewhere.  There are a bunch of calling paths to
tick_do_update_jiffies64() in kernel/time/tick-sched.c related to nohz,
so my best guess is that it's updated at sched_clock granularity and
that this is sufficient for profiling, because you basically care about
where a particular time slice was spent, I suppose.

> >> The case I was wondering about is if the kernel spent e.g. 10 jiffies in
> >> guest mode and then exited on mmio -- no interrupt in the host, and
> >> guest_exit_irqoff() would flip accounting over to system time.
> >> Can those 10 jiffies get accounted to system (not guest) time?
> >
> > Yes, I think the key is whether you end up taking a timer interrupt
> > before or after switching PF_VCPU.
> > So you can spend X jiffies in the guest, come back, change PF_VCPU
> > (timer still hasn't expired), and then the timer expires immediately
> > afterwards, and the whole block of jiffies that are incremented as a
> > result of the timer gets accounted as kernel time.
>
> Wouldn't that be the result of a bug?

I don't think anyone cares about this level of granularity, really; the
case I mentioned above is possible, albeit unlikely.

> (The timer period is 1 jiffy, so the VM would get kicked before getting
> to 2 jiffies in guest mode, and spending so much time in between seems
> like a bug.
> 10 jiffies could be spent in a guest while under nohz_full, but there is
> still at least one CPU with an HZ timer that updates the global jiffies,
> so I assume that guest_exit() should see at least 9 jiffies.)
>
> > (Note that jiffies on clocksourced architectures is an arbitrary number,
> > which somehow scales according to the clocksource frequency, depending
> > on CONFIG_HZ, so a single timer interrupt can increment jiffies by
> > more than 1.)
> >
> >> Accounting 1 jiffy wrong is normal, as we expect that the distribution of
> >> guest/kernel time in the jiffy is going to be approximated over a longer
> >> sampling period, but if we could account multiple jiffies wrong, this
> >> expectation is very hard to defend.
> >>
> >
> > I think it's a best effort, and we expect inaccuracies over much more
> > than a single jiffy; see above.
> >
> >> >> >> > So my current idea is to increment jiffies according to the clocksource
> >> >> >> > before calling guest_exit_irqoff, but this would require some main
> >> >> >> > clocksource infrastructure changes.
> >> >> >>
> >> >> >> This seems similar to calling the function from the timer interrupt.
> >> >> >> The timer interrupt would be delivered after that and only wasted time,
> >> >> >> so it might actually be slower than just delivering it before ...
> >> >> >
> >> >> > That's assuming that the timer interrupt hits at every exit.
> >> > I don't think that's the case, but I should measure it.
> >> >>
> >> >> There cannot be fewer vm exits and I think there are far more vm exits,
> >> >> but if there was no interrupt, then jiffies shouldn't rise and we would
> >> >> get the same result as with plain guest_exit_irqoff().
> >> >>
> >> >
> >> > That's true if you're guaranteed to take the timer interrupts that
> >> > happen while running the guest before hitting guest_exit_irqoff(), so
> >> > that you eventually count *some* time for the guest.  In the arm64
> >> > case, if we just do guest_exit_irqoff(), we *never* account any time
> >> > to the guest.
> >>
> >> Yes.  I assumed that if jiffies should be incremented, then you also get
> >> a timer tick,
> >
> > I think jiffies can be incremented in a number of cases, and as I
> > understand it, when the kernel folks talk about "the tick" that doesn't
> > necessarily mean that your scheduler timer fires an interrupt, but I
> > could be wrong.
>
> Good point.  do_timer() (the only place to update jiffies, I hope) is

As I understand it, the whole jiffies thing is circular, so the update
can happen in other paths than do_timer().

> called from:
>
>   tick_periodic()
>   xtime_update()
>   tick_do_update_jiffies64()
>
> The first two seem to coincide with a specific timer interrupt.  The
> last one is called from:
>
>   tick_sched_do_timer()      [timer interrupt]
>   tick_nohz_update_jiffies() [any interrupt]
>   do_idle()                  [not interrupt, but also never hit after VM exit]
>
> Looks like jiffies are mostly timer interrupt based, but if you have
> nohz, then there are also other interrupts and switches from idle.

Yes, exactly.  Clearly, none of us are timekeeping experts...

> >> so checking whether jiffies should be incremented inside
> >> guest_exit_irqoff() was only slowing down the common case, where jiffies
> >> remained the same.
> >
> > My original suggestion was to not use jiffies at all, but use ktime, on
> > architectures that don't do handle_external_intr() type things, which
> > would result in a clocksource read on every exit, which on ARM I think
> > is faster than our enable/disable interrupt dance we do at the moment.
>
> I didn't catch that, sorry.
>
> > But of course, I'd have to measure that.
>
> The accounting would need to change to ktime even for userspace switches

Good point.

> to avoid double-accounting time and I think that ktime was abolished
> because it was too slow ...

Definitely a lot of benchmarking needed, then.

> >> (Checking if jiffies should be incremented is quite slow by itself.)
> >>
> >
> > Really?  I think that depends on the architecture.  Reading the counter
> > on ARM is supposed to be cheap, I think.
>
> True, and it also depends on what slow means. :)
>
> IIRC, it takes around 50 cycles to read TSC-based ktime on x86, which
> would be roughly 2% of a current VM exit/entry loop.
>
> >> >> > I assume there's a good reason why we call guest_enter() and
> >> >> > guest_exit() in the hot path on every KVM architecture?
> >> >>
> >> >> I consider myself biased when it comes to jiffies, so no judgement. :)
> >> >>
> >> >> From what I see, the mode switch is used only for statistics.
> >> >> The original series is
> >> >>
> >> >>   5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
> >> >>
> >> >> It didn't introduce the overhead with interrupt window and it didn't
> >> >> count host kernel irq time as user time, so it was better at that time.
> >> >
> >> > Yes, but it was based on cputime_to...
> >> > functions, which I understand use
> >> > ktime, which on systems running KVM will most often read the clocksource
> >> > directly from the hardware, and that was then optimized later to just
> >> > use jiffies to avoid the clocksource read, because jiffies is already in
> >> > memory and adjusted to the granularity we need, so in some sense an
> >> > improvement, only it doesn't work if you don't update jiffies when
> >> > needed.
> >>
> >> True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
> >> granularity was better if there were other sources of accounting than
> >> just timer ticks.
> >
> > I think the granularity was better, yes, but I wonder if that improved
> > accuracy was ever visible to the user, given that the CPU time may be
> > counted in jiffies?
>
> I think the time for statistics used to be in cycles before it switched
> to nanoseconds, so the user could have seen some difference.
>
> >> And I noticed another funny feature: the original intention was that if
> >> the guest uses hardware acceleration, then the whole timeslice gets
> >> accounted to guest/user time, because kvm_guest_exit() was not supposed
> >> to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105
> >
> > I don't see how this works on a current kernel though, because I can't
> > find any path beyond kvm_guest_exit() which clears the flag at the
> > moment.
>
> Right, the flag used to be cleared in account_system_time().
>
> >> This somehow got mangled when merging and later there was a fix that
> >> introduced the current behavior, which might be slightly more accurate:
> >> 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")
> >
> > So Laurent was confused as well?
>
> It looks like Laurent understood that you need the timer tick before
> clearing the flag.  No idea how it really happened.
>
> > It seems he needed to add that before you guys added
> > handle_external_intr()?
>
> Yes.
In any case, it would be good for us to get some answers to the stuff
above, but I'll also try to prototype the load/put accounting of time.

Thanks,
-Christoffer