Hi Radim,

[Stephen, I am cc'ing you because we are managing to confuse ourselves
here and some of my ramblings are basically misrepresented bits of
information I got from you.]

On Wed, Mar 15, 2017 at 04:57:24PM +0100, Radim Krčmář wrote:
> 2017-03-15 09:43+0100, Christoffer Dall:
> > On Tue, Mar 14, 2017 at 09:27:22PM +0100, Radim Krčmář wrote:
> >> 2017-03-14 19:39+0100, Christoffer Dall:
> >> > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
> >> >> 2017-03-14 09:26+0100, Christoffer Dall:
> >> >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> >> >> >> 2017-03-08 02:57-0800, Christoffer Dall:
> >> >> >> > Hi Paolo,
> >> >> >> >
> >> >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> >> >> >> > before enabling interrupts when coming back from the guest.
> >> >> >> >
> >> >> >> > Unfortunately, this appears to mess up my view of CPU usage using
> >> >> >> > something like htop on the host, because it appears all time is spent
> >> >> >> > inside the kernel.
> >> >> >> >
> >> >> >> > From my analysis, I think this is because we never handle any interrupts
> >> >> >> > before enabling interrupts, where the x86 code does its
> >> >> >> > handle_external_intr, and the result on ARM is that we never increment
> >> >> >> > jiffies before doing the vtime accounting.
> >> >> >>
> >> >> >> (Hm, the counting might be broken on nohz_full then.)
> >> >> >>
> >> >> >
> >> >> > Don't you still have a scheduler tick even with nohz_full and something
> >> >> > that will eventually update jiffies then?
> >> >>
> >> >> Probably, I don't understand jiffies accounting too well and didn't see
> >> >> anything that would bump the jiffies in or before guest_exit_irqoff().
> >> >
> >> > As far as I understand, from my very very short time of looking at the
> >> > timer code, jiffies are updated on every tick, which can be caused by a
> >> > number of events, including *any* interrupt handler (coming from idle
> >> > state), soft timers, timer interrupts, and possibly other things.
> >>
> >> Yes, I was thinking that entering/exiting user mode should trigger it as
> >> well, in order to correctly account for time spent there, but couldn't
> >> find it ...
> >
> > Isn't it based on taking the timer interrupt when running in userspace
> > as opposed to when running in the kernel?
>
> The timer interrupt usually accounts it, but there is no timer interrupt
> when tickless, so there has to be accounting somewhere on the border ...

I think there are timer interrupts; they just don't fire at a regular
interval, but only when a timer is due, such as the time slice expiring
as decided by the scheduler.

> __context_tracking_enter/exit(CONTEXT_USER) calls
> vtime_user_enter/exit() to do accounting in that case, but expects that
> jiffies were updated by someone else.

I'm assuming user/kernel accounting works today with tickless, so surely
this is handled somewhere.  There are a bunch of calling paths to
tick_do_update_jiffies64() in kernel/time/tick-sched.c related to nohz,
so my best guess is that it's updated at sched_clock granularity and
that this is sufficient for profiling, because you basically care about
where a particular time slice was spent, I suppose.

> >> The case I was wondering about is if the kernel spent e.g. 10 jiffies in
> >> guest mode and then exited on mmio -- no interrupt in the host, and
> >> guest_exit_irqoff() would flip accounting over to system time.
> >> Can those 10 jiffies get accounted to system (not guest) time?
> >
> > Yes, I think the key is whether you end up taking a timer interrupt
> > before or after switching PF_VCPU.
> > So you can spend X jiffies in the guest, come back, change PF_VCPU
> > (timer still hasn't expired), and then the timer expires immediately
> > afterwards, and the whole block of jiffies that are incremented as a
> > result of the timer gets accounted as kernel time.
>
> Wouldn't that be the result of a bug?

I don't think anyone cares about this level of granularity, really; the
case I mentioned above is possible, albeit unlikely.

> (The timer period is 1 jiffy, so the VM would get kicked before getting
> to 2 jiffies in guest mode, and spending so much time in between seems
> like a bug.
> 10 jiffies could be spent in a guest while under nohz_full, but there is
> still at least one CPU with an HZ timer that updates the global jiffies,
> so I assume that guest_exit() should see at least 9 jiffies.)
>
> > (Note that jiffies on clocksourced architectures is an arbitrary number,
> > which somehow scales according to the clocksource frequency, depending
> > on CONFIG_HZ, so a single timer interrupt can increment jiffies by
> > more than 1.)
> >
> >> Accounting 1 jiffy wrong is normal, as we expect that the distribution of
> >> guest/kernel time in the jiffy is going to be approximated over a longer
> >> sampling period, but if we could account multiple jiffies wrong, this
> >> expectation is very hard to defend.
> >>
> >
> > I think it's a best effort, and we expect inaccuracies over much more
> > than a single jiffy; see above.
> >
> >> >> >> > So my current idea is to increment jiffies according to the clocksource
> >> >> >> > before calling guest_exit_irqoff, but this would require some main
> >> >> >> > clocksource infrastructure changes.
> >> >> >>
> >> >> >> This seems similar to calling the function from the timer interrupt.
> >> >> >> The timer interrupt would be delivered after that and only wasted time,
> >> >> >> so it might actually be slower than just delivering it before ...
> >> >> >
> >> >> > That's assuming that the timer interrupt hits at every exit.
> >> > I don't think that's the case, but I should measure it.
> >> >>
> >> >> There cannot be fewer vm exits and I think there are far more vm exits,
> >> >> but if there was no interrupt, then jiffies shouldn't rise and we would
> >> >> get the same result as with plain guest_exit_irqoff().
> >> >>
> >> >
> >> > That's true if you're guaranteed to take the timer interrupts that
> >> > happen while running the guest before hitting guest_exit_irqoff(), so
> >> > that you eventually count *some* time for the guest.  In the arm64
> >> > case, if we just do guest_exit_irqoff(), we *never* account any time
> >> > to the guest.
> >>
> >> Yes.  I assumed that if jiffies should be incremented, then you also get
> >> a timer tick,
> >
> > I think jiffies can be incremented in a number of cases, and as I
> > understand it, when the kernel folks talk about "the tick" that doesn't
> > necessarily mean that your scheduler timer fires an interrupt, but I
> > could be wrong.
>
> Good point.  do_timer() (the only place to update jiffies, I hope) is

As I understand it, the whole jiffies thing is circular, so the update
can happen in other paths than do_timer().

> called from:
>
>   tick_periodic()
>   xtime_update()
>   tick_do_update_jiffies64()
>
> The first two seem to coincide with a specific timer interrupt.  The
> last one is called from:
>
>   tick_sched_do_timer()      [timer interrupt]
>   tick_nohz_update_jiffies() [any interrupt]
>   do_idle()                  [not interrupt, but also never hit after VM exit]
>
> Looks like jiffies are mostly timer interrupt based, but if you have
> nohz, then there are also other interrupts and switches from idle.

Yes, exactly.  Clearly, none of us are timekeeping experts...

> >> so checking whether jiffies should be incremented inside
> >> guest_exit_irqoff() was only slowing down the common case, where jiffies
> >> remained the same.
> >
> > My original suggestion was to not use jiffies at all, but use ktime, on
> > architectures that don't do handle_external_intr() type things, which
> > would result in a clocksource read on every exit, which on ARM I think
> > is faster than our enable/disable interrupt dance we do at the moment.
>
> I didn't catch that, sorry.
>
> > But of course, I'd have to measure that.
>
> The accounting would need to change to ktime even for userspace switches

Good point.

> to avoid double-accounting time and I think that ktime was abolished
> because it was too slow ...

Definitely a lot of benchmarking needed, then.

> >> (Checking if jiffies should be incremented is quite slow by itself.)
> >>
> >
> > Really?  I think that depends on the architecture.  Reading the counter
> > on ARM is supposed to be cheap, I think.
>
> True, and it also depends on what slow means. :)
>
> IIRC, it takes around 50 cycles to read TSC-based ktime on x86, which
> would be roughly 2% of a current VM exit/entry loop.
>
> >> >> > I assume there's a good reason why we call guest_enter() and
> >> >> > guest_exit() in the hot path on every KVM architecture?
> >> >>
> >> >> I consider myself biased when it comes to jiffies, so no judgement. :)
> >> >>
> >> >> From what I see, the mode switch is used only for statistics.
> >> >> The original series is
> >> >>
> >> >>   5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
> >> >>
> >> >> It didn't introduce the overhead with interrupt window and it didn't
> >> >> count host kernel irq time as user time, so it was better at that time.
> >> >
> >> > Yes, but it was based on cputime_to...
> >> > functions, which I understand use
> >> > ktime, which on systems running KVM will most often read the clocksource
> >> > directly from the hardware, and that was then optimized later to just
> >> > use jiffies to avoid the clocksource read, because jiffies is already in
> >> > memory and adjusted to the granularity we need, so in some sense an
> >> > improvement, only it doesn't work if you don't update jiffies when
> >> > needed.
> >>
> >> True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
> >> granularity was better if there were other sources of accounting than
> >> just timer ticks.
> >
> > I think the granularity was better, yes, but I wonder if that improved
> > accuracy was ever visible to the user, given that the CPU time may be
> > counted in jiffies?
>
> I think the time for statistics used to be in cycles before it switched
> to nanoseconds, so the user could have seen some difference.
>
> >> And I noticed another funny feature: the original intention was that if
> >> the guest uses hardware acceleration, then the whole timeslice gets
> >> accounted to guest/user time, because kvm_guest_exit() was not supposed
> >> to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105
> >
> > I don't see how this works on a current kernel though, because I can't
> > find any path beyond kvm_guest_exit() which clears the flag at the
> > moment.
>
> Right, the flag used to be cleared in account_system_time().
>
> >> This somehow got mangled when merging and later there was a fix that
> >> introduced the current behavior, which might be slightly more accurate:
> >> 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")
> >
> > So Laurent was confused as well?
>
> It looks like Laurent understood that you need the timer tick before
> clearing the flag.  No idea how it really happened.
>
> > It seems he needed to add that before you guys added
> > handle_external_intr()?
>
> Yes.
In any case, it would be good for us to get some answers to the stuff
above, but I'll also try to prototype the load/put accounting of time.

Thanks,
-Christoffer