Re: kvmclock doesn't work, help?

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Thu, 17 Dec 2015 17:08:51 -0200

On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> >> >>
> >> >>
> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >>> >         RAW TSC                 NTP corrected TSC
> >> >>> > t0      10                      10
> >> >>> > t1      20                      19.99
> >> >>> > t2      30                      29.98
> >> >>> > t3      40                      39.97
> >> >>> > t4      50                      49.96

(1)
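
To put numbers on (1), here is a minimal sketch in Python. It assumes the table's units are arbitrary ticks and that the corrected clock keeps losing 0.01 per 10 raw ticks, as in the rows above. The function names are made up for illustration and are not kernel symbols.

```python
from fractions import Fraction

# Assumption: rates read off the table above. The NTP corrected clock
# advances 9.99 units for every 10 units of raw TSC.
RAW_STEP = Fraction(10)
CORRECTED_STEP = Fraction(999, 100)
START = Fraction(10)  # both clocks read 10 at t0


def raw_clock(n):
    """Raw TSC reading at sample t_n."""
    return START + n * RAW_STEP


def corrected_clock(n):
    """NTP corrected reading at sample t_n."""
    return START + n * CORRECTED_STEP


def switch_jump(n):
    """Step seen by a reader that followed the raw clock up to t_n and
    then switches to the corrected clock at that same instant.
    A negative value means time appears to go backwards."""
    return corrected_clock(n) - raw_clock(n)


if __name__ == "__main__":
    for n in range(5):
        print(n, float(raw_clock(n)), float(corrected_clock(n)),
              float(switch_jump(n)))
```

The longer the reader stays on the raw clock before switching, the larger the backwards step becomes.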

> >> >>> >
> >> >>> > ...
> >> >>> >
> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >>> > you can see what will happen.
> >> >>>
> >> >>> Sure, but why would you ever switch from one to the other?
> >> >>
> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> >> resume, the TSC certainly increases at the same rate as before, but the
> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> than the guest kvmclock.
> >> >
> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >
> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> > backwards.
> >> >
> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> > of kvmclock, and this shouldn't matter.
> >> >
> >> > I still feel like I'm missing something very basic here.
> >> >
> >>
> >> OK, I think I get it.
> >>
> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> correction to the guest.  If it did, indeed, propagate the correction
> >> then, after resume, the host's new system_time would match the guest's
> >> idea of it (after accounting for the guest's long nap), and I don't
> >> think there would be a problem.
> >> That being said, I can't find the code in the masterclock stuff that
> >> would actually do this.
> >
> > Guest clock is maintained by guest timekeeping code, which does:
> >
> > timer_interrupt()
> >         offset = read clocksource since last timer interrupt
> >         accumulate_to_systemclock(offset)
> >
> > The frequency correction of NTP in the host can be applied to
> > kvmclock, which will be visible to the guest
> > at "read clocksource since last timer interrupt"
> > (kvmclock_clocksource_read function).
> 
> pvclock_clocksource_read?  That seems to do the same thing as all the
> other clocksource access functions.
> 
> >
> > This does not mean that the NTP correction in the host is propagated
> > to the guest's system clock directly.
> >
> > (For example, the guest can run NTP which is free to do further
> > adjustments at "accumulate_to_systemclock(offset)" time).
> 
> Of course.  But I expected that, in the absence of NTP on the guest,
> the guest would track the host's *corrected* time.
> 
> >
> >> If, on the other hand, the host's NTP correction is not supposed to
> >> propagate to the guest,
> >
> > This is optional. There is a module option to control this, in fact.
> >
> > It's nice to have, because then you can execute a guest without NTP
> > (say without network connection), and have a kvmclock (kvmclock is a
> > clocksource, not a guest system clock) which is NTP corrected.
> 
> Can you point to how this works?  I found kvm_guest_time_update, which
> is called under circumstances that I haven't untangled.  I can't
> really tell what it's trying to do.

Documentation/virtual/kvm/timekeeping.txt.

> In any case, this still seems much more convoluted than it has to be.
> In the case in which the host has a stable TSC (tsc is selected in the
> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> the time on the last few generations of CPUs, then the core
> timekeeping code is already exposing a linear function that's supposed
> to be used for monotonic, cpu-local access to a corrected nanosecond
> counter.  It's even in pretty much exactly the right form to pass
> through to the guest via pvclock in the gtod data.  Why doesn't KVM
> pass it through verbatim, updated in real time?  Is there some legacy
> reason that KVM must apply its own corrections and has to jump through
> hoops to pause vcpus when updating those vcpu's copies of the pvclock
> data?

Read the comment in x86.c which starts with
" *
 * Assuming a stable TSC across physical CPUS, and a stable TSC
 * across virtual CPUs, the following condition is possible.
 * Each numbered line represents an event visible to both
 * CPUs at the next numbered event.
"

> >> then shouldn't KVM just update system_time on
> >> resume to whatever the guest would think it had (which I think would
> >> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
> >> shifted by some per-guest constant offset).
> >>
> >> --Andy
> >
> > Sure, you could add a correction to compensate and make sure
> > the guest clock does not see time go backwards.
> >
> 
> Could you help do that?  You understand the code far better than I do.

Sure, you have to save the guest's view of time (system_time + scaled tsc
read) when suspending, and add an offset to get_kernel_ns() to
compensate for the effect of (1) when resuming.
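
As a rough illustration of the idea (a toy model in Python, not the KVM code): system_time, the TSC and get_kernel_ns() come from the discussion above, while the class, its methods and the numbers are hypothetical. It assumes 1 TSC tick == 1 ns, that the TSC restarts from 0 on resume, and it ignores the time spent asleep.

```python
class ToyHost:
    """Toy model of the values the host publishes to the guest."""

    def __init__(self):
        self.kernel_ns = 0       # stand-in for get_kernel_ns() (NTP corrected)
        self.tsc = 0             # raw TSC, assumed 1 tick == 1 ns
        self.system_time = 0     # value published to the guest
        self.tsc_timestamp = 0   # TSC value when system_time was published
        self.offset = 0          # hypothetical compensation added on resume
        self.saved_guest_ns = None

    def run(self, tsc_ticks, kernel_ns):
        """Let time pass; the two clocks may advance at different rates."""
        self.tsc += tsc_ticks
        self.kernel_ns += kernel_ns

    def publish(self):
        self.system_time = self.kernel_ns + self.offset
        self.tsc_timestamp = self.tsc

    def guest_time(self):
        """Guest's view of time: system_time + scaled tsc read."""
        return self.system_time + (self.tsc - self.tsc_timestamp)

    def suspend(self):
        self.saved_guest_ns = self.guest_time()

    def resume(self, compensate):
        self.tsc = 0  # assumption: raw TSC restarts from 0
        if compensate:
            # Offset chosen so the published time continues from the
            # guest's last view instead of the slower corrected clock.
            self.offset = self.saved_guest_ns - self.kernel_ns
        self.publish()


def time_across_suspend(compensate):
    """Return (guest time before suspend, guest time right after resume)."""
    host = ToyHost()
    host.publish()
    host.run(tsc_ticks=1_000_000, kernel_ns=999_000)
    host.suspend()
    before = host.guest_time()
    host.resume(compensate)
    return before, host.guest_time()


if __name__ == "__main__":
    print("uncompensated:", time_across_suspend(False))
    print("compensated:  ", time_across_suspend(True))
```

Without the offset the guest's view drops back to the corrected value on resume; with it, the guest continues from where it left off.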

Does that make sense? 

> As it stands, it simply doesn't work on any system that suspends and
> resumes (unless maybe the system has the upcoming Intel ART feature,
> and I have no clue when that'll show up).
> 
> --Andy