Re: kvmclock doesn't work, help?

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Thu, 17 Dec 2015 08:33:17 -0800

On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>> >>
>> >>
>> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >>> >         RAW TSC                 NTP corrected TSC
>> >>> > t0      10                      10
>> >>> > t1      20                      19.99
>> >>> > t2      30                      29.98
>> >>> > t3      40                      39.97
>> >>> > t4      50                      49.96
>> >>> >
>> >>> > ...
>> >>> >
>> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >>> > you can see what will happen.
>> >>>
>> >>> Sure, but why would you ever switch from one to the other?
>> >>
>> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> than the guest kvmclock.
>> >
>> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >
>> > If it's the host's, then wouldn't systemtime be reset after resume to
>> > the NTP corrected value?  If so, the guest wouldn't see time go
>> > backwards.
>> >
>> > If it's the guest's, then the guest's NTP correction is applied on top
>> > of kvmclock, and this shouldn't matter.
>> >
>> > I still feel like I'm missing something very basic here.
>> >
>>
>> OK, I think I get it.
>>
>> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> correction to the guest.  If it did, indeed, propagate the correction
>> then, after resume, the host's new system_time would match the guest's
>> idea of it (after accounting for the guest's long nap), and I don't
>> think there would be a problem.
>> That being said, I can't find the code in the masterclock stuff that
>> would actually do this.
>
> Guest clock is maintained by guest timekeeping code, which does:
>
> timer_interrupt()
>         offset = read clocksource since last timer interrupt
>         accumulate_to_systemclock(offset)
>
> The frequency correction of NTP in the host can be applied to
> kvmclock, which will be visible to the guest
> at "read clocksource since last timer interrupt"
> (kvmclock_clocksource_read function).

pvclock_clocksource_read?  That seems to do the same thing as all the
other clocksource access functions.

>
> This does not mean that the NTP correction in the host is propagated
> to the guests system clock directly.
>
> (For example, the guest can run NTP which is free to do further
> adjustments at "accumulate_to_systemclock(offset)" time).

Of course.  But I expected that, in the absence of NTP on the guest,
that the guest would track the host's *corrected* time.

>
>> If, on the other hand, the host's NTP correction is not supposed to
>> propagate to the guest,
>
> This is optional. There is a module option to control this, in fact.
>
> Its nice to have, because then you can execute a guest without NTP
> (say without network connection), and have a kvmclock (kvmclock is a
> clocksource, not a guest system clock) which is NTP corrected.

Can you point to how this works?  I found kvm_guest_time_update, whch
is called under circumstances that I haven't untangled.  I can't
really tell what it's trying to do.

In any case, this still seems much more convoluted than it has to be.
In the case in which the host has a stable TSC (tsc is selected in the
core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
the time on the last few generations of CPUs, then the core
timekeeping code is already exposing a linear function that's supposed
to be used for monotonic, cpu-local access to a corrected nanosecond
counter.  It's even in pretty much exactly the right form to pass
through to the guest via pvclock in the gtod data.  Why doesn't KVM
pass it through verbatim, updated in real time?  Is there some legacy
reason that KVM must apply its own corrections and has to jump through
hoops to pause vcpus when updating those vcpu's copies of the pvclock
data?

>
>> then shouldn't KVM just update system_time on
>> resume to whatever the guest would think it had (which I think would
>> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
>> shifted by some per-guest constant offset).
>>
>> --Andy
>
> Sure, you could add a correction to compensate and make sure
> the guest clock does not see time backwards.
>

Could you help do that?  You understand the code far better than I do.

As it stands, it simply doesn't work on any system that suspends and
resumes (unless maybe the system has the upcoming Intel ART feature,
and I have no clue when that'll show up).

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html