Re: kvmclock doesn't work, help?

On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
>> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>> >> >>
>> >> >>
>> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >> >>> >         RAW TSC                 NTP corrected TSC
>> >> >>> > t0      10                      10
>> >> >>> > t1      20                      19.99
>> >> >>> > t2      30                      29.98
>> >> >>> > t3      40                      39.97
>> >> >>> > t4      50                      49.96
>
> (1)
>
>> >> >>> >
>> >> >>> > ...
>> >> >>> >
>> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> >> >>> > you can see what will happen.
>> >> >>>
>> >> >>> Sure, but why would you ever switch from one to the other?
>> >> >>
>> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> >> >> resume, the TSC certainly increases at the same rate as before, but the
>> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
>> >> >> than the guest kvmclock.
>> >> >
>> >> > Wait, are we talking about the host's NTP or the guest's NTP?
>> >> >
>> >> > If it's the host's, then wouldn't systemtime be reset after resume to
>> >> > the NTP corrected value?  If so, the guest wouldn't see time go
>> >> > backwards.
>> >> >
>> >> > If it's the guest's, then the guest's NTP correction is applied on top
>> >> > of kvmclock, and this shouldn't matter.
>> >> >
>> >> > I still feel like I'm missing something very basic here.
>> >> >
>> >>
>> >> OK, I think I get it.
>> >>
>> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
>> >> correction to the guest.  If it did, indeed, propagate the correction
>> >> then, after resume, the host's new system_time would match the guest's
>> >> idea of it (after accounting for the guest's long nap), and I don't
>> >> think there would be a problem.
>> >> That being said, I can't find the code in the masterclock stuff that
>> >> would actually do this.
>> >
>> > Guest clock is maintained by guest timekeeping code, which does:
>> >
>> > timer_interrupt()
>> >         offset = read clocksource since last timer interrupt
>> >         accumulate_to_systemclock(offset)
>> >
>> > The frequency correction of NTP in the host can be applied to
>> > kvmclock, which will be visible to the guest
>> > at "read clocksource since last timer interrupt"
>> > (kvmclock_clocksource_read function).
>>
>> pvclock_clocksource_read?  That seems to do the same thing as all the
>> other clocksource access functions.
>>
>> >
>> > This does not mean that the NTP correction in the host is propagated
>> > to the guest's system clock directly.
>> >
>> > (For example, the guest can run NTP which is free to do further
>> > adjustments at "accumulate_to_systemclock(offset)" time).
>>
>> Of course.  But I expected that, in the absence of NTP on the guest,
>> the guest would track the host's *corrected* time.
>>
>> >
>> >> If, on the other hand, the host's NTP correction is not supposed to
>> >> propagate to the guest,
>> >
>> > This is optional. There is a module option to control this, in fact.
>> >
>> > It's nice to have, because then you can execute a guest without NTP
>> > (say without network connection), and have a kvmclock (kvmclock is a
>> > clocksource, not a guest system clock) which is NTP corrected.
>>
>> Can you point to how this works?  I found kvm_guest_time_update, which
>> is called under circumstances that I haven't untangled.  I can't
>> really tell what it's trying to do.
>
> Documentation/virtual/kvm/timekeeping.txt.
>

That document is really long.  I skimmed it and found nothing.

>> In any case, this still seems much more convoluted than it has to be.
>> In the case in which the host has a stable TSC (tsc is selected in the
>> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
>> the time on the last few generations of CPUs, then the core
>> timekeeping code is already exposing a linear function that's supposed
>> to be used for monotonic, cpu-local access to a corrected nanosecond
>> counter.  It's even in pretty much exactly the right form to pass
>> through to the guest via pvclock in the gtod data.  Why doesn't KVM
>> pass it through verbatim, updated in real time?  Is there some legacy
>> reason that KVM must apply its own corrections and has to jump through
>> hoops to pause vcpus when updating those vcpu's copies of the pvclock
>> data?
>
> Read the comment in x86.c which starts with
> " *
>  * Assuming a stable TSC across physical CPUS, and a stable TSC
>  * across virtual CPUs, the following condition is possible.
>  * Each numbered line represents an event visible to both
>  * CPUs at the next numbered event.
> "

A couple things:

1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))

but that's wrong, I think.  rdtsc is a function, not a number.  Shouldn't it be:

timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))

which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
N + (rdtsc1 - rdtsc0)?

That doesn't change the conclusion.
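
A quick way to sanity-check that algebra is a small Python sketch.  It
treats one TSC tick as one nanosecond, as the comment implicitly does, and
the function names are hypothetical helpers, not kernel identifiers.

```python
def clock_reading(timespec, tsc_base, rdtsc):
    """Time reported from a (timespec, tsc_base) snapshot at TSC value rdtsc."""
    return timespec + (rdtsc - tsc_base)


def reading_advances(timespec0, tsc0, n, m, rdtsc0, rdtsc1):
    """True if a read against the old snapshot (taken at rdtsc0) is strictly
    less than a read against the new snapshot (timespec0 + N, tsc0 + M)
    taken at rdtsc1."""
    old = clock_reading(timespec0, tsc0, rdtsc0)
    new = clock_reading(timespec0 + n, tsc0 + m, rdtsc1)
    return old < new


def simplified_condition(n, m, rdtsc0, rdtsc1):
    """The reduced form of the same inequality: M < N + (rdtsc1 - rdtsc0)."""
    return m < n + (rdtsc1 - rdtsc0)
```

timespec0 and tsc0 cancel on both sides, so the two functions should agree
for any inputs; only N, M and the gap between the two rdtsc reads matter.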

In any case, I'm not arguing that the concept of a master copy is
unnecessary; I'm arguing that the implementation, the calculations,
and the machinations in the code are all very, very complicated.  All
that should be needed is to keep all of the vcpu pvti copies the same
and to make sure that you can't ever have one vcpu see a new copy and
then another vcpu see an old copy.  You can do that by brute-force
freezing all vcpus on an update (what happens now), or you could do it
by just writing all of the copies at the same time from the same host
cpu *while other vcpus are still running*.

For the best outcome, you could offer a pvclock protocol v3 in which
there is literally just one pvti copy shared by all vcpus.
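
To make the ordering hazard concrete, here is a rough Python sketch.  It
models a pvti copy with the pvclock fields system_time, tsc_timestamp,
tsc_to_system_mul and tsc_shift, and scales the TSC delta the way I
understand pvclock to do it (shift, then 32.32 fixed-point multiply).  The
numbers are invented, and a 1 GHz TSC is assumed so that one tick is one
nanosecond.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pvti:
    system_time: int        # ns at the moment the snapshot was taken
    tsc_timestamp: int      # TSC value at that same moment
    tsc_to_system_mul: int  # 32.32 fixed-point multiplier
    tsc_shift: int          # shift applied to the TSC delta before scaling


def pvclock_read(pvti, tsc):
    """Nanoseconds computed from one pvti copy at the given TSC value."""
    delta = tsc - pvti.tsc_timestamp
    if pvti.tsc_shift >= 0:
        delta <<= pvti.tsc_shift
    else:
        delta >>= -pvti.tsc_shift
    return pvti.system_time + ((delta * pvti.tsc_to_system_mul) >> 32)


ONE_NS_PER_TICK = 1 << 32  # assumed 1 GHz TSC

# Old copy, and a new copy in which the host stepped time forward a little.
old = Pvti(system_time=1000, tsc_timestamp=0,
           tsc_to_system_mul=ONE_NS_PER_TICK, tsc_shift=0)
new = Pvti(system_time=3010, tsc_timestamp=2000,
           tsc_to_system_mul=ONE_NS_PER_TICK, tsc_shift=0)

t_a = pvclock_read(new, 2001)       # vcpu A already sees the new copy
t_b = pvclock_read(old, 2002)       # vcpu B reads later, from a stale copy
t_shared = pvclock_read(new, 2002)  # same read if both vcpus share one copy
```

Here t_b comes out smaller than t_a even though vcpu B read later, which
is exactly the new-then-old sequence that has to be prevented.  With a
single shared copy the later read cannot land behind the earlier one.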

>
>> >> then shouldn't KVM just update system_time on
>> >> resume to whatever the guest would think it had (which I think would
>> >> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
>> >> shifted by some per-guest constant offset)?
>> >>
>> >> --Andy
>> >
>> > Sure, you could add a correction to compensate and make sure
>> > the guest clock does not see time backwards.
>> >
>>
>> Could you help do that?  You understand the code far better than I do.
>
> Sure, you have to save the guest's view of time (system_time + scaled tsc
> read) when suspending, and add an offset to get_kernel_ns() to
> compensate the effect of (1) when resuming.
>
> Does that make sense?

I think so.
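
For the record, this is how I read that suggestion, as a minimal Python
sketch and not actual KVM code.  The helper names are hypothetical, and it
assumes the guest's view and the host's get_kernel_ns() value are sampled
at the same instant when suspending.  The numbers are the t4 row of (1)
scaled by 1000.

```python
def resume_offset(guest_view_at_suspend, kernel_ns_at_suspend):
    """Gap between what the guest has already observed (system_time +
    scaled tsc) and the host's corrected clock, sampled at suspend."""
    return guest_view_at_suspend - kernel_ns_at_suspend


def system_time_on_resume(kernel_ns_at_resume, offset):
    """New system_time for the guest: host clock plus the saved gap."""
    return kernel_ns_at_resume + offset


guest_view = 50_000    # raw-TSC-based time the guest saw at suspend
host_suspend = 49_960  # host's NTP-corrected time at the same instant
host_resume = 49_970   # invented: host clock at resume, 10 units later

offset = resume_offset(guest_view, host_suspend)
uncorrected = host_resume                             # behind the guest
corrected = system_time_on_resume(host_resume, offset)
```

Without the offset the guest would be handed 49_970 after having already
seen 50_000; with it, the guest gets 50_010 and time keeps moving forward.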

--Andy

>
>> As it stands, it simply doesn't work on any system that suspends and
>> resumes (unless maybe the system has the upcoming Intel ART feature,
>> and I have no clue when that'll show up).
>>
>> --Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC


