On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> >> >> >>
> >> >> >>
> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >>> >         RAW TSC         NTP corrected TSC
> >> >> >>> > t0      10              10
> >> >> >>> > t1      20              19.99
> >> >> >>> > t2      30              29.98
> >> >> >>> > t3      40              39.97
> >> >> >>> > t4      50              49.96
> >
> > (1)
> >
> >> >> >>> >
> >> >> >>> > ...
> >> >> >>> >
> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >>> > you can see what will happen.
> >> >> >>>
> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >>
> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> >> than the guest kvmclock.
> >> >> >
> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >
> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> >> > the NTP corrected value?  If so, the guest wouldn't see time go
> >> >> > backwards.
> >> >> >
> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >
> >> >> > I still feel like I'm missing something very basic here.
> >> >> >
> >> >>
> >> >> OK, I think I get it.
> >> >>
> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> >> correction to the guest.
> >> >> If it did, indeed, propagate the correction
> >> >> then, after resume, the host's new system_time would match the guest's
> >> >> idea of it (after accounting for the guest's long nap), and I don't
> >> >> think there would be a problem.
> >> >> That being said, I can't find the code in the masterclock stuff that
> >> >> would actually do this.
> >> >
> >> > Guest clock is maintained by guest timekeeping code, which does:
> >> >
> >> > timer_interrupt()
> >> >       offset = read clocksource since last timer interrupt
> >> >       accumulate_to_systemclock(offset)
> >> >
> >> > The frequency correction of NTP in the host can be applied to
> >> > kvmclock, which will be visible to the guest
> >> > at "read clocksource since last timer interrupt"
> >> > (kvmclock_clocksource_read function).
> >>
> >> pvclock_clocksource_read?  That seems to do the same thing as all the
> >> other clocksource access functions.
> >>
> >> >
> >> > This does not mean that the NTP correction in the host is propagated
> >> > to the guests system clock directly.
> >> >
> >> > (For example, the guest can run NTP which is free to do further
> >> > adjustments at "accumulate_to_systemclock(offset)" time).
> >>
> >> Of course.  But I expected that, in the absence of NTP on the guest,
> >> that the guest would track the host's *corrected* time.
> >>
> >> >
> >> >> If, on the other hand, the host's NTP correction is not supposed to
> >> >> propagate to the guest,
> >> >
> >> > This is optional. There is a module option to control this, in fact.
> >> >
> >> > Its nice to have, because then you can execute a guest without NTP
> >> > (say without network connection), and have a kvmclock (kvmclock is a
> >> > clocksource, not a guest system clock) which is NTP corrected.
> >>
> >> Can you point to how this works?  I found kvm_guest_time_update, whch
> >> is called under circumstances that I haven't untangled.  I can't
> >> really tell what it's trying to do.
> >
> > Documentation/virtual/kvm/timekeeping.txt.
> >
>
> That document is really long.  I skimmed it and found nothing.

kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.

This happens:

    - when kvmclock is enabled or disabled by the guest.
    - periodically, to propagate NTP correction to the kvmclock clock.
    - when a guest vcpu switches between host pcpus whose TSCs are out of sync.
    - after migration.
    - after savevm/loadvm.

> >> In any case, this still seems much more convoluted than it has to be.
> >> In the case in which the host has a stable TSC (tsc is selected in the
> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> >> the time on the last few generations of CPUs, then the core
> >> timekeeping code is already exposing a linear function that's supposed
> >> to be used for monotonic, cpu-local access to a corrected nanosecond
> >> counter.  It's even in pretty much exactly the right form to pass
> >> through to the guest via pvclock in the gtod data.  Why doesn't KVM
> >> pass it through verbatim, updated in real time?  Is there some legacy
> >> reason that KVM must apply its own corrections and has to jump through
> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
> >> data?
> >
> > Read the comment on x86.c which starts with
> >
> > "      *
> >        * Assuming a stable TSC across physical CPUS, and a stable TSC
> >        * across virtual CPUs, the following condition is possible.
> >        * Each numbered line represents an event visible to both
> >        * CPUs at the next numbered event.
> > "
>
> A couple things:
>
> 1. That says:
>
> timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
>
> but that's wrong, I think.  rdtsc is a function, not a number.

View it as a number, then it's correct.

> Shouldn't it be:
>
> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))

Think of "rdtsc" as one number (rdtsc0 = rdtsc1).

> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
> N + (rdtsc1 - rdtsc0)?
>
> That doesn't change the conclusion.
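
To spell out the algebra behind "view it as a number": a short derivation,
assuming N and M are expressed in the same units so they can be compared
directly. R is my shorthand for the single rdtsc value; it does not appear in
the x86.c comment.

```latex
\begin{align*}
\text{timespec0} + (R - \text{tsc0})
  &< \text{timespec0} + N + \bigl(R - (\text{tsc0} + M)\bigr) \\
R - \text{tsc0} &< N + R - \text{tsc0} - M \\
0 &< N - M \\
M &< N
\end{align*}
```

With two distinct reads the same steps give M < N + (rdtsc1 - rdtsc0), which
is your form, and it reduces to M < N when rdtsc0 = rdtsc1.
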
>
> In any case, I'm not arguing that the concept of a master copy is
> unnecessary; I'm arguing that the implementation, the calculations,
> and the machinations in the code are all very, very complicated.  All
> that should be needed is to keep all of the vcpu pvti copies the same
> and to make sure that you can't ever have one vcpu see a new copy and
> then another vcpu see an old copy.

Yes, you can't allow two vcpus to see different copies of the pvti structure.

> You can do that by brute-force
> freezing all vcpus on an update (what happens now), or you could do it
> by just writing all of the copies at the same time from the same host
> cpu *while other vcpus are still running*.

Ok, can you do that and guarantee the first copy won't be seen by other
vcpus? I don't know how.

> For the best outcome, you could offer a pvclock protocol v3 in which
> there is literally just one pvti copy shared by all vcpus.

Sure, let's write a more formal proposal?

> >> >> then shouldn't KVM just update system_time on
> >> >> resume to whatever the guest would think it had (which I think would
> >> >> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
> >> >> shifted by some per-guest constant offset).
> >> >>
> >> >> --Andy
> >> >
> >> > Sure, you could add a correction to compensate and make sure
> >> > the guest clock does not see time backwards.
> >> >
> >>
> >> Could you help do that?  You understand the code far better than I do.
> >
> > Sure, you have to save the guests view of time (system_time + scaled tsc
> > read) when suspending, and add an offset to get_kernel_ns() to
> > compensate the effect of (1) when resuming.
> >
> > Does that make sense?
>
> I think so.
>
> --Andy
>
> >
> >> As it stands, it simply doesn't work on any system that suspends and
> >> resumes (unless maybe the system has the upcoming Intel ART feature,
> >> and I have no clue when that'll show up).
> >>
> >> --Andy
> >
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
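
To make the save-and-offset idea concrete, here is a minimal sketch in Python
(not KVM code). It models table (1) in integer hundredths of a time unit and
shows why switching from the raw clock to the NTP corrected clock steps
backwards unless an offset is added. The names raw_clock, ntp_clock and
resume_offset are hypothetical, and the model deliberately ignores that the
real raw TSC restarts from 0 across suspend.

```python
# Toy model of effect (1). All values are integer hundredths of a time unit
# to avoid floating-point noise. Every name here is illustrative only.

def raw_clock(tsc):
    """Guest's view before suspend: the uncorrected clock."""
    return tsc


def ntp_clock(tsc):
    """Host's view: runs 0.1% slower than raw, matching table (1)."""
    return 1000 + (tsc - 1000) * 999 // 1000


def resume_offset(tsc_at_suspend):
    """Offset to add to the host clock on resume.

    Stands in for "save the guest's view of time when suspending and add
    an offset to get_kernel_ns() when resuming".
    """
    guest_view = raw_clock(tsc_at_suspend)   # saved at suspend
    host_view = ntp_clock(tsc_at_suspend)    # what the host would report
    # Only compensate when the host is behind the guest.
    return max(0, guest_view - host_view)


suspend_tsc = 5000                           # t4 in table (1)
naive_after_resume = ntp_clock(suspend_tsc)  # switch with no compensation
offset = resume_offset(suspend_tsc)
compensated = ntp_clock(suspend_tsc) + offset
later = ntp_clock(6000) + offset             # a read some time after resume

print(naive_after_resume, offset, compensated, later)
```

Without the offset the guest would go from 50.00 to 49.96 at the switch; with
it, the first read after resume equals the saved guest view and later reads
keep increasing.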