Hi David,

On 10/1/23 10:54, David Woodhouse wrote:
> From: David Woodhouse <dwmw@xxxxxxxxxxxx>
>
> When populating the guest's PV wall clock information, KVM currently does
> a simple 'kvm_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern
> which should be avoided; when working with the relationship between two
> clocks, it's never correct to obtain one of them "now" and then the other
> at a slightly different "now" after an unspecified period of preemption
> (which might not even be under the control of the kernel, if this is an
> L1 hosting an L2 guest under nested virtualization).
>
> Add a kvm_get_wall_clock_epoch() function to return the guest wall clock
> epoch in nanoseconds using the same method as __get_kvmclock() — by using
> kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM
> clock time from a *single* TSC reading.
>
> The condition using get_cpu_tsc_khz() is equivalent to the version in
> __get_kvmclock() which separately checks for the CONSTANT_TSC feature or
> the per-CPU cpu_tsc_khz. Which is what get_cpu_tsc_khz() does anyway.
>
> Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> ---
> Tested by sticking a printk into it and comparing the values calculated
> by the old and new methods, while running the xen_shinfo_test which
> keeps relocating the shared info and thus rewriting the PV wall clock
> information to it.
>
> They look sane enough but there's still skew over time (in both of
> them) as the KVM values get adjusted in presumably similarly sloppy
> ways. But we'll get to this. This patch is just the first low hanging

About the "skew over time": do you mean that the results of
kvm_get_wall_clock_epoch() keep changing over time? Although I have not
tested it, I suspect there are two reasons:

1. Would you mind explaining what "as the KVM values get adjusted"
means? The kvm_get_walltime_and_clockread() call is based on the host
monotonic clock, which may be adjusted (unlike the raw monotonic clock).

2. The host monotonic clock and the kvmclock may use different
mult/shift values.

The equation is A-B. A is the current host wall clock, while B is how
long the VM has been running. A-B is the wall clock at the time the VM
booted.

A: ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec        --> monotonic clock
B: __pvclock_read_cycles(&hv_clock, host_tsc);  --> raw monotonic and kvmclock

A comes from kvm_get_walltime_and_clockread(), which returns a pair of
ns and tsc. It is based on the monotonic clock, i.e., gtod->clock.shift
and gtod->clock.mult. BTW, the master clock is derived from the raw
monotonic clock, which uses gtod->raw_clock.shift and
gtod->raw_clock.mult. However, the increment between host_tsc and the
master clock is scaled with the mult/shift from the kvmclock (that is,
from kvm_get_time_scale()).

Ideally, we would expect A and B to advance at the same rate. Because
they may use different mult/shift values and conversion equations, A
and B may advance at different rates.
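To illustrate the 2nd reason, here is a minimal user-space sketch (my
own illustration, not code from the patch; the two mult/shift pairs are
made-up values approximating a 3GHz TSC) of how the same TSC delta
scaled by slightly different mult/shift pairs drifts apart:

#include <stdint.h>
#include <stdio.h>

/* Same scaling scheme as pvclock/timekeeping: ns = (delta * mult) >> shift */
static uint64_t scale_delta(uint64_t delta, uint32_t mult, int shift)
{
        return (uint64_t)(((__uint128_t)delta * mult) >> shift);
}

int main(void)
{
        uint64_t tsc_delta = 3000000000ULL;  /* ~1 second at 3GHz */

        /* made-up mult/shift standing in for gtod->clock (host monotonic) */
        uint64_t a = scale_delta(tsc_delta, 5592405, 24);
        /* made-up mult/shift standing in for kvm_get_time_scale() (kvmclock) */
        uint64_t b = scale_delta(tsc_delta, 2796203, 23);

        printf("A: %llu ns, B: %llu ns, skew: %lld ns\n",
               (unsigned long long)a, (unsigned long long)b,
               (long long)(a - b));
        return 0;
}

Both pairs approximate 1/3 ns per TSC cycle, but they round differently,
so the two "clocks" already disagree by roughly 179 ns for every second
of TSC delta. I would expect that kind of residual difference to show up
as the slow skew you observed.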
About the 2nd reason, I have a patch in progress to refresh the master
clock periodically, to address the clock drift during CPU hotplug:

https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@xxxxxxxxxx/

Please correct me if the above understanding is wrong.

Thank you very much!

Dongli Zhang

> fruit in a quest to eliminate such sloppiness and get to the point
> where we can do live update (pause guests, kexec and resume them again)
> with a *single* cycle of skew. After all, the host TSC is still just as
> trustworthy so there's no excuse for *anything* changing over the
> kexec. Every other clock in the guest should have precisely the *same*
> relationship to the host TSC as it did before the kexec.
>
>  arch/x86/kvm/x86.c | 60 ++++++++++++++++++++++++++++++++++++++++------
>  arch/x86/kvm/x86.h |  2 ++
>  arch/x86/kvm/xen.c |  4 ++--
>  3 files changed, 57 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 04b57a336b34..625ec4d9281b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2317,14 +2317,9 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_o
>          if (kvm_write_guest(kvm, wall_clock, &version, sizeof(version)))
>                  return;
>
> -        /*
> -         * The guest calculates current wall clock time by adding
> -         * system time (updated by kvm_guest_time_update below) to the
> -         * wall clock specified here. We do the reverse here.
> -         */
> -        wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
> +        wall_nsec = kvm_get_wall_clock_epoch(kvm);
>
> -        wc.nsec = do_div(wall_nsec, 1000000000);
> +        wc.nsec = do_div(wall_nsec, NSEC_PER_SEC);
>          wc.sec = (u32)wall_nsec; /* overflow in 2106 guest time */
>          wc.version = version;
>
> @@ -3229,6 +3224,57 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>          return 0;
>  }
>
> +uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
> +{
> +        /*
> +         * The guest calculates current wall clock time by adding
> +         * system time (updated by kvm_guest_time_update below) to the
> +         * wall clock specified here. We do the reverse here.
> +         */
> +#ifdef CONFIG_X86_64
> +        struct pvclock_vcpu_time_info hv_clock;
> +        struct kvm_arch *ka = &kvm->arch;
> +        unsigned long seq, local_tsc_khz = 0;
> +        struct timespec64 ts;
> +        uint64_t host_tsc;
> +
> +        do {
> +                seq = read_seqcount_begin(&ka->pvclock_sc);
> +
> +                if (!ka->use_master_clock)
> +                        break;
> +
> +                /* It all has to happen on the same CPU */
> +                get_cpu();
> +
> +                local_tsc_khz = get_cpu_tsc_khz();
> +
> +                if (local_tsc_khz &&
> +                    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
> +                        local_tsc_khz = 0; /* Fall back to old method */
> +
> +                hv_clock.tsc_timestamp = ka->master_cycle_now;
> +                hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> +
> +                put_cpu();
> +        } while (read_seqcount_retry(&ka->pvclock_sc, seq));
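One side note while reading the locking: the loop above is the standard
seqcount reader pattern. Here is a stripped-down skeleton (my sketch,
with a hypothetical snapshot structure, not code from this patch) of
what it guarantees:

#include <linux/seqlock.h>
#include <linux/types.h>

struct clock_snapshot {
        u64 cycle_now;
        u64 kernel_ns;
};

/*
 * Copy a consistent snapshot: if a writer updates *src while we are
 * inside the read section, read_seqcount_retry() sees the sequence
 * count change and we redo the whole copy.
 */
static void read_clock_snapshot(seqcount_t *sc,
                                const struct clock_snapshot *src,
                                struct clock_snapshot *out)
{
        unsigned int seq;

        do {
                seq = read_seqcount_begin(sc);
                *out = *src;
        } while (read_seqcount_retry(sc, seq));
}

In the patch itself, a failed kvm_get_walltime_and_clockread() just
clears local_tsc_khz, so the code below falls back to the old two-read
method after the loop.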
> +
> +        /*
> +         * If the conditions were right, and obtaining the wallclock+TSC was
> +         * successful, calculate the KVM clock at the corresponding time and
> +         * subtract one from the other to get the epoch in nanoseconds.
> +         */
> +        if (local_tsc_khz) {
> +                kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * 1000LL,
> +                                   &hv_clock.tsc_shift,
> +                                   &hv_clock.tsc_to_system_mul);
> +                return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
> +                        __pvclock_read_cycles(&hv_clock, host_tsc);
> +        }
> +#endif
> +        return ktime_get_real_ns() - get_kvmclock_ns(kvm);
> +}
> +
>  /*
>   * kvmclock updates which are isolated to a given vcpu, such as
>   * vcpu->cpu migration, should not allow system_timestamp from
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index c544602d07a3..b21743526011 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -290,6 +290,8 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
>          return !(kvm->arch.disabled_quirks & quirk);
>  }
>
> +uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm);
> +
>  void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
>
>  u64 get_kvmclock_ns(struct kvm *kvm);
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 75586da134b3..6bab715be428 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -59,7 +59,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>           * This code mirrors kvm_write_wall_clock() except that it writes
>           * directly through the pfn cache and doesn't mark the page dirty.
>           */
> -        wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm);
> +        wall_nsec = kvm_get_wall_clock_epoch(kvm);
>
>          /* It could be invalid again already, so we need to check */
>          read_lock_irq(&gpc->lock);
> @@ -98,7 +98,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>          wc_version = wc->version = (wc->version + 1) | 1;
>          smp_wmb();
>
> -        wc->nsec = do_div(wall_nsec, 1000000000);
> +        wc->nsec = do_div(wall_nsec, NSEC_PER_SEC);
>          wc->sec = (u32)wall_nsec;
>          *wc_sec_hi = wall_nsec >> 32;
>          smp_wmb();
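One last nit-level note, mostly to confirm my own understanding: the
sec/sec_hi/nsec split above is easy to sanity-check in user space. A
quick sketch (my own, with an arbitrary epoch value; the modulo and
division model what do_div() returns and leaves behind):

#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

int main(void)
{
        uint64_t wall_nsec = 1696154040123456789ULL; /* arbitrary epoch in ns */

        /* do_div(wall_nsec, NSEC_PER_SEC) returns the remainder and
         * leaves the quotient in wall_nsec */
        uint32_t nsec = wall_nsec % NSEC_PER_SEC;
        wall_nsec /= NSEC_PER_SEC;

        uint32_t sec = (uint32_t)wall_nsec;  /* low 32 bits, wraps in 2106 */
        uint32_t sec_hi = wall_nsec >> 32;   /* high bits, via *wc_sec_hi */

        printf("sec=%u sec_hi=%u nsec=%u\n", sec, sec_hi, nsec);
        return 0;
}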