Hi Sean,

A quick question ...

On 10/2/23 17:53, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
>> On Mon, 2023-10-02 at 09:37 -0700, Sean Christopherson wrote:
>>> On Mon, Oct 02, 2023, David Woodhouse wrote:
>>>> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
>>>>>
>>>>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
>>>>>
>>>>> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
>>>>>
>>>>
>>>> That just seems wrong. I don't mean that you're incorrect; it seems
>>>> *morally* wrong.
>>>>
>>>> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
>>>> a *different* mult/shift/equation (your #1) to convert TSC ticks to
>>>> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
>>>>
>>>> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
>>>> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
>>>>
>>>> Fix that, and the whole problem goes away, doesn't it?
>>>>
>>>> What am I missing here, that means we can't do that?
>>>
>>> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
>>> ABI between KVM and KVM guests.
>>>
>>> Like many of the older bits of KVM, my guess is that KVM's behavior is the product
>>> of making things kinda sorta work with old hardware, i.e. was probably the least
>>> awful solution in the days before constant TSCs, but is completely nonsensical on
>>> modern hardware.
>>
>> I still don't understand. The ABI and its math are fine. The ABI is just
>> "at time X the TSC was Y, and the TSC frequency is Z"
>>
>> I understand why on older hardware, those values needed to *change*
>> occasionally when TSC stupidity happened.
>>
>> But on newer hardware, surely we can set them precisely *once* when the
>> VM starts, and never ever have to change them again? Theoretically not
>> even when we pause the VM, kexec into a new kernel, and resume the VM!
>>
>> But we *are* having to change it, because apparently
>> CLOCK_MONOTONIC_RAW is doing something *other* than incrementing at
>> precisely the frequency of the known and constant TSC.
>>
>> But *why* is CLOCK_MONOTONIC_RAW doing that? I thought that the whole
>> point of CLOCK_MONOTONIC_RAW was to be consistent and not adjusted by
>> NTP etc.? Shouldn't it run at precisely the same rate as the kvmclock,
>> with no skew at all?
>
> IIUC, the issue is that the paravirt clock ends up mixing time domains, i.e. is
> a weird bastardization of the host's monotonic raw clock and the paravirt clock.
>
> Despite a name that suggests otherwise (to me at least), __pvclock_read_cycles()
> counts "cycles" in nanoseconds, not TSC ticks.
>
>   u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc)
>   {
>           u64 delta = tsc - src->tsc_timestamp;
>           u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
>                                            src->tsc_shift);
>           return src->system_time + offset;
>   }
>
> In the above, "offset" is computed in the kvmclock domain, whereas system_time
> comes from the host's CLOCK_MONOTONIC_RAW domain by way of master_kernel_ns.
> The goofy math is much more obvious in __get_kvmclock(), i.e. KVM's host-side
> retrieval of the guest's view of kvmclock:
>
>         hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>
> The two domains use the same "clock" (constant TSC), but different math to compute
> nanoseconds from a given TSC value.
> For decently large TSC values, this results
> in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
>
> When KVM decides to refresh the masterclock, e.g. vCPU hotplug in Dongli's case,
> it resets the base time, a.k.a. system_time. I.e. KVM takes all of the time that
> was accumulated in the kvmclock domain and recomputes it in the CLOCK_MONOTONIC_RAW
> domain. The more time that passes between refreshes, the bigger the time jump
> from the guest's perspective.
>
> E.g. IIUC, your proposed patch to use a single RDTSC[*] eliminates the drift by
> undoing the CLOCK_MONOTONIC_RAW crap using the same TSC value on both the "add"
> and the "subtract", but the underlying train wreck of mixing time domains is
> still there.
>
> Without a constant TSC, deferring the reference time to the host's computation
> makes sense (or at least, is less silly), because the effective TSC would be per
> physical CPU, whereas the reference time is per VM.
>
> [*] https://lore.kernel.org/all/ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org
>
>> And if CLOCK_MONOTONIC_RAW is not what I thought it was... do we really
>> have to keep resetting the kvmclock to it at all? On modern hardware
>> can't the kvmclock be defined by the TSC alone?
>
> I think there is still use for synchronizing with the host's view of time, e.g.
> to deal with lost time across host suspend+resume.
>
> So I don't think we can completely sever KVM's paravirt clocks from host time,
> at least not without harming use cases that rely on the host's view to keep
> accurate time. And honestly at that point, the right answer would be to stop
> advertising paravirt clocks entirely.
>
> But I do think we can address the issues that Dongli and David are observing
> where guest time drifts even though the host kernel's base time hasn't changed.
> If I've pieced everything together correctly, the drift can be eliminated simply
> by using the paravirt clock algorithm when converting the delta from the raw TSC
> to nanoseconds.
>
> This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> about all I've tested.
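
To convince myself of how large the mixed-domain error can get, I put together
the toy userspace program below. To be clear about the assumptions: this is not
kernel code, pvclock_ns() only mimics pvclock_scale_delta(), clocksource_ns()
only mimics the timekeeper's mult/shift conversion, and the shift values (-1
and 24) plus the 2999976000 Hz TSC frequency are illustrative guesses at the
kind of constants kvm_get_time_scale() and clocks_calc_mult_shift() produce,
not values dumped from a real host.

/*
 * drift_demo.c -- toy userspace program, NOT kernel code.
 *
 * Converts the same TSC delta to nanoseconds twice: once with pvclock-style
 * math (shift the delta, then a 32.32 fixed-point multiply) and once with
 * clocksource-style math (delta * mult >> shift), using independently rounded
 * constants, then prints how far apart the two results end up.
 */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Mimics pvclock_scale_delta(): shift the delta, then multiply by mul/2^32. */
static uint64_t pvclock_ns(uint64_t delta, uint32_t mul, int shift)
{
        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;
        return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Mimics the timekeeper's conversion: ns = (delta * mult) >> shift. */
static uint64_t clocksource_ns(uint64_t delta, uint32_t mult, unsigned int shift)
{
        return (uint64_t)(((unsigned __int128)delta * mult) >> shift);
}

int main(void)
{
        /* Assumed constant TSC frequency; deliberately not a round number. */
        const uint64_t tsc_hz = 2999976000ULL;

        /*
         * kvmclock-style constants: for a 2-4 GHz TSC, kvm_get_time_scale()
         * ends up with shift = -1 and a truncated
         * mul = 2^32 * NSEC_PER_SEC / (tsc_hz / 2).
         * (My reading of the function; treat it as an assumption.)
         */
        const int      kvm_shift = -1;
        const uint32_t kvm_mul   =
                (uint32_t)(((unsigned __int128)NSEC_PER_SEC << 32) / (tsc_hz >> 1));

        /*
         * Clocksource-style constants: shift = 24 is only illustrative of what
         * clocks_calc_mult_shift() might pick; mult is rounded to nearest.
         */
        const unsigned int cs_shift = 24;
        const uint32_t     cs_mult  =
                (uint32_t)((((unsigned __int128)NSEC_PER_SEC << cs_shift) +
                            tsc_hz / 2) / tsc_hz);

        /* One day's worth of TSC ticks. */
        const uint64_t delta = tsc_hz * 86400ULL;

        uint64_t kvm_time = pvclock_ns(delta, kvm_mul, kvm_shift);
        uint64_t raw_time = clocksource_ns(delta, cs_mult, cs_shift);

        printf("pvclock-style     : %llu ns\n", (unsigned long long)kvm_time);
        printf("clocksource-style : %llu ns\n", (unsigned long long)raw_time);
        printf("divergence        : %lld ns over 24h\n",
               (long long)(kvm_time - raw_time));
        return 0;
}

Since both mult/shift pairs are independently rounded approximations of the
same NSEC_PER_SEC / tsc_hz ratio, whatever disagreement exists between them
grows linearly with the delta, which matches the "bigger jump after a longer
gap between refreshes" behavior described above.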
>
> ---
>  arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++++---------------
>  1 file changed, 42 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..3ba7edfca47c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2417,6 +2417,9 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
>  static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
>  #endif
>  
> +static u32 host_tsc_to_system_mul;
> +static s8 host_tsc_shift;
> +
>  static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
>  static unsigned long max_tsc_khz;
>  
> @@ -2812,27 +2815,18 @@ static u64 read_tsc(void)
>  static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
>                            int *mode)
>  {
> -        u64 tsc_pg_val;
> -        long v;
> +        u64 ns, v;
>  
>          switch (clock->vclock_mode) {
>          case VDSO_CLOCKMODE_HVCLOCK:
> -                if (hv_read_tsc_page_tsc(hv_get_tsc_page(),
> -                                         tsc_timestamp, &tsc_pg_val)) {
> -                        /* TSC page valid */
> +                if (hv_read_tsc_page_tsc(hv_get_tsc_page(), tsc_timestamp, &v))
>                          *mode = VDSO_CLOCKMODE_HVCLOCK;
> -                        v = (tsc_pg_val - clock->cycle_last) &
> -                                clock->mask;
> -                } else {
> -                        /* TSC page invalid */
> +                else
>                          *mode = VDSO_CLOCKMODE_NONE;
> -                }
>                  break;
>          case VDSO_CLOCKMODE_TSC:
>                  *mode = VDSO_CLOCKMODE_TSC;
> -                *tsc_timestamp = read_tsc();
> -                v = (*tsc_timestamp - clock->cycle_last) &
> -                        clock->mask;
> +                v = *tsc_timestamp = read_tsc();
>                  break;
>          default:
>                  *mode = VDSO_CLOCKMODE_NONE;
> @@ -2840,8 +2834,36 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp, int *mode)
>  
>          if (*mode == VDSO_CLOCKMODE_NONE)
>                  *tsc_timestamp = v = 0;
> +        else
> +                v = (v - clock->cycle_last) & clock->mask;
>  
> -        return v * clock->mult;
> +        ns = clock->base_cycles;
> +
> +        /*
> +         * When the clock source is a raw, constant TSC, do the conversion to
> +         * nanoseconds using the paravirt clock math so that the delta in ns is
> +         * consistent regardless of whether the delta is converted in the host
> +         * or the guest.
> +         *
> +         * The base for paravirt clocks is the kernel's base time in ns, plus
> +         * the delta since the last sync.  E.g. if a masterclock update occurs,
> +         * KVM will shift some amount of delta from the guest to the host.
> +         * Conversions from TSC to ns for the host's CLOCK_MONOTONIC_RAW and
> +         * paravirt clocks aren't equivalent, and so shifting the delta can
> +         * cause time to jump from the guest's view of the paravirt clock.
> +         * This only works for a constant TSC, otherwise the calculation would
> +         * only be valid for the current physical CPU, whereas the base of the
> +         * clock must be valid for all vCPUs in the VM.
> +         */
> +        if (static_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +            *mode == VDSO_CLOCKMODE_TSC && clock == &pvclock_gtod_data.raw_clock) {
> +                ns >>= clock->shift;
> +                ns += pvclock_scale_delta(v, host_tsc_to_system_mul, host_tsc_shift);
> +        } else {
> +                ns += v * clock->mult;
> +                ns >>= clock->shift;
> +        }
> +        return ns;
>  }
>  
>  static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
> @@ -2853,9 +2875,7 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
>  
>          do {
>                  seq = read_seqcount_begin(&gtod->seq);
> -                ns = gtod->raw_clock.base_cycles;
> -                ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
> -                ns >>= gtod->raw_clock.shift;
> +                ns = vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
>                  ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot));
>          } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>          *t = ns;
> @@ -2873,9 +2893,7 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
>          do {
>                  seq = read_seqcount_begin(&gtod->seq);
>                  ts->tv_sec = gtod->wall_time_sec;
> -                ns = gtod->clock.base_cycles;
> -                ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> -                ns >>= gtod->clock.shift;
> +                ns = vgettsc(&gtod->clock, tsc_timestamp, &mode);
>          } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  
>          ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
> @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
>          if (ret != 0)
>                  return ret;
>  
> +        if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +                kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
> +                                   &host_tsc_shift, &host_tsc_to_system_mul);

I agree with using the kvmclock math to calculate the ns elapsed when updating
the master clock.

Would you take TSC scaling into consideration? While host_tsc_shift and
host_tsc_to_system_mul are pre-computed from the host's TSC frequency, what
about a VM that uses a different TSC frequency?

Thank you very much!

Dongli Zhang

> +
>          local_tsc = rdtsc();
>          stable = !kvm_check_tsc_unstable();
>          list_for_each_entry(kvm, &vm_list, vm_list) {
>
> base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
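
P.S. To make the scaling question concrete, here is a toy userspace sketch of
the kind of residual mismatch I am worried about. Everything in it is an
assumption for illustration: the host and guest frequencies are made up, the
helpers only mimic pvclock_scale_delta() and kvm_get_time_scale() for the
shift = -1 case, and the hardware scaling of the delta is approximated with a
plain multiply/divide.

/*
 * scaling_demo.c -- toy userspace program, NOT kernel code.
 *
 * Compares the same elapsed time as computed by pvclock-style math driven by
 * the host TSC frequency (what the proposed host_tsc_to_system_mul and
 * host_tsc_shift would encode) versus pvclock-style math driven by a scaled
 * guest TSC frequency (what kvm_guest_time_update() programs into the guest's
 * hv_clock, if I read it right), to show the two roundings do not cancel
 * exactly.
 */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Mimics pvclock_scale_delta() for the shift = -1 case (2-4 GHz clocks). */
static uint64_t pvclock_ns(uint64_t delta, uint32_t mul)
{
        return (uint64_t)(((unsigned __int128)(delta >> 1) * mul) >> 32);
}

/* Mimics the mul that kvm_get_time_scale() derives for a 2-4 GHz clock. */
static uint32_t time_scale_mul(uint64_t hz)
{
        return (uint32_t)(((unsigned __int128)NSEC_PER_SEC << 32) / (hz >> 1));
}

int main(void)
{
        const uint64_t host_hz  = 2999976000ULL;    /* assumed host TSC */
        const uint64_t guest_hz = 2500000000ULL;    /* assumed scaled guest TSC */

        /* One hour of host TSC ticks, and the guest's hardware-scaled view of it. */
        const uint64_t host_delta  = host_hz * 3600ULL;
        const uint64_t guest_delta =
                (uint64_t)(((unsigned __int128)host_delta * guest_hz) / host_hz);

        uint64_t host_ns  = pvclock_ns(host_delta,  time_scale_mul(host_hz));
        uint64_t guest_ns = pvclock_ns(guest_delta, time_scale_mul(guest_hz));

        printf("host-frequency math : %llu ns\n", (unsigned long long)host_ns);
        printf("guest-frequency math: %llu ns\n", (unsigned long long)guest_ns);
        printf("residual mismatch   : %lld ns over 1h\n",
               (long long)(host_ns - guest_ns));
        return 0;
}

With these toy numbers the residual is much smaller than the mixed-domain drift
above, and it may well be negligible in practice; I only want to confirm
whether the pre-computed host constants are intended to cover guests with
scaled TSC frequencies as well.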