On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
> On Tue, Oct 03, 2023, David Woodhouse wrote:
> > On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> > >
> > > The two domains use the same "clock" (constant TSC), but different math to compute
> > > nanoseconds from a given TSC value. For decently large TSC values, this results
> > > in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
> >
> > This is the bit I'm still confused about, and it seems to be the root
> > of all the other problems.
> >
> > Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a
> > number of ticks of the TSC running at a constant known frequency, to a
> > number of nanoseconds.
> >
> > So how in the name of all that is holy do they manage to get
> > *different* answers?
> >
> > I get that the mult/shift thing carries some imprecision, but is that
> > all it is?
>
> Yep, pretty sure that's it. It's like the plot from Office Space / Superman III.
> Those little rounding errors add up over time.
>
> PV clock:
>
>   nanoseconds = ((TSC >> shift) * mult) >> 32
>
> or
>
>   nanoseconds = ((TSC << shift) * mult) >> 32
>
> versus timekeeping (mostly)
>
>   nanoseconds = (TSC * mult) >> shift
>
> The more I look at the PV clock stuff, the more I agree with Peter: it's garbage.
> Shifting before multiplying is guaranteed to introduce error. Shifting right drops
> data, and shifting left introduces zeros.
>
> > Can't we ensure that the kvmclock uses the *same* algorithm,
> > precisely, as CLOCK_MONOTONIC_RAW?
>
> Yes? At least for sane hardware, after much staring, I think it's possible.
>
> It's tricky because the two algorithms are weirdly different, the PV clock algorithm
> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
> at us for suggesting that we try to shove the pvclock algorithm into the kernel.
>
> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
>
> Compile tested only, but I believe this math is correct. And I'm guessing we'd
> want some safeguards against overflow, e.g. due to a multiplier that is too big.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..ae9275c3d580 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>                                             v->arch.l1_tsc_scaling_ratio);
>
>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> -               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> -                                  &vcpu->hv_clock.tsc_shift,
> -                                  &vcpu->hv_clock.tsc_to_system_mul);
> +               u32 shift, mult;
> +
> +               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
> +
> +               if (shift <= 32) {
> +                       vcpu->hv_clock.tsc_shift = 0;
> +                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
> +               } else {
> +                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> +                                          &vcpu->hv_clock.tsc_shift,
> +                                          &vcpu->hv_clock.tsc_to_system_mul);
> +               }
> +
>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
>                 kvm_xen_update_tsc_info(v);
>         }
>

I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
it got mult=1655736523, shift=32 and took the 'happy' path instead of
falling back.

It still drifts about the same though, using the same test as before:
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock

I was going to facetiously suggest that perhaps the kvmclock should
have leap nanoseconds... but then realised that that's basically what
Dongli's patch is *doing*. Maybe we just need to *recognise* that, so
rather than having a user-configured period for the update, KVM could
calculate the frequency for the updates based on the rate at which the
clocks would otherwise drift, and a maximum delta? Not my favourite
option, but perhaps better than nothing?
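
For anyone following along, here is a minimal userspace sketch of the two
conversions quoted above, to make the rounding difference concrete. The
helper names pvclock_scale() and timekeeping_scale() are made up for this
illustration and are not the kernel's implementations; the sketch also
assumes a compiler with unsigned __int128 (GCC or Clang) so that the
multiply cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * PV clock style: shift the TSC value first, multiply, then drop the low
 * 32 bits. A negative shift means "shift right", as with tsc_shift.
 * Hypothetical helper for illustration, not the kernel's implementation.
 */
static uint64_t pvclock_scale(uint64_t tsc, uint32_t mul, int shift)
{
	assert(shift > -64 && shift < 64);
	if (shift < 0)
		tsc >>= -shift;		/* low bits are lost before the multiply */
	else
		tsc <<= shift;		/* zeros are shifted in at the bottom */
	return (uint64_t)(((unsigned __int128)tsc * mul) >> 32);
}

/*
 * Timekeeping style: multiply at full precision, shift once at the end.
 * Also a hypothetical helper; the 128-bit product is an assumption made
 * here to keep the example free of overflow.
 */
static uint64_t timekeeping_scale(uint64_t tsc, uint32_t mult, uint32_t shift)
{
	assert(shift <= 64);
	return (uint64_t)(((unsigned __int128)tsc * mult) >> shift);
}
```

With a toy scale factor of 3/8 ns per tick, pvclock_scale(3, 0xC0000000, -1)
returns 0 because the right shift discards the low bit before the multiply,
while timekeeping_scale(3, 3, 3) returns 1. With tsc_shift = 0 and the
multiplier set to mult << (32 - shift), as in the patch's 'happy' path, the
two helpers return identical results for the same mult/shift pair. That only
shows the formulas agree when fed the same pair; it says nothing about which
pair the host timekeeping code actually ends up using.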