Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Wed, 04 Oct 2023 11:01:12 +0100

On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
> On Tue, Oct 03, 2023, David Woodhouse wrote:
> > On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> > > 
> > > The two domains use the same "clock" (constant TSC), but different math to compute
> > > nanoseconds from a given TSC value.  For decently large TSC values, this results
> > > in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
> > 
> > This is the bit I'm still confused about, and it seems to be the root
> > of all the other problems.
> > 
> > Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a
> > number of ticks of the TSC running at a constant known frequency, to a
> > number of nanoseconds.
> > 
> > So how in the name of all that is holy do they manage to get
> > *different* answers?
> > 
> > I get that the mult/shift thing carries some imprecision, but is that
> > all it is? 
> 
> Yep, pretty sure that's it.  It's like the plot from Office Space / Superman III.
> Those little rounding errors add up over time.
> 
> PV clock:
> 
>   nanoseconds = ((TSC >> shift) * mult) >> 32
> 
> or 
> 
>   nanoseconds = ((TSC << shift) * mult) >> 32
> 
> versus timekeeping (mostly)
> 
>   nanoseconds = (TSC * mult) >> shift
> 
> The more I look at the PV clock stuff, the more I agree with Peter: it's garbage.
> Shifting before multiplying is guaranteed to introduce error.  Shifting right drops
> data, and shifting left introduces zeros.
> 
> > Can't we ensure that the kvmclock uses the *same* algorithm,
> > precisely, as CLOCK_MONOTONIC_RAW?
> 
> Yes?  At least for sane hardware, after much staring, I think it's possible.
> 
> It's tricky because the two algorithms are weirdly different, the PV clock algorithm
> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
> at us for suggesting that we try to shove the pvclock algorithm into the kernel.
> 
> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
> 
> Compile tested only, but I believe this math is correct.  And I'm guessing we'd
> want some safeguards against overflow, e.g. due to a multiplier that is too big.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..ae9275c3d580 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>                                             v->arch.l1_tsc_scaling_ratio);
>  
>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> -               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> -                                  &vcpu->hv_clock.tsc_shift,
> -                                  &vcpu->hv_clock.tsc_to_system_mul);
> +               u32 shift, mult;
> +
> +               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
> +
> +               if (shift <= 32) {
> +                       vcpu->hv_clock.tsc_shift = 0;
> +                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
> +               } else {
> +                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> +                                          &vcpu->hv_clock.tsc_shift,
> +                                          &vcpu->hv_clock.tsc_to_system_mul);
> +               }
> +
>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
>                 kvm_xen_update_tsc_info(v);
>         }
> 

I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
it got mult=1655736523, shift=32 and took the 'happy' path instead of
falling back.
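
As a sanity check on those numbers, here is a userspace sketch (not kernel
code) of the rounding division that clocks_calc_mult_shift() performs once a
shift has been picked. It assumes the shift is simply fixed at 32 rather than
searched for. The helper names calc_mult_for_shift(), timekeeping_ns() and
pvclock_ns() are made up for illustration, and unsigned __int128 is a
GCC/Clang extension standing in for the kernel's wide-multiply helpers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * mult = round((to << shift) / from), for a caller-chosen shift.
 * Simplified: the caller must ensure (to << shift) fits in 64 bits and
 * that the result fits in 32 bits.
 */
static uint32_t calc_mult_for_shift(uint32_t from, uint32_t to, uint32_t shift)
{
	uint64_t tmp;

	assert(from != 0 && shift <= 32);
	tmp = (uint64_t)to << shift;
	tmp += from / 2;		/* round to nearest */
	return (uint32_t)(tmp / from);
}

/* Timekeeping-style conversion: ns = (tsc * mult) >> shift */
static uint64_t timekeeping_ns(uint64_t tsc, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);
	return (uint64_t)(((unsigned __int128)tsc * mult) >> shift);
}

/* pvclock-style conversion: shift the TSC first, then (tsc * mul) >> 32 */
static uint64_t pvclock_ns(uint64_t tsc, uint32_t mul, int8_t tsc_shift)
{
	if (tsc_shift < 0)
		tsc >>= -tsc_shift;	/* drops low bits */
	else
		tsc <<= tsc_shift;	/* shifts in zeros */
	return (uint64_t)(((unsigned __int128)tsc * mul) >> 32);
}
```

With from = 2593992 and to = 1000000 (NSEC_PER_MSEC), calc_mult_for_shift()
at shift 32 returns 1655736523, matching the value reported above. With
tsc_shift = 0 and the same multiplier, pvclock_ns() and timekeeping_ns() at
shift 32 are the same expression, so the 'happy' path cannot disagree with a
timekeeping conversion that uses exactly that mult/shift pair.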

It still drifts about the same though, using the same test as before:
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock

I was going to facetiously suggest that perhaps the kvmclock should
have leap nanoseconds... but then realised that that's basically what
Dongli's patch is *doing*. Maybe we just need to *recognise* that, so
rather than having a user-configured period for the update, KVM could
calculate the frequency for the updates based on the rate at which the
clocks would otherwise drift, and a maximum delta? Not my favourite
option, but perhaps better than nothing? 
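
To make that last idea concrete, here is a rough userspace sketch of deriving
an update period from the drift rate and a maximum tolerated delta. Everything
in it is hypothetical: struct tsc_scale, scale_ns(), drift_ns() and
update_period_ticks() are illustrative names, not existing KVM or timekeeping
interfaces, and unsigned __int128 is a GCC/Clang extension. It measures how
far two mult/shift conversions diverge over a window of TSC ticks, then scales
that window so the divergence stays within the chosen bound.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for one (mult, shift) conversion. */
struct tsc_scale {
	uint32_t mult;
	uint32_t shift;
};

/* ns = (ticks * mult) >> shift, using a 128-bit intermediate product. */
static uint64_t scale_ns(uint64_t ticks, struct tsc_scale s)
{
	assert(s.shift < 64);
	return (uint64_t)(((unsigned __int128)ticks * s.mult) >> s.shift);
}

/* Absolute divergence between two conversions over window_ticks. */
static uint64_t drift_ns(uint64_t window_ticks, struct tsc_scale a,
			 struct tsc_scale b)
{
	uint64_t na = scale_ns(window_ticks, a);
	uint64_t nb = scale_ns(window_ticks, b);

	return na > nb ? na - nb : nb - na;
}

/*
 * Number of ticks between updates so that the accumulated drift stays
 * at or below max_delta_ns, assuming drift grows linearly with ticks.
 * Returns UINT64_MAX when no drift was measured (no update needed).
 */
static uint64_t update_period_ticks(uint64_t window_ticks, uint64_t drift,
				    uint64_t max_delta_ns)
{
	if (drift == 0)
		return UINT64_MAX;
	return (uint64_t)(((unsigned __int128)window_ticks * max_delta_ns) / drift);
}
```

For example, two conversions at shift 32 whose multipliers differ by one
diverge by 1000 ns over 1000 << 32 ticks, so a 100 ns bound would call for an
update every 429496729600 ticks. Real drift also depends on how each side
anchors its base time, which this sketch ignores.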